Use of subword tokenization for domain generation algorithm classification

Cybersecurity

Table 1 A summary of recent ML methods for DGA detection and classification

Detection or classification problem	Features	ML models	Dataset	Results
Detection (Almashhadani et al. 2020)	Lexical features	DT, SVM, kNN	85,000 benign; 85,000 DGA from 20 classes	F1 scores: 0.9437 (DT), 0.9411 (SVM), 0.9443 (kNN)
Detection (Wang et al. 2022)	Distance-based features (KL distance, edit distance, Jaccard index)	SVM, NN	10,000 benign; 10,000 DGA from 12 classes	Accuracy close to 1
Classification (Vranken and Alizadeh 2022)	TF-IDF of the n-grams in domain names	SVM, MLP, RF, DT, kNN	583,9543 benign; 492,800 DGA from 57 classes	F1 scores: 0.7573 (SVM), 0.7759 (MLP), 0.6284 (RF), 0.6443 (DT)
Detection and classification (Zago et al. 2020a, b)	Lexical features	AdaBoost, NN, RF, SVM, DT, kNN	10,000 benign; 50 DGA classes with 10,000 samples each	F1 scores: detection 0.556–0.989; classification 0.297–0.769
Detection and classification (Cucchiarelli et al. 2021)	n-gram features	MLP	10,000 benign; 50 DGA classes with 10,000 samples each	F1 scores: detection 0.964; classification 0.823
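
To make two of the feature families in Table 1 concrete, the following is a minimal Python sketch of character n-gram extraction and of the Jaccard index computed over n-gram sets. It is an illustration only and does not reproduce the implementation of any cited study: the function names (char_ngrams, ngram_counts, jaccard_index), the default n = 2, the lowercasing step, and the convention of returning 1.0 when both n-gram sets are empty are all assumptions made here.

```python
from collections import Counter


def char_ngrams(domain: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of a domain name.

    Assumption: the name is lowercased and used as-is (dots included).
    """
    label = domain.lower()
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def ngram_counts(domain: str, n: int = 2) -> Counter:
    """Count n-gram occurrences; such counts are the raw input to a
    TF-IDF weighting step."""
    return Counter(char_ngrams(domain, n))


def jaccard_index(a: str, b: str, n: int = 2) -> float:
    """Jaccard index |A & B| / |A | B| over the n-gram sets of two names."""
    set_a, set_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not set_a and not set_b:
        # Assumed convention: two names with no n-grams are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

In a detection setting, similarity scores of this kind would typically be computed between a candidate name and reference sets of benign and DGA names, and then passed to a classifier such as an SVM.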