From: Use of subword tokenization for domain generation algorithm classification
Detection or classification problem | Features | ML models | Dataset | Results |
---|---|---|---|---|
Detection (Almashhadani et al. 2020) | Lexical features | DT, SVM, kNN | 85,000 benign, 85,000 DGA from 20 classes | F1 scores: 0.9437 (DT), 0.9411 (SVM), 0.9443 (kNN) |
Detection (Wang et al. 2022) | Distance-based features (KL distance, edit distance, Jaccard index) | SVM, NN | 10,000 benign, 10,000 DGA from 12 classes | Accuracy close to 1 |
Classification (Vranken and Alizadeh 2022) | TF-IDF of the n-grams in domain names | SVM, MLP, RF, DT, kNN | 583,9543 benign, 492,800 DGA from 57 classes | F1 scores: 0.7573 (SVM), 0.7759 (MLP), 0.6284 (RF), 0.6443 (DT) |
| | Lexical features | Adaboost, NN, RF, SVM, DT, kNN | 10,000 benign, 10,000 from each of 50 DGA classes | F1 scores: Detection 0.556–0.989, Classification 0.297–0.769 |
Detection and Classification (Cucchiarelli et al. 2021) | n-gram features | MLP | 10,000 benign, 10,000 from each of 50 DGA classes | F1 scores: Detection 0.964, Classification 0.823 |
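
Several rows above rely on character n-grams of the domain name, either directly as features or through a set-overlap measure such as the Jaccard index. The following is a minimal sketch of those two building blocks, not the implementation used in any of the cited studies. The function names `char_ngrams` and `jaccard_index` and the default `n = 2` are assumptions chosen for illustration.

```python
def char_ngrams(domain: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of a domain label.

    Illustrative helper (hypothetical name); the input is lower-cased
    because domain names are case-insensitive.
    """
    label = domain.lower()
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def jaccard_index(a: str, b: str, n: int = 2) -> float:
    """Jaccard index between the n-gram sets of two domain labels.

    Defined here as |A ∩ B| / |A ∪ B|; two labels that are both too
    short to produce any n-gram are treated as identical (assumption).
    """
    set_a = set(char_ngrams(a, n))
    set_b = set(char_ngrams(b, n))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(char_ngrams("google"))
    print(jaccard_index("google", "goggle"))
```

A detector in the style of the distance-based row would typically compare a candidate name's n-grams against those of known benign names and pass the resulting scores to a classifier. The choice of `n` and of the reference set is left open here.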