Use of subword tokenization for domain generation algorithm classification

Domain generation algorithm (DGA) classification is an essential but challenging problem. Both feature-extracting machine learning (ML) methods and deep learning (DL) models such as convolutional neural networks and long short-term memory have been developed. However, the performance of these approaches varies across different types of DGAs. Most features in the ML methods characterize random-looking DGAs better than word-looking DGAs. To improve classification performance on word-looking DGAs, subword tokenization is employed in the DL models. Our experimental results show that subword tokenization provides excellent classification performance on word-looking DGAs. We then propose an integrated scheme that chooses an appropriate method for DGA classification depending on the nature of the DGAs. Results show that the integrated scheme outperforms existing ML and DL methods, as well as the subword-based DL methods.


Introduction
Network attacks are common nowadays. A botnet is a group of Internet-connected devices such as Internet-of-Things (IoT) devices and computers that are controlled by third-party software to perform specific tasks without the knowledge of the device owners (Negash and Che 2015; Kambourakis et al. 2019). The instructions are sent from a "command and control" (C&C) server. In the past, the C&C server had a fixed IP address or domain name for communicating with the bots. However, such communication can be blocked easily by approaches such as a blacklist. To better avoid detection, recent bots and C&C servers employ domain generation algorithms (DGA) for communication (Vormayr et al. 2017). The algorithm generates a large set of domain names, but only a few are registered. The bots try resolving these domain names successively until they are able to connect to the C&C server. This approach hides the C&C server, which makes its detection difficult.
There are different DGA generation schemes, such as hash-based, arithmetic-based, wordlist-based, and permutation-based methods (Wang and Guo 2021; Wang et al. 2020). The hash-based and arithmetic-based methods produce random-looking domain names such as "ukphbhncsdpgo.com" that appear to be different from legitimate domains. The wordlist-based and permutation-based methods produce word-looking domain names by joining or permuting words from a dictionary. Several approaches have been proposed for detecting and classifying these domain names, including machine learning-based and deep learning-based methods (Almashhadani et al. 2020; Berman 2019; Cucchiarelli et al. 2021; Qiao et al. 2019; Ren et al. 2020; Selvi et al. 2021; Vij et al. 2020; Vranken and Alizadeh 2022; Wang et al. 2022; Hoang and Vu 2022). A recent review paper (Saeed et al. 2021) provides an overview of these methods. Detection involves determining whether a domain name is generated by a DGA scheme, whereas classification aims to identify the DGA method used to generate the domain name. As reported in Zago et al. (2020a), binary detection achieves over 90% accuracy, while multi-class classification achieves around 70%. This indicates that identifying the DGA method used to generate domain names is more challenging than distinguishing them from legitimate ones (Cucchiarelli et al. 2021; Ren et al. 2020; Vij et al. 2020). In particular, some classes have extremely low classification accuracy. Therefore, there is much room for further improvement in multi-class classification.
In this paper, we study the classification performance of existing approaches on random-looking and word-looking domains. In machine learning-based approaches, features describing the character distributions in the domain names have been used. Examples include vowel-consonant ratios or the ratios of numerals to English characters (Vranken and Alizadeh 2022). Most of the extracted features primarily target random-looking domains and are effective in characterizing the randomness in the domain names. This, however, results in inferior performance in characterizing word-looking domain names (Ren et al. 2020). In contrast, existing deep learning methods are more effective in characterizing word-looking domain names. To further improve the existing deep learning methods, we propose using subword tokenization to study the domain names. The rationale is that subword tokenization is better suited for studying relationships among connecting words than character tokenization (Liew and Law 2022).
In addition, we propose an integrated scheme containing two classifiers. One classifier focuses on characterizing the random-looking DGAs while the other focuses on word-looking domains. We also develop a metric to distinguish between random-looking and word-looking domains. In summary, we study the following research questions in this paper:

• Are there any performance differences between existing machine learning (ML) and deep learning (DL) approaches in DGA multi-class classification? Which type of domain names can be more accurately classified by ML? Which type can be more accurately classified by DL?
• Is subword tokenization better than character tokenization in characterizing word-looking domain names?
• How can we determine whether ML or DL is better suited for detecting the class of a given DGA?
• Can we combine ML and DL approaches to achieve better DGA multi-class classification performance?
The main contributions of this paper are:

• We examine the effectiveness of integrating subword tokenization into DL models to characterize word-looking DGAs. This represents an improvement over existing DL approaches, which primarily use character tokenization only.
• We examine the performance difference between ML and DL in DGA multi-class classification and identify the types of DGA better classified by each. An algorithm is developed to determine whether a testing DGA is better classified by ML or DL.
• A scheme is developed to integrate the advantages of feature-extracting ML and subword tokenization DL in DGA multi-class classification.
The remainder of this article is organized as follows. First, we give an overview of the problem and compare the performance of existing state-of-the-art machine learning and deep learning methods in the "Background" section. Next, we describe the proposed subword tokenization for building a deep learning model in "The proposed subword-based deep learning model (SW-CNN and SW-LSTM)" section. The "Proposed integrated scheme" section then presents the integrated scheme and how we quantify the nature of the DGAs. Experimental results are given in the "Experimental results" section. Finally, we conclude our work in the "Conclusions" section.

Background
There are two problems associated with DGA: DGA detection and DGA classification (Zago et al. 2020a). DGA detection is a binary classification problem; the aim is to distinguish whether a domain name is legitimate or algorithmically generated. DGA classification is a multi-class classification problem; it aims to further classify the domain names according to the DGA method used to generate them. Both machine learning (ML) and deep learning (DL) methods have been used for DGA detection and classification (Almashhadani et al. 2020; Berman 2019; Cucchiarelli et al. 2021; Qiao et al. 2019; Ren et al. 2020; Selvi et al. 2021; Vij et al. 2020; Vranken and Alizadeh 2022; Wang et al. 2022). Interested readers may refer to Saeed et al. (2021) for a recent survey of ML and DL methods for DGA detection.
In ML methods, feature extraction is a crucial step to characterize the nature of the domain names. Features representing expert knowledge are extracted from the domain names to define the characteristics of algorithmically generated domains. Common features can be divided into statistical features, information theory features, and lexicographic features. For example, the vowel-consonant ratio, n-gram distributions, the longest consecutive consonant/number/vowel sequences, pronounceability score, and entropy (Almashhadani et al. 2020; Antonakakis et al. 2012; Bilge et al. 2014) have been used to characterize the nature of algorithmically generated domain names. A detailed list of features can be found in Vranken and Alizadeh (2022).
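As an illustration, two of the features mentioned above can be computed as in the minimal sketch below. The feature names and exact definitions here are illustrative assumptions, not the precise feature set used in the cited works.

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def extract_features(domain: str) -> dict:
    """Compute two illustrative hand-crafted features of a domain name."""
    letters = [c for c in domain if c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    # Vowel-consonant ratio (guard against division by zero)
    vc_ratio = vowels / consonants if consonants else 0.0
    # Shannon entropy of the character distribution (bits per character)
    counts = Counter(domain)
    n = len(domain)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"vc_ratio": round(vc_ratio, 3), "entropy": round(entropy, 3)}

print(extract_features("ukphbhncsdpgo"))  # random-looking: low vowel ratio
print(extract_features("animalforget"))   # word-looking: higher vowel ratio
```

Random-looking names tend to score low on the vowel-consonant ratio and high on entropy, which is why such features separate them well from legitimate domains.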
After features are extracted, various ML models have been used for DGA detection and classification. Table 1 provides a summary of recent methods, including the features used, the ML models, the dataset sizes, and the detection/classification results (Almashhadani et al. 2020; Cucchiarelli et al. 2021; Vranken and Alizadeh 2022; Wang et al. 2022; Zago et al. 2020a). Generally, DGA detection performs much better than classification. For DGA detection, the accuracy or the F1 score is higher than 0.95 for all the methods. For DGA classification, the F1 score is much lower, ranging between 0.297 and 0.823, which illustrates the difficulty of DGA classification. The extracted features can distinguish DGAs from legitimate domains, but they are not sufficient for characterizing different DGA classes. Further research on feature extraction is needed to improve multi-class classification performance.
Deep learning approaches have also been used for DGA detection and classification. Unlike machine learning, no feature extraction is done. Rather, the domain name is treated as a string of characters. Given a sufficient number of examples, a learning model is trained to distinguish and characterize the DGA. In this way, there is no need for manual feature extraction. As domain names consist of characters, tokenization and embedding are required to convert the domain names into numerical sequences. In the literature, the most popular method is character tokenization, in which a domain name is decomposed into a sequence of characters. These characters are encoded independently and mapped to integers in the embedding layer. The resultant numerical sequences are then fed into the DL models for training.
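The tokenize-and-pad step described above can be sketched as follows, assuming a hypothetical vocabulary in which 'a'..'z' map to 1-26 and 0 is reserved for padding (real systems also map digits and special characters):

```python
import string

# Hypothetical character vocabulary: 'a' -> 1, ..., 'z' -> 26; 0 is padding
CHAR2ID = {c: i + 1 for i, c in enumerate(string.ascii_lowercase)}

def char_tokenize(domain: str, max_len: int = 20) -> list:
    """Map each character to an integer, then pad/truncate to max_len."""
    ids = [CHAR2ID.get(c, 0) for c in domain.lower()]
    # Pad (or truncate) so every sequence fed to the model has equal length
    return (ids + [0] * max_len)[:max_len]

print(char_tokenize("shopee"))
```

With this mapping, "shopee" encodes to 19, 8, 15, 16, 5, 5 followed by padding zeros, matching the encoding example given later in the paper.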
Convolutional neural networks (CNN) and long short-term memory (LSTM) are popular DL models. LSTM is often used for acquiring patterns in long sequences in different applications (Qiao et al. 2019; Selvi et al. 2021; Vij et al. 2020; Woodbridge et al. 2016; Xu et al. 2022). For CNN, filter kernels of varied sizes are used to characterize sequence relationships (Berman 2019; Feng et al. 2017; Yu et al. 2017). CNN can also be combined with LSTM to improve detection and classification performance (Ren et al. 2020; Mac et al. 2017). Table 2 provides a summary of recent methods, including the model structures, the dataset sizes, and the detection/classification results. Like the ML results, classification is more challenging than detection. In DGA detection, the F1 score (see Eq. (4)) is always higher than 0.97. However, in classification, the score can be as low as 0.7 for some datasets. There is much room for further improvement in multi-class classification. Also, in most of the experimental settings, the number of benign domain names is much larger than the number of DGAs in each class (Berman 2019; Qiao et al. 2019; Ren et al. 2020; Selvi et al. 2021; Vij et al. 2020). Hence, the overall result may not reflect the performance in each DGA class.

It is important to identify the advantages and limitations of existing approaches in DGA classification. There are two types of DGAs, namely random-looking and word-looking domain names (Selvi et al. 2021). We will study whether these ML and DL models have similar performance on both DGA types. Based on the published classification results, a summary of current state-of-the-art results is given in Table 3.
In most of the experimental setups, there are more random-looking DGAs than word-looking DGAs. Table 3 shows that most setups use 10% or fewer word-looking DGAs to build the model. If the number of word-looking DGAs is small, the training of the DL model may not characterize these word-looking DGAs very well. If there are more word-looking DGAs, DL models like BiLSTM and CNN can provide a good characterization for these domains. For example, in Cucchiarelli et al. (2021), which contains 11 word-looking DGA classes, the F1 score for word-looking DGAs is higher than that for random-looking DGAs.
In general, machine learning-based methods characterize random-looking DGAs better than word-looking DGAs. This is consistent with findings from other authors (Ren et al. 2020). An exception, however, is n-gram features. Because n-gram features specifically capture the distribution of consecutive characters, they model word relationships better than other types of manual features, such as vowel-consonant ratios.
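A minimal sketch of how character n-gram statistics can be collected (the exact n-gram features used by the cited methods may differ):

```python
from collections import Counter

def ngram_profile(domain: str, n: int = 2) -> Counter:
    """Count character n-grams; captures relationships among
    consecutive characters in the domain name."""
    return Counter(domain[i:i + n] for i in range(len(domain) - n + 1))

# Word-looking domains reuse common English bigrams such as "th" or "be";
# random-looking domains produce rare, near-uniform bigram profiles
print(ngram_profile("thesetobewarfarebecomes").most_common(3))
print(ngram_profile("ukphbhncsdpgo").most_common(3))
```

Feeding such counts (or their frequencies relative to an English corpus) into a classifier is one way n-gram features capture word structure that ratio-based features miss.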
Based on the summary, we see there is a performance difference between existing ML and DL approaches. When more word-looking DGAs are used for training, DL performs better on word-looking DGAs than on random-looking DGAs. By combining the use of ML and DL, the classification performance on both word-looking and random-looking domain names can be improved. Existing DL methods use character tokenization. Since n-gram features are good for classifying word-looking domains, inspired by this idea, we investigate whether subword tokenization can better capture the characteristics of word-looking DGAs.

The proposed subword-based deep learning model (SW-CNN and SW-LSTM)
While existing character tokenization can, to a certain extent, characterize the relationship among connecting tokens, word-based tokenization can better preserve the linguistic and semantic structure (Liew and Law 2022) in word-looking DGAs. However, a random-looking DGA does not have any semantic meaning, and thus word-based tokenization would not give a meaningful representation. Hence, we consider a subword tokenization method to tokenize the domain names. Words that can be found in a dictionary are formed as tokens, while the remaining random parts are decomposed into characters for tokenization.
Figure 1 shows the schematic diagram of our proposed subword-based DL model. The model contains two branches for characterizing the domain names: one branch uses character tokenization, and the other uses subword tokenization. These two branches are merged at the end to achieve better domain name classification. Note that each URL goes through both branches so that both character and word relationships can be extracted.
The relevant parts of the domain names are first extracted from the URLs in the pre-processing block. The domain names are extracted as follows (Yu et al. 2017). If the URL contains a second-level domain name, the second-level part is extracted. If it is a third-level domain name, the second-level domain name is checked to see whether it is from a popular dynamic domain name service such as "no-ip.com", "dnsdynamic.org", or "ddns.net". If so, the third-level domain part is extracted. If not, the longer string of the second-level and third-level domain names is extracted. For example, the URL "ab1cf5d50e7da6.com" is processed to give "ab1cf5d50e7da6", while "akboavenifbiuc.ddns.net" produces "akboavenifbiuc" only.
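The extraction rules above can be sketched as follows. The set of dynamic DNS providers here contains only the three examples from the text; the cited work may use a longer list:

```python
# Illustrative list of dynamic DNS providers (the three named in the text)
DYNAMIC_DNS = {"no-ip.com", "dnsdynamic.org", "ddns.net"}

def extract_domain(url: str) -> str:
    """Extract the relevant domain label following the pre-processing rules."""
    parts = url.lower().split(".")
    if len(parts) == 2:                 # second-level domain: keep it
        return parts[0]
    if len(parts) >= 3:                 # third-level domain name
        parent = ".".join(parts[-2:])
        if parent in DYNAMIC_DNS:       # dynamic DNS host: keep third level
            return parts[-3]
        # otherwise keep the longer of the second- and third-level labels
        return max(parts[-3], parts[-2], key=len)
    return url

print(extract_domain("ab1cf5d50e7da6.com"))       # ab1cf5d50e7da6
print(extract_domain("akboavenifbiuc.ddns.net"))  # akboavenifbiuc
```

This sketch ignores multi-label public suffixes such as "co.uk"; a robust implementation would consult the Public Suffix List.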
The output from the pre-processing block is fed into two branches for further analysis. The first branch uses character tokenization while the second uses subword tokenization. In character tokenization, each character is encoded independently. For example, the tokens for "shopee.ph" are {'s', 'h', 'o', 'p', 'e', 'e'}. Each character token is then mapped to an integer. The character set includes the English alphabet, numbers, and special characters. For example, the above tokens may become {19, 8, 15, 16, 5, 5} after encoding. The number of tokens varies with different domain names. However, inputs to the deep learning models need to be uniform with equal lengths. Padding is thus required so that all resultant embedding vectors have uniform lengths. The steps for subword tokenization are similar to those for character tokenization, except that subwords replace characters as tokens. For example, the URL "shopee.ph" becomes "shopee" after pre-processing. Using subword tokenization, the tokens are {'shop', 'ee'}. The tokens preserve common words that can be found in a dictionary. Some more examples are shown in Table 4.
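As a sketch, subword tokenization can be approximated with a greedy longest-match tokenizer over a toy vocabulary. The vocabulary below is a hypothetical word list for illustration; production tokenizers typically use a trained subword vocabulary (e.g., BPE or WordPiece) instead.

```python
# Toy subword vocabulary, chosen only to reproduce the examples in the text
VOCAB = {"shop", "ee", "animal", "for", "get"}

def subword_tokenize(domain: str) -> list:
    """Greedy longest-match subword tokenization with a character fallback."""
    tokens, i = [], 0
    while i < len(domain):
        # Try the longest vocabulary match starting at position i
        for j in range(len(domain), i, -1):
            if domain[i:j] in VOCAB:
                tokens.append(domain[i:j])
                i = j
                break
        else:
            tokens.append(domain[i])  # no match: fall back to one character
            i += 1
    return tokens

print(subword_tokenize("shopee"))        # ['shop', 'ee']
print(subword_tokenize("animalforget"))  # ['animal', 'for', 'get']
```

Random parts of a name fall through to the single-character fallback, so random-looking domains yield mostly one- or two-character tokens, as in the "ramnit" example below.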
For random-looking DGAs such as those in the "ramnit" class, the extracted subwords are short in length. However, for word-looking DGAs, the extracted subwords are English words carrying semantic meaning. Thus, the subsequent deep learning model can be used to characterize the relationship among the connecting words.
As discussed in the "Background" section, both convolutional neural networks (CNN) and Bi-directional Long Short-Term Memory (Bi-LSTM) have been used for DGA detection and classification in the literature (Berman 2019; Qiao et al. 2019; Ren et al. 2020; Selvi et al. 2021; Vij et al. 2020; Woodbridge et al. 2016; Feng et al. 2017; Yu et al. 2017; Mac et al. 2017). These two models are investigated in this study to model the word relationships. The proposed structure is shown in Fig. 1. We call our proposed subword models using CNN and Bi-LSTM SW-CNN and SW-LSTM, respectively. In summary, the proposed models capture the relationships between words and characters, which are combined to generate the final DGA classification.

Proposed integrated scheme
Subword-based DL methods (SW-CNN and SW-LSTM) produce meaningful representations for word-looking DGAs. However, they may not be the best for characterizing random-looking DGAs. It is thus advantageous to use an integrated scheme so that an appropriate model can be adopted for classifying different types of DGAs. Figure 2 shows the integrated scheme, which contains two classifiers: the ML model and the DL model. In the training phase, both the ML and DL models are trained. The ML model can be a random forest, XGBoost, or another classifier that characterizes random-looking DGAs well. The DL model is either the proposed SW-CNN or SW-LSTM, which focuses on characterizing word-looking DGAs. In the testing phase, a randomness indicator classifies the domain names into either the random-looking or the word-looking type so that an appropriate model is adopted for the final classification. By using this proposed integrated method, the DGA can be classified more appropriately depending on its nature.

The randomness index, RIndex, is used to indicate the nature of the DGAs. It is constructed by comparing the subword and character tokenizations. In particular, the change in the number of subword tokens with reference to the number of character tokens is employed. It is defined as

RIndex = 1 − N(Tokens_subword) / N(Tokens_char),     (1)

where N(Tokens_char) and N(Tokens_subword) denote respectively the number of tokens in the character tokenization and the subword tokenization. RIndex is non-negative because N(Tokens_char) ≥ N(Tokens_subword); it is also smaller than 1 because N(Tokens_subword) ≥ 1. If the domain contains mostly words, the number of subword tokens is small, which gives a large RIndex. On the other hand, if the domain is random, the numbers of subword and character tokens are similar, and RIndex is small. As shown in Table 4, the RIndex values of the word-looking DGAs are 0.75 and 0.65. In contrast, for the random-looking DGA, it is 0.46, which is smaller. Hence, RIndex can indicate the nature of the domain names.
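The index is straightforward to compute from the two token counts. The sketch below reproduces the values quoted for the Table 4 examples:

```python
def rindex(n_char_tokens: int, n_subword_tokens: int) -> float:
    """Randomness index: 1 - N(subword tokens) / N(character tokens)."""
    return 1 - n_subword_tokens / n_char_tokens

# "animalforget": 12 character tokens, 3 subword tokens
print(round(rindex(12, 3), 2))   # 0.75 (word-looking)
# "thesetobewarfarebecomes": 23 character tokens, 8 subword tokens
print(round(rindex(23, 8), 2))   # 0.65 (word-looking)
# "ukphbhncsdpgo": 13 character tokens, 7 subword tokens
print(round(rindex(13, 7), 2))   # 0.46 (random-looking)
```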

Experimental results
UMUDGA is a public dataset designed for profiling algorithmically generated domain names in botnet detection. It contains over 30 million domain names from 50 DGA classes (Zago et al. 2020a, b). Out of the 50 classes, 39 produce random-looking domain names and 11 produce word-looking domain names. In our experimental testing, multi-class DGA classification is considered. To construct the dataset, 10,000 legitimate domain names and 10,000 algorithmically generated domain names for each of the 50 classes are collected. The multi-class classification problem thus becomes classifying the domain names into one of 51 classes. This setting is the same as that in Cucchiarelli et al. (2021) and Zago et al. (2020a).
To evaluate and compare the performance, the precision, recall, and F1 scores are used. They are defined as follows:

Precision = TP / (TP + FP),     (2)
Recall = TP / (TP + FN),     (3)
F1 = 2 × Precision × Recall / (Precision + Recall),     (4)

where TP, FP, and FN stand for true positive, false positive, and false negative respectively for each class. For a class A, TP is the number of samples that are in A and are identified as A. FP is the number of samples that are not in A but are identified as A. FN is the number of samples that are in A but are identified as not in A. Precision is the percentage of samples in the classification results that are correctly classified. Recall is the percentage of samples in a class that are correctly classified. In a perfect classification, both precision and recall are 1.

In practice, increasing precision may decrease recall. The F1 score, the harmonic mean of precision and recall, is thus used to quantify the overall performance. We will first compare the performance of subword tokenization and character tokenization in characterizing the word-looking and random-looking DGAs. We will then evaluate the performance of our proposed integrated scheme. Additionally, we will study the effectiveness of using RIndex to identify the nature of the domain names.
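The definitions in Eqs. (2)-(4) translate directly into code. In this minimal sketch, the guards against empty denominators are our addition:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Per-class precision, recall, and F1 from the confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1(tp=80, fp=20, fn=20)
print(p, r, round(f, 3))   # 0.8 0.8 0.8
```

For the multi-class results reported below, these per-class scores are averaged across the 51 classes.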

Performance of subword tokenization in SW-CNN and SW-LSTM
In this part, we study the performance of using the subword tokenization in Fig. 1 as compared to the existing state-of-the-art approaches, which use character tokenization only. Table 5 summarizes the performance of the proposed SW-CNN and SW-LSTM and compares it with the existing CNN (Ren et al. 2020) and LSTM (Cucchiarelli et al. 2021) approaches.
We can see that the use of subword encoding improves the overall classification performance. In CNN, the overall average accuracy improves from 0.7304 to 0.7589 by incorporating the subword encoding. For Bi-LSTM, the accuracy increases from 0.7304 to 0.7568. Hence, subword encoding helps achieve better DGA classification.
To gain a further understanding of the classification performance, we examine the performance of the two models on both random-looking and word-looking DGAs. As shown in Table 5, the subword information is beneficial for characterizing word-looking DGAs. For both CNN and BiLSTM, the addition of subword encoding significantly improves the F1 score, precision, and recall on word-looking DGAs, yielding improvements ranging from 9.59% to 13.64%. The average F1 for word-looking DGAs improves from 0.8536 to 0.9364 in SW-CNN and from 0.8345 to 0.9436 in SW-LSTM. Thus, subword tokenization is capable of modeling word relationships, which can be used to characterize word-looking DGAs well.
While Table 5 shows the average performance, a boxplot is employed to show the distribution of the F1 score for each DGA class. Figure 3 shows the boxplots of the F1 scores obtained from the proposed SW-CNN and SW-LSTM as compared to the existing CNN and LSTM approaches on both random-looking and word-looking DGAs. For random-looking DGAs, the performance is similar regardless of whether subword information is used. However, for word-looking DGAs, the F1 score is significantly improved. As shown in the boxplot, the third quartile of the F1 score for both SW-CNN and SW-LSTM has been improved significantly by using the subword information. Results show that the performance on each of the word-looking DGA classes can be improved by using the proposed subword approaches. Boxplots for precision and recall look similar and thus are not shown.

Performance of the integrated scheme
The integrated scheme requires a setting for RIndex. Figure 4 shows the boxplot of the distribution of the RIndex value for the word-looking and random-looking DGAs. We can clearly see that word-looking DGAs and random-looking DGAs have vastly different distributions of the RIndex value. The third quartile of the RIndex value for random-looking DGAs is 0.4545, which is much smaller than the first quartile of the RIndex value for word-looking DGAs (0.6471). Hence, by using the RIndex value, one can easily distinguish whether the domain is random or is formed by concatenating words from a dictionary. In the experiment, we set the threshold of the RIndex value to 0.55.
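The dispatch logic of the integrated scheme can be sketched as follows. The classifier stubs and helper names are placeholders of our own; only the 0.55 threshold comes from the text:

```python
THRESHOLD = 0.55  # RIndex cutoff chosen from the quartile analysis above

def rindex_from_tokens(n_char: int, n_subword: int) -> float:
    return 1 - n_subword / n_char

def classify(domain: str, n_char: int, n_subword: int,
             ml_model=lambda d: "ml-label",   # placeholder for RF/XGBoost
             dl_model=lambda d: "dl-label"):  # placeholder for SW-CNN/SW-LSTM
    """Route a domain to the DL or ML classifier based on its RIndex."""
    if rindex_from_tokens(n_char, n_subword) >= THRESHOLD:
        return dl_model(domain)   # word-looking: use the subword DL model
    return ml_model(domain)       # random-looking: use the feature-based ML model

print(classify("animalforget", 12, 3))    # RIndex = 0.75 -> DL branch
print(classify("ukphbhncsdpgo", 13, 7))   # RIndex = 0.46 -> ML branch
```

In deployment, the two stubs would be replaced by the trained random forest (or XGBoost) and SW-CNN (or SW-LSTM) models from the training phase.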
From the "Background" and "Performance of subword tokenization in SW-CNN and SW-LSTM" sections, the performance of the models varies depending on the type of DGA being classified. In particular, the extracted features in machine learning approaches (Zago et al. 2020a) perform better on random-looking DGAs, while our proposed SW-CNN and SW-LSTM perform better on word-looking DGAs. The advantages of these two approaches are combined in the proposed integrated scheme to achieve better DGA classification. In the following, we examine the performance of the integrated scheme using random forest (RF) and XGBoost.

Integrated scheme with random forest
In this sub-section, we consider integrating the random forest classifier with the proposed SW-CNN and SW-LSTM models. Table 6 shows the classification results.
The overall performance of the integrated scheme is better than that of RF, SW-CNN, SW-LSTM, or CNN-BiLSTM. In the CNN case, the improvements of the integrated scheme over RF, SW-CNN, and CNN-BiLSTM are 7.86%, 4.01%, and 3.90% respectively. For LSTM, the improvements are 8.01%, 4.44%, and 4.04% respectively. We also examine the performance on the random-looking and word-looking DGAs. We can see that the integrated approach produces a good classification for both DGA types. For the word-looking DGAs, the integrated approach shows significant improvement over the random forest approach, as the subword tokenization can better characterize the word relationships. For the random-looking DGAs, the performance of the integrated scheme matches that of the feature-extracting ML approach. As shown in the boxplot in Fig. 5, some random-looking DGA classes have poor classification performance with SW-CNN and SW-LSTM. However, the integrated approach improves these classes because it chooses the ML approach for their classification. Thus, the overall classification performance on random-looking DGAs in the integrated approach is better than that of SW-CNN and SW-LSTM.

Integrated scheme with XGBoost
In this sub-section, we consider integrating the XGBoost classifier with the proposed SW-CNN and SW-LSTM models. Table 7 shows the classification results and Fig. 6 shows the boxplot. Similar to the random forest case, the integrated scheme performs better than the individual classifiers as well as the CNN-BiLSTM, which uses character tokenization. Comparing the integrated schemes with random forest (Table 6) and XGBoost (Table 7), the performance of XGBoost is slightly better than that of random forest.

Fig. 1 The schematic diagram of the proposed SW-CNN and SW-LSTM DL models

Fig. 2 The proposed integrated scheme, a training phase, and b testing phase

Fig. 3 The boxplots of the F1 score for a the random-looking DGAs and b the word-looking DGAs. Note that SW-CNN and SW-LSTM denote our proposed subword DL models, while char-CNN and char-LSTM denote the CNN and LSTM models using character tokenization

Fig. 4 The boxplot of the distribution of RIndex for word-looking and random-looking DGAs

Fig. 5 The box plots of the F1 score for a the random-looking DGAs and b the word-looking DGAs

Table 1 A summary of recent ML methods for DGA detection and classification


Table 2 A summary of DL methods for DGA detection and classification

Table 3 A summary of the F1 score for word-looking and random-looking DGAs from existing state-of-the-art DGA classification methods. Bold values indicate the type of DGA that has a better result for each method. W and R denote respectively the numbers of word-looking and random-looking classes

Table 4 Examples of character and subword tokenization for different DGA types

Word-looking: "animalforget". Character tokenization: "a", "n", "i", "m", "a", "l", "f", "o", "r", "g", "e", "t". Subword tokenization: "animal", "for", "get".
Word-looking (rovnix): "thesetobewarfarebecomes.net". Character tokenization: "t", "h", "e", "s", "e", "t", "o", "b", "e", "w", "a", "r", "f", "a", "r", "e", "b", "e", "c", "o", "m", "e", "s". Subword tokenization: "these", "to", "be", "war", "fare", "be", "com", "es".
Random-looking (ramnit): "ukphbhncsdpgo.com". Character tokenization: "u", "k", "p", "h", "b", "h", "n", "c", "s", "d", "p", "g", "o". Subword tokenization: "uk", "p", "hb", "hn", "cs", "dp", "go".

Table 5 A summary of the performance of the proposed subword DL models (SW-CNN and SW-LSTM) and their character counterparts