
HSS: enhancing IoT malicious traffic classification leveraging hybrid sampling strategy

Abstract

Using deep learning models for classification tasks in network traffic offers a new way to address the imbalanced Internet of Things (IoT) malicious traffic classification problem. However, deploying these models can be difficult due to their high resource consumption and limited interpretability. Fortunately, the effectiveness of statistically grounded sampling methods on imbalanced data distributions points the way forward. In this paper, we address these challenges by proposing a hybrid sampling method, termed HSS, which integrates undersampling and oversampling techniques. Our approach not only mitigates the imbalance in malicious traffic but also fine-tunes the sampling threshold to optimize performance, as substantiated through validation tests. Applied to three distinct classification tasks, this method furnishes simplified yet representative samples, improving the baseline models' classification capability by at least 6.02% and at most 182.66%. Moreover, it notably reduces resource consumption, shrinking the number of samples by at least 83.53%. This investigation serves as a foundation, demonstrating the efficacy of HSS in bolstering security measures in IoT networks and potentially guiding the development of more capable and resource-efficient solutions.

Introduction

The proliferation of the IoT has profoundly transformed various facets of daily life, integrating a multitude of terminal devices such as cameras, smart speakers, and air conditioners into the Internet ecosystem (Al-Garadi et al. 2020; Zhou et al. 2020).

This evolution, though bringing a smarter and more convenient lifestyle, has escalated the vulnerability to privacy and security breaches, with smart devices becoming potential harbors for malicious traffic infiltration. The sophisticated network topology of IoT houses a diverse range of terminal devices, making the landscape of malicious traffic vast and complicated, often characterized by a distinct imbalance in data stemming from the varied frequency and impact of different types of attacks (Rajmohan et al. 2022).

Despite remarkable advances in traffic classification techniques, the unique challenges posed by IoT malicious traffic have exposed the inadequacies of conventional methods, especially in striking a balance between classification capability and resource consumption; models such as AdaBoost and Random Forest require substantial resources and struggle to classify small samples well (Mahmud et al. 2020). Deep learning models have been a cornerstone for various classification tasks due to their impressive performance, particularly in image classification (Rezaei and Liu 2019). Nevertheless, their complexity and high resource demands, such as the need for Graphics Processing Units (GPUs), pose significant hurdles, necessitating classification methods that are not only resource-efficient but also strongly interpretable, so as to improve the efficiency of Intrusion Detection Systems (IDS).

To address this gap, adopting sampling strategies to reduce redundancy and balance the distribution of the training dataset in traffic analysis has attracted considerable research attention. Building on this concept, our study sets out to devise a data preprocessing method grounded in sampling principles, thereby facilitating better interpretability and performance. A critical concern in optimizing this method is identifying a sampling threshold that strikes a fine equilibrium between model efficiency and resource expenditure, a focal point of our research. In this study, we propose the Hybrid Sampling Strategy (HSS), a novel approach that modulates the sampling threshold to balance model performance and resource utilization. Our method embodies a dual approach, incorporating undersampling to reduce the redundancy in abundant samples and an enhanced oversampling technique to expand the classes with scarce counts, thereby creating a more balanced and rational distribution of sample data. The main contributions of this paper are threefold.

  • We present an undersampling method founded on the concept of minimum sample thresholds. This method significantly reduces the abundance of redundant samples within the dataset, leading to an enhanced balance in the distribution of data samples on various IoT malicious traffic types.

  • We employ an enhanced oversampling technique to augment the sample size of data with low counts. This approach considers the specific feature types in IoT malicious traffic data, thereby creating a more balanced and rational distribution of sample data.

  • We propose the HSS data preprocessing approach. This method enhances the performance of fundamental machine learning models, such as AdaBoost and Random Forest, in the multi-class classification of imbalanced IoT malicious traffic data. Moreover, it effectively curtails the models' resource consumption.

The rest of this paper is organized as follows. Related work is briefly reviewed in “Related work” section. “Preliminaries” section defines the scope of the problem along with preliminaries. The architecture and details of the method proposed are presented in “Method design” section, and the evaluation results are demonstrated in “Evaluation” section. In the end, “Conclusion” section concludes this paper.

Related work

Due to its fundamental nature and heuristic significance, the sampling of imbalanced datasets has attracted continuous interest. In this section, we review the latest techniques used in traffic classification, key research on imbalanced data preprocessing through sampling, and the use of sampling in traffic classification tasks.

The latest technique used in traffic classification

Approaches based on machine learning and deep learning are widely used in traffic classification tasks, especially deep learning based methods. Lotfollahi et al. (2020) devise a deep learning architecture based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks to handle encrypted traffic classification. Wang et al. (2017) fully exploit the strong capability of CNNs in image classification, reshaping the traffic data bytes into a \([28 \times 28]\) image and feeding it into a customized CNN architecture. Yang et al. (2021) show that a CNN performs better than XGBoost in traffic classification. In contrast, Lichy et al. (2023) find that Random Forest, an ML model like XGBoost, performs better in encrypted malware traffic classification. Besides CNNs, Graph Neural Networks (GNN) have also caught researchers' attention in recent years. Huoh et al. (2022) propose a flow-based encrypted traffic classification GNN, which outperforms the CNN. Zhang et al. (2023) devise the Temporal Fusion Encoder using Graph Neural Networks (TFE-GNN) for feature extraction, which outperforms multiple state-of-the-art methods in fine-grained encrypted traffic classification tasks. Alghamdi and Bellaiche (2023) devise an ensemble deep learning IDS for IoT using the Lambda architecture to improve defense capability. The studies above show that deep learning methods do have strong traffic classification capability. However, almost all deep learning models require GPU support, which makes them expensive to deploy (Mittal 2016; Li et al. 2019; Mittal 2019; Mittal and Vaishay 2019).

Imbalanced data preprocessing through sampling

Undersampling and oversampling techniques have been employed to balance the majority and minority classes in imbalanced datasets. Undersampling reduces the majority samples to alleviate imbalance and can be classified into three groups: selecting samples to keep, selecting samples to delete, and combining both approaches. The Condensed Nearest Neighbor Rule (CNNR) (Hart 1968; Li et al. 2018) is a representative method that selects a subset of examples from the majority class that can represent the entire majority class. Tomek Links Undersampling (TLU) (Swana et al. 2022) is an instance of selecting samples to delete: it removes the samples that form Tomek links. To enhance undersampling performance, One-Sided Selection applies TLU followed by CNNR to eliminate noisy samples (Cho et al. 2020). Oversampling increases the number of minority samples to mitigate the imbalance problem. It can be categorized into two groups: duplicating existing samples and generating new synthetic samples. Random Over-Sampling Examples (ROSE) creates new samples by randomly sampling with replacement from the currently available samples (Abdi and Hashemi 2015; Xiaolong et al. 2019). Unfortunately, ROSE may lead to overfitting, resulting in poor performance on new data. Compared with ROSE, the Synthetic Minority Oversampling Technique (SMOTE) is a more sophisticated and effective approach that synthesizes new examples from the minority samples based on the idea of k-nearest neighbors (Chawla et al. 2002). To deal with different types of data, SMOTE has many variants, such as Borderline SMOTE, SVM SMOTE, K-means SMOTE, and SMOTE for Nominal and Continuous features (SMOTENC) (Han et al. 2005; Tang et al. 2008; Douzas et al. 2018).

Use of sampling in traffic classification tasks

Though deep learning methods, which do not require sampling, have attracted much attention in recent research, sampling methods remain practical in various traffic classification tasks. Du et al. (2021) propose Self-adaptive Probabilistic Sampling (SaPS) for classifying elephant flows, which alleviates both the time and memory consumption of elephant-flow detection. Rust-Nguyen et al. (2023) compare different levels of SMOTE in terms of machine learning performance on darknet traffic classification and find that accuracy and F1-score improve with an appropriate level. Qin et al. (2023) devise the SD sampling method based on SMOTE to alleviate the imbalance of network traffic data.

Different from the aforementioned studies, our work aims not only to reduce the majority samples with undersampling based on the minimum-sample class but also to expand the minority samples with SMOTENC, taking the complicated features of IoT malicious traffic data into account. The most closely related work is Chen et al. (2021), which uses the Neighbor Cleaning Rule (NCL) for undersampling and SMOTE for oversampling to balance the dataset, although that work does not focus on traffic classification.

Preliminaries

In this section, we briefly analyze the sampling demands and principles within imbalanced IoT malicious traffic data.

Hybrid sampling demand

For most traffic datasets, the traffic is divided into no more than ten categories. Additionally, most traffic classification methods employ a single strategy, such as feature selection, undersampling, or oversampling. However, the imbalanced IoT malicious traffic dataset considered here consists of more than thirty traffic classes, so a single sampling method cannot sufficiently balance the data distribution or reduce resource consumption. Moreover, the ensemble models included in the baseline make feature selection redundant. Accordingly, there is a pressing demand for an imbalanced data preprocessing method that can not only reduce the number of samples but also alleviate the imbalance.

Principle of sampling method in imbalanced traffic data

Fig. 1 Relationship between performance and sampling

Due to the uneven distribution of malicious traffic data and the need for interpretability in cybersecurity, sampling methods such as undersampling and oversampling are still commonly employed in traffic measurement, network attack, and defense. Undersampling and oversampling adjust the distribution of samples in a dataset to alleviate imbalance: undersampling reduces the number of majority-class samples relative to the minority classes, while oversampling increases the number of minority-class samples by generating artificial samples with the characteristics of those classes. To assess the effectiveness of these sampling strategies on the imbalanced IoT malicious traffic dataset, we conducted a feasibility validation experiment. Initially, we employed stratified random sampling to extract 5% of the samples from the entire dataset, creating the stratified sampling set \(S_{f1}\) to avoid unnecessary computational overhead. Subsequently, we applied the HSS to \(S_{f1}\), generating the hybrid sampling set \(S_{f2}\). Finally, we fed \(S_{f2}\) into a machine learning model, such as AdaBoost, to evaluate the performance of HSS. Figure 1 not only illustrates the relationship between the comprehensive model performance (e.g., F1-score) and the sampling rate, but also provides empirical support for the viability of HSS. The sampling rate increases with the number of iterations, and the model performance increases until it reaches a local upper bound, then tends to decrease. Larger sampling thresholds do not consistently translate to superior performance: undersampling with an excessive threshold might not effectively address the imbalance, and aggressive oversampling of minority classes could introduce excess noise. The HSS proposed in this paper combines undersampling and oversampling to simultaneously decrease the number of majority samples and increase the number of minority samples, choosing an appropriate sampling threshold through validation and thereby creating a more rational sample distribution. As a result, the number of samples is effectively reduced, which in turn speeds up model training.
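A minimal sketch of this feasibility check, assuming the traffic data is available as a pandas DataFrame with a "label" column (the file path, column name, and split ratios are illustrative only, not the authors' released code):

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("ciciot2023_flows.csv")                 # hypothetical path
X, y = df.drop(columns=["label"]), df["label"]

# Stratified random sampling: keep 5% of the instances as S_f1.
_, X_f1, _, y_f1 = train_test_split(X, y, test_size=0.05,
                                    stratify=y, random_state=0)

# S_f2 = HSS(S_f1) would be produced here (see "Method design").
# Train a lightweight model and read off the F1-score.
X_tr, X_val, y_tr, y_val = train_test_split(X_f1, y_f1, test_size=0.2,
                                            stratify=y_f1, random_state=0)
clf = AdaBoostClassifier(random_state=0).fit(X_tr, y_tr)
print("macro F1:", f1_score(y_val, clf.predict(X_val), average="macro"))
```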

Method design

Method overview

We devise an imbalanced IoT malicious traffic data preprocessing method, HSS, which alleviates the imbalance problem in traffic datasets by leveraging hybrid sampling. Figure 2 shows the architecture of HSS. The whole method consists of three steps: dataset partition, the HSS core, and validation and test. In dataset partition, the IoT malicious traffic dataset is divided into three parts, a training set, a validation set, and a testing set, at an appropriate ratio. Afterward, in the HSS core, the sampling procedure is separated into undersampling and oversampling. Based on the distribution of the imbalanced IoT malicious traffic dataset, the minimum threshold sampling (MTS) approach is designed to determine the sampling number for undersampling. Oversampling determines the target sample counts of the minority classes relative to the majority classes and utilizes SMOTENC to expand the minority samples. Finally, in validation and test, HSS adjusts its undersampling threshold according to the validation results until the baseline models perform well on the validation set. HSS thus combines undersampling using MTS with oversampling using SMOTENC.

Fig. 2 Method architecture of HSS

Undersampling utilizing MTS

Undersampling techniques reduce the number of majority samples to shrink the gap between majority and minority classes, which may alleviate the imbalance in datasets. In this paper, we analyze the characteristics of the sample distribution and propose the MTS method. First, we divide the dataset into a training set, a validation set, and a testing set at a ratio of 8:1:1. Additionally, we denote the size of the smallest class by \(|N_{min}|\) and set the threshold \(t=a|N_{min}|\), where a is a positive integer that can be adjusted to control the threshold. The sampling numbers of the different classes are given by

$$\begin{aligned} |N_{SA}| = \begin{cases} |N_{A}|, & |N_{A}| < t,\\ t, & |N_{A}| \ge t, \end{cases} \end{aligned}$$
(1)

where \(|N_{SA}|\) is the sampling number of class A and \(|N_A|\) is the total number of instances in class A. After the sampling procedure, the undersampling dataset \(S_{us}^{i}\) is generated, where i denotes the i-th sampling and us indicates that the sampling method is undersampling. \(S_{us}^{i}\) is used as the training set for the baseline models. The value of the threshold t is adjusted according to the validation of the models trained with \(S_{us}^{i}\). The threshold selection procedure continues to enhance the models' performance until the performance shows a downward trend or the predetermined sampling limit is reached. The collection of generated sampling sets is represented as \(S_{us}=\{S_{us}^{1},S_{us}^{2},\ldots ,S_{us}^{m}\}\). The sampling set leading to the best model performance can be formulated as

$$\begin{aligned} \arg \max _{S_{us}^{i}\in S_{us}}{Score(S_{us}^{i})}, \end{aligned}$$
(2)

where \(Score(S_{us}^{i})\) is the composite score designed to describe the performance of the models trained with \(S_{us}^{i}\). \(Score(S_{us}^{i})\) is a weighted average of accuracy, recall, precision, and F1-score, given by \(Score(S_{us}^{i})=0.2 \times Accuracy(S_{us}^{i})+0.3 \times Recall(S_{us}^{i})+0.2 \times Precision(S_{us}^{i})+0.3 \times F1(S_{us}^{i})\), in which recall and F1-score have higher weights because they better reflect the models' comprehensive performance, for instance, the performance in classifying small samples.
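A sketch of MTS undersampling (Eq. 1) and the weighted validation Score, assuming a pandas DataFrame with a "label" column; macro averaging of the per-class metrics is our assumption rather than something stated in the text:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def mts_undersample(train_df, a, label_col="label", seed=0):
    counts = train_df[label_col].value_counts()
    t = a * counts.min()                          # threshold t = a * |N_min|
    parts = []
    for cls, n in counts.items():
        cls_df = train_df[train_df[label_col] == cls]
        # Eq. (1): keep all instances if |N_A| < t, otherwise sample t of them
        parts.append(cls_df if n < t else cls_df.sample(n=t, random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)   # shuffle

def weighted_score(y_true, y_pred):
    # Score = 0.2*Accuracy + 0.3*Recall + 0.2*Precision + 0.3*F1
    return (0.2 * accuracy_score(y_true, y_pred)
            + 0.3 * recall_score(y_true, y_pred, average="macro")
            + 0.2 * precision_score(y_true, y_pred, average="macro")
            + 0.3 * f1_score(y_true, y_pred, average="macro"))
```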

Oversampling utilizing SMOTENC

SMOTENC is a variant of SMOTE that uses interpolation to generate minority samples and thus balance the distribution of sample types. It considers both numerical and categorical features to ensure that the generated data are distributed more rationally. Since IoT malicious traffic data possess a variety of features, including numerical and categorical ones, SMOTENC can generate more suitable data than other methods. SMOTE generates new samples based on the idea of k-nearest neighbors, and its procedure can be described as

$$\begin{aligned} x_{new} = x_{i} + \lambda \times (x_{zi} - x_{i}), \end{aligned}$$
(3)

where \(x_{i}\) is an original sample, \(x_{zi}\) is one of the k nearest neighbors of \(x_{i}\), \(\lambda\) is a random number in [0, 1], and \(x_{new}\) is the generated sample. Compared with SMOTE, SMOTENC has two main differences (Gök and Olgun 2021). One is that the nearest-neighbor search does not rely solely on the Euclidean distance. The other is that, for each categorical feature, the value of a new sample is set to the most common category observed among the neighboring samples of the same class. The principle of SMOTENC is illustrated as

$$\begin{aligned} x_{new} = x_{i} + \lambda \times (x_{zi} - x_{i}) + \delta , \end{aligned}$$
(4)

where \(\delta\) represents a bias vector that considers the categorical features. This bias vector plays a critical role in ensuring the alignment of synthetic samples with their neighboring counterparts. In the context of the malicious traffic dataset, categorical attributes such as FIN flag, ACK flag, etc., are carefully considered for the computation of \(\delta\). For a specific feature denoted as i, \(\delta _i\) is computed as a vector component of \(\delta\) and can be expressed as

$$\begin{aligned} \delta _{i} = f(c_i, N) - c_i. \end{aligned}$$
(5)

In this equation, \(c_i\) signifies the i-th categorical feature of the original sample \(x_{i}\) mentioned previously. The set N comprises the k nearest neighbors of \(x_{i}\). The function \(f(c_i, N)\) returns the most common value of \(c_i\) observed within the set N. Notably, we take the mode of each categorical feature as the most common value, which enhances the resemblance of synthetic samples to the original samples.
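One way to realize this oversampling step is the SMOTENC implementation in the imbalanced-learn library; the column indices of the categorical flag features below are placeholders, and X_us, y_us stand for the output of the undersampling stage:

```python
from imblearn.over_sampling import SMOTENC

cat_idx = [10, 11, 12]        # hypothetical positions of FIN/ACK-style flag columns
sampler = SMOTENC(categorical_features=cat_idx, k_neighbors=5, random_state=0)
X_over, y_over = sampler.fit_resample(X_us, y_us)
```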

Similar to undersampling, oversampling needs a sampling threshold to help decide the sampling number. We use a nearest same-order-of-magnitude approximation to find the oversampling threshold T, whose procedure can be illustrated as

$$\begin{aligned} T = c \times \left\lceil {\frac{|N_A|}{10^{\tau }}} \right\rceil \times 10^{\tau }, \end{aligned}$$
(6)

where \(c=1+0.2 \times \frac{|N_{max}|}{|N_A|}\) is a coefficient used to adjust the threshold according to the data distribution, \(|N_{max}|\) is the number of instances of the largest class in the undersampled dataset, \(|N_A|\) is the number of instances in class A, and \(\tau = \lfloor \log_{10}{|N_A|} \rfloor\) represents the order of magnitude of \(|N_A|\).
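A sketch of the oversampling threshold T of Eq. (6); the base-10 logarithm is our assumption, consistent with "order of magnitude":

```python
import math

def oversampling_threshold(n_a, n_max):
    tau = math.floor(math.log10(n_a))            # tau = floor(log10 |N_A|)
    c = 1 + 0.2 * (n_max / n_a)                  # distribution-aware coefficient
    return int(c * math.ceil(n_a / 10 ** tau) * 10 ** tau)

# Example: |N_A| = 230 and |N_max| = 4600 give tau = 2, c = 5.0, and T = 1500.
```

The per-class T values can then be collected into a dictionary and passed to SMOTENC via its sampling_strategy argument to set the target size of each minority class.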

In summary, the entire HSS process can be described as follows. Symbol a represents the current iteration number, while times signifies the total number of iterations determined by the user. \(score_{\text {last}}\) signifies the locally largest performance, and \(data_{s}\) is the corresponding dataset generated by HSS.

Algorithm 1 Illustration of HSS procedure.
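Since Algorithm 1 appears only as a figure here, a compact sketch of the loop it describes is given below, reusing the helpers sketched earlier; hss_oversample is a hypothetical wrapper around the SMOTENC step (Eqs. 4-6), and `times` bounds the number of iterations:

```python
def hss(train_df, val_df, make_model, times, label_col="label"):
    score_last, data_s = -1.0, None
    for a in range(1, times + 1):                 # a scales the MTS threshold
        s_us = mts_undersample(train_df, a, label_col)
        X_s, y_s = hss_oversample(s_us, label_col)        # SMOTENC step
        model = make_model().fit(X_s, y_s)
        y_pred = model.predict(val_df.drop(columns=[label_col]))
        score = weighted_score(val_df[label_col], y_pred)
        if score <= score_last:                   # performance starts to decline
            break
        score_last, data_s = score, (X_s, y_s)
    return data_s, score_last
```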

Evaluation

In this section, we not only present the distribution of samples after sampling but also evaluate the baseline models trained on the dataset generated by HSS in terms of resource consumption and classification capability.

Evaluation setup

Platform The system characteristics of the experimental platform are as follows. The operating system is Ubuntu 20.04 LTS running on an 8-core 1.9 GHz AMD Ryzen R7-5800U CPU, and the system memory is 16 GB. The processor has a 512 KB L1 cache, a 4 MB L2 cache, and a 16 MB L3 cache.

Dataset We choose CICIoT-2023 (Neto et al. 2023) as our experimental dataset. This dataset consists of 46,686,579 IoT malicious traffic instances, each with forty-five features, which can be categorized into thirty-four classes (thirty-three attack types and one benign type) and were collected from a real IoT topology with one hundred and five devices. The thirty-three attack types are grouped into seven malicious traffic types (i.e., DDoS, DoS, Recon, Web-based, brute force, spoofing, and Mirai) according to the commonalities of the attacks. The whole dataset can also be divided into two classes, malicious traffic and benign traffic. Compared with other datasets, CICIoT-2023 uses more devices to collect traffic in its network topology, which yields more traffic instances and more classes. Furthermore, the imbalance issue is pronounced in this dataset: taking the thirty-four-class task as an example, the largest class accounts for as much as 15.432% of the samples, while the smallest accounts for only 0.003%. The performance of the machine learning baseline models is reported in the original CICIoT-2023 paper and can be used to evaluate the performance of enhancement methods.

Metrics To evaluate the performance of HSS in terms of resource consumption and classification capability, the metrics used in this paper are as follows:

  1. Number of samples: The number of samples can intuitively reflect the compression and simplification of the sampling set generated by HSS. A smaller sample number can accelerate the computation of models and reduce the resource consumption on the hardware.

  2. Machine learning performance indicators: To measure the optimization capability of HSS on the baseline models, we choose four standard indicators for machine learning model evaluation, namely accuracy, recall, precision, and F1-score (Zhang et al. 2022); recall reflects the models' capability of classifying small samples in an imbalanced dataset, and F1-score is used to evaluate the comprehensive performance of the models.

  3. t-SNE downscaled scatterplot: To make the HSS sampling results more intuitive, we use the t-SNE downscaled scatterplot (Arora et al. 2018), a method that projects data from a high-dimensional space into a low-dimensional one by computing the similarity matrix and conditional probabilities of the samples, to visualize the distributions of the original set and the sampling set; a minimal usage sketch follows this list.
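A minimal t-SNE visualization sketch with scikit-learn; X_s and y_s denote the (sub-sampled) HSS sampling set and its labels and are hypothetical names:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, random_state=0).fit_transform(X_s)
codes = pd.Categorical(y_s).codes                # map class labels to color codes
plt.scatter(emb[:, 0], emb[:, 1], c=codes, s=2, cmap="tab10")
plt.title("t-SNE projection of the HSS sampling set")
plt.show()
```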

Performance before and after HSS

Fig. 3 Performance of models before and after HSS in different classification tasks

We evaluate the performance of HSS against the baseline model performance reported in the original CICIoT-2023 paper. Our work compares the performance of the original models and the models with HSS from coarse to fine classification granularity. Namely, the evaluation covers the classification tasks of two categories, eight categories, and thirty-four categories. In two-class classification, we take all the baseline models, including Logistic Regression, Perceptron, AdaBoost, and Random Forest, into comparison. However, in both the eight-class and thirty-four-class classification tasks, Logistic Regression and Perceptron are excluded because they are not designed to handle multi-class classification tasks.

Figure 3 shows the performance of the models before and after HSS in different classification tasks. The multi-class classifiers include AdaBoost (Ada) and Random Forest (RF), and the binary classifiers include Logistic Regression (LR) and Perceptron (Per). Figure 3a illustrates the performance of AdaBoost and Random Forest in the thirty-four-class classification before and after using HSS. After being trained with the sampling set generated by HSS, AdaBoost demonstrates significant improvements across all four metrics: accuracy improves from 0.6079 to 0.9904, recall increases from 0.6077 to 0.8519, precision grows from 0.4796 to 0.7524, and F1-score improves from 0.4735 to 0.7718. Compared to the baseline models, accuracy, recall, precision, and F1-score exhibit improvements of 62.92%, 40.18%, 56.88%, and 62.99%, respectively. For Random Forest, accuracy is almost flat, while recall, precision, and F1-score are slightly improved over the baseline by 6.02%, 6.99%, and 10.19%. The comparison between HSS-enhanced and original models in the eight-class classification is shown in Fig. 3b. Because there are fewer categories, which simplifies the classification task, the performance in the eight-class classification improves moderately compared with the thirty-four-class classification. For AdaBoost, accuracy, recall, precision, and F1-score show improvements of 182.66%, 75.77%, 78.46%, and 128.16%, respectively. For Random Forest, accuracy, precision, and F1-score improve over the baseline by 0.20%, 18.72%, and 19.14%, while recall decreases by 1.09%. Taking the results above into consideration, we conclude that HSS can effectively optimize the performance of machine learning models, at least AdaBoost and Random Forest, in imbalanced IoT malicious multi-class classification tasks. The improvement in recall and F1-score means that the sampling set generated by HSS gives the models better capability in small-sample recognition, enhancing their comprehensive performance. This implies that models with HSS can recognize samples more comprehensively in IoT malicious traffic data, which is significant for today's cybersecurity defense.

Figure 3c, d show the model performance with HSS in the two-class classification task. In contrast to the multi-class tasks, recall improves in the two-class task, while the other metrics exhibit varying degrees of decrease. Especially for Logistic Regression and Perceptron, accuracy and F1-score decrease slightly while precision falls heavily, which demonstrates that HSS does not perform as well in the two-class classification task as in the multi-class tasks. The reason for this phenomenon is the merging of the labels: the seven attack traffic types are all merged into malicious traffic, which alleviates the imbalance in the dataset. Although HSS significantly reduces the number of samples, the resulting small sampling set may not provide sufficient training for the two-class classification models, resulting in a decline in performance. Moreover, in CICIoT-2023, the so-called 2-class classification arises from merging the 34-class labels into a single attack type, meaning that thirty-three different types of malicious traffic are consolidated into one attack category. Given the complexity of the traffic features, although 2-class classification may appear straightforward, the distribution of samples in the high-dimensional feature space remains quite diverse. Therefore, directly subsampling for 2-class classification without considering the original thirty-four classes can lead to a lack of representativeness in the sampled data. However, in real-world situations the specific type of malicious traffic is usually unknown, so to simulate the actual situation we perform the sampling without considering the original 34-class distribution.

Impact of sampling threshold

Fig. 4 Performance under different sampling thresholds in multi-class classification tasks

Fig. 5 Performance under different sampling thresholds in 2-class classification tasks

As an indispensable part of HSS, the threshold setting for undersampling significantly affects the final performance of the method. Figures 4 and 5 show the performance of the models in the different classification tasks under varying sampling thresholds.

As described in the design of the hybrid sampling method, in the undersampling stage we take the class with the smallest number of samples as the reference and multiply its count by the coefficient a to obtain the threshold t. We then decide the number of samples retained for each class based on the relationship between its sample count and t. The purpose of undersampling is to reduce the sample size of the majority classes and narrow the gap between them and the minority classes, which helps alleviate the sample imbalance problem in the dataset. However, an inappropriate threshold causes problems: although a small threshold makes the sampled data balanced, the scale of the sampling set becomes too small to train the models, while a large threshold yields a sampling set with sufficient instances but may lead to new imbalance problems. To find a relatively rational threshold, we define the comprehensive evaluation metric Score, the weighted average of the four metrics, and use it to compare the performance of the baseline models with HSS on the validation sets. To differentiate it from the F1-score, we use WS (weighted score) to denote Score.

Similar to the experiments designed before, in the multi-class classification tasks we focus on the effect of the undersampling threshold on AdaBoost and Random Forest, while in the two-class classification task we add Logistic Regression and Perceptron to the evaluation. Figure 4a, b show that in the thirty-four-class task, the two models' accuracy is almost flat as the threshold changes, while precision, F1-score, and WS all rise with the undersampling threshold. Recall decreases as the sampling threshold increases for AdaBoost and fluctuates slightly for Random Forest; at a threshold of 6 times the minimum sample count, the recall curve also reveals a declining trend. The increase of the two comprehensive metrics, F1-score and WS, slows down as the threshold grows. As Fig. 4c, d show, in the eight-class task the model metrics change in the same way as in the thirty-four-class task; it is noticeable that recall decreases and the upward trend of F1-score and WS slows down. Figure 5 demonstrates that in the two-class task, except for Perceptron, the other three models' metrics show similar trends, although the decline in recall is slight. As for Perceptron, the model exhibits a clear extreme point at a threshold of 5 times the minimum sample count, which can be used as the termination point of sampling.

Summarizing the above, we can see that the undersampling threshold has a relatively large impact on HSS. While recall decreases as the sampling threshold increases, the other metrics increase. The declining recall may reflect that the models become more cautious in selecting positive instances; considering the increasing precision and F1-score, this phenomenon shows that models with HSS perform better than before. Regarding the termination condition of sampling, on the one hand, as in the case of the Perceptron in binary classification, a metric may reach an extreme value as the threshold increases, and this extreme point can serve as the termination point of the sampling threshold. On the other hand, the model may also reach a ceiling as the metrics rise more slowly with increasing thresholds, and the termination point can then be decided based on the trade-off between resource consumption and performance.

Change of sample distribution

Based on the result of undersampling with MTS, the oversampling in HSS makes the distribution of the different sample classes more reasonable and balanced in feature space. To simplify the visualization, we use t-SNE on the eight-class classification sampling set both before and after oversampling. We do not choose the thirty-four-class sampling set because too many labels make the points in the visualization excessively cluttered, and we do not choose the two-class sampling set because of the unsatisfactory results demonstrated above. According to the proportions of the different sample types, an appropriate number of samples is randomly selected from each set for visualization; the results are shown in Fig. 6. The change in sample distribution is also displayed in Table 1. It can be clearly seen that, after HSS, the proportion of small-sample data has increased significantly, and the imbalance of the data has been alleviated to a certain extent.

Fig. 6 Visualization of sample distribution before and after oversampling

Table 1 Change of sample distribution before and after oversampling
Table 2 Shrink of samples after HSS in different classification tasks

As illustrated in Table 2, from the perspective of sample counts, the original training set contains 36,346,418 instances, while the HSS sampling sets for the two-class, eight-class, and thirty-four-class tasks contain 5,984,091, 439,376, and 193,541 instances, respectively. Compared with the original set, the sets processed by HSS shrink by 83.53%, 98.79%, and 99.46%, respectively.

Comparison and generalizability

To further elaborate on the performance of HSS from different aspects, we conducted a cross-sectional comparison and a generalizability evaluation. In the cross-sectional comparison, we assessed HSS alongside two comparable strategies. For the generalizability evaluation, we applied HSS to another imbalanced traffic dataset. Additionally, we employed several deep learning models to assess the sampling sets processed by HSS, confirming its potential applicability to deep learning.

Cross-sectional comparison

We conducted a comparative analysis of HSS against two other sampling methods that also employ undersampling and oversampling to address data imbalance in traffic datasets. Saber et al. (2018) integrate stratified sampling with SMOTE to augment the proportion of small samples, while Liu and Liu (2014) develop undersampling and oversampling methods based on the ratios of the smallest and largest classes, respectively. Random Forest was selected as the evaluation model, with CICIoT-2023 (eight-class classification task) as the dataset. To simplify the exposition, we denote the sampling method of Saber et al. (2018) as SSS (Stratified Sampling with SMOTE) and the sampling method of Liu and Liu (2014) as EVS (Extreme Value Sampling). Table 3 shows that HSS outperforms the other two methods across all four metrics, leveraging its finely tuned sampling threshold strategy.

Table 3 Performance comparison with HSS and other methods

Generalizability evaluation

We applied our sampling method, HSS, to another imbalanced traffic dataset named ISCXVPN2016 (Draper-Gil et al. 2016), which comprises only 8044 VPN application samples divided into 8 categories. Random Forest was also employed as the evaluation model in this segment due to its prevalence in traffic classification tasks. Table 4 demonstrates that HSS can be effective in diverse datasets. However, as the imbalance issue in ISCXVPN2016 is not as severe as in CICIoT-2023, and HSS was not specifically tuned for ISCXVPN2016, the improvement appears to be modest.

Table 4 Performance of HSS in ISCXVPN2016

Potential applicability to deep learning

Deep learning methods are also prevalent in today's network traffic classification tasks. We evaluate the performance of three common deep learning models, CNN, LSTM, and Transformer, with HSS in the eight-class classification task on CICIoT-2023. As illustrated in Table 5, HSS exhibits significant variation in performance across the different deep learning models. For CNN, the dataset processed by HSS leads to a considerable decrease in overall performance compared to the original dataset, although recall still improves notably, by 19.82%. In contrast, for LSTM and Transformer, the dataset processed by HSS leads to an overall enhancement in model performance. In summary, HSS holds certain application potential for deep learning.

Table 5 Performance of HSS in deep learning models

Conclusion

In this paper, we propose HSS, a novel hybrid sampling method that adeptly alleviates the imbalance issue prevalent in IoT malicious traffic data. The method integrates the strengths of the MTS technique, which effectively reduces the number of majority samples based on the minimum sample count and the traffic distribution, and the SMOTENC technique, which augments the quantity of minority samples by generating synthetic instances. To further optimize the sampling process, HSS adjusts the threshold according to the validation results of the various models. Through a careful comparison between models trained with the HSS sampling set and baseline models trained on the original training set, we have corroborated the superior performance of multi-class classification models, including AdaBoost and Random Forest, on the imbalanced data improved by HSS. This paper substantiates that the HSS sampling method not only enhances the performance of classification models but also pioneers a pathway for more efficient and accurate data handling strategies in IoT malicious traffic analysis. However, our work still needs improvement in the two-class classification task on the imbalanced dataset. Our future endeavors will prioritize refining the threshold selection methods within HSS to enhance usability. Additionally, we aim to fine-tune the sampling procedures to extend the applicability of HSS to more intricate tasks in IoT malicious traffic classification.

Availability of data and materials

All datasets used are public datasets.

References

  • Abdi L, Hashemi S (2015) To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng 28(1):238–251

  • Al-Garadi MA, Mohamed A, Al-Ali AK et al (2020) A survey of machine and deep learning methods for internet of things (IoT) security. IEEE Commun Surv Tutor 22(3):1646–1685

  • Alghamdi R, Bellaiche M (2023) An ensemble deep learning based IDS for IoT using lambda architecture. Cybersecurity 6(1):5

  • Arora S, Hu W, Kothari PK (2018) An analysis of the t-SNE algorithm for data visualization. In: Conference on learning theory, PMLR, pp 1455–1462

  • Chawla NV, Bowyer KW, Hall LO et al (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  • Chen Z, Duan J, Kang L et al (2021) A hybrid data-level ensemble to enable learning from highly imbalanced dataset. Inf Sci 554:157–176

  • Cho K, Park J, Oh TW et al (2020) One-sided Schmitt–Trigger-based 9T SRAM cell for near-threshold operation. IEEE Trans Circuits Syst I Regul Pap 67(5):1551–1561

  • Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and smote. Inf Sci 465:1–20

  • Draper-Gil G, Lashkari AH, Mamun MSI, et al (2016) Characterization of encrypted and VPN traffic using time-related features. In: Proceedings of the 2nd international conference on information systems security and privacy (ICISSP), pp 407–414

  • Du Y, Huang H, Sun YE, et al (2021) Self-adaptive sampling for network traffic measurement. In: IEEE INFOCOM 2021-IEEE conference on computer communications, IEEE, pp 1–10

  • Gök EC, Olgun MO (2021) SMOTE-NC and gradient boosting imputation based random forest classifier for predicting severity level of Covid-19 patients with blood samples. Neural Comput Appl 33(22):15693–15707

  • Han H, Wang WY, Mao BH (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing, Springer, pp 878–887

  • Hart P (1968) The condensed nearest neighbor rule (corresp.). IEEE Trans Inf Theory 14(3):515–516

  • Huoh TL, Luo Y, Li P, et al (2022) Flow-based encrypted network traffic classification with graph neural networks. IEEE Trans Netw Serv Manag

  • Liu Q, Liu Z (2014) A comparison of improving multi-class imbalance for internet traffic classification. Inf Syst Front 16:509–521

  • Li IJ, Wu JL, Yeh CH (2018) A fast classification strategy for SVM on the large-scale high-dimensional datasets. Pattern Anal Appl 21:1023–1038

  • Li X, Liang Y, Yan S, et al (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. In: Proceedings of the 24th symposium on principles and practice of parallel programming, pp 229–241

  • Lichy A, Bader O, Dubin R et al (2023) When a RF beats a CNN and GRU, together—a comparison of deep learning and classical machine learning approaches for encrypted malware traffic classification. Comput Secur 124:103000

  • Lotfollahi M, Jafari Siavoshani M, Shirali Hossein Zade R et al (2020) Deep packet: a novel approach for encrypted traffic classification using deep learning. Soft Comput 24(3):1999–2012

  • Mahmud MS, Huang JZ, Salloum S et al (2020) A survey of data partitioning and sampling methods to support big data analysis. Big Data Min Anal 3(2):85–101

  • Mittal S (2016) A survey of techniques for architecting and managing GPU register file. IEEE Trans Parallel Distrib Syst 28(1):16–28

  • Mittal S (2019) A survey on optimized implementation of deep learning models on the NVIDIA Jetson platform. J Syst Archit 97:428–442

  • Mittal S, Vaishay S (2019) A survey of techniques for optimizing deep learning on GPUs. J Syst Archit 99:101635

  • Neto ECP, Dadkhah S, Ferreira R et al (2023) CICIoT 2023: a real-time dataset and benchmark for large-scale attacks in IoT environment. Sensors 23:5941. https://doi.org/10.3390/s23135941

  • Qin J, Han X, Wang C, et al (2023) Network traffic classification based on SD sampling and hierarchical ensemble learning. Secur Commun Netw 2023

  • Rajmohan T, Nguyen PH, Ferry N (2022) A decade of research on patterns and architectures for IoT security. Cybersecurity 5:1–29

  • Rezaei S, Liu X (2019) Deep learning for encrypted traffic classification: an overview. IEEE Commun Mag 57(5):76–81

  • Rust-Nguyen N, Sharma S, Stamp M (2023) Darknet traffic classification and adversarial attacks using machine learning. Comput Secur 127:103098

  • Saber A, Fergani B, Abbas M (2018) Encrypted traffic classification: Combining over-and under-sampling through a PCA-SVM. In: 2018 3rd International conference on pattern analysis and intelligent systems (PAIS), IEEE, pp 1–5

  • Swana EF, Doorsamy W, Bokoro P (2022) Tomek link and smote approaches for machine fault classification with an imbalanced dataset. Sensors 22(9):3246

  • Tang Y, Zhang YQ, Chawla NV et al (2008) SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern Part B (Cybern) 39(1):281–288

  • Wang W, Zhu M, Zeng X, et al (2017) Malware traffic classification using convolutional neural network for representation learning. In: 2017 International conference on information networking (ICOIN), IEEE, pp 712–717

  • Xiaolong X, Wen C, Yanfei S (2019) Over-sampling algorithm for imbalanced data classification. J Syst Eng Electron 30(6):1182–1191

  • Yang L, Finamore A, Jun F et al (2021) Deep learning and zero-day traffic classification: Lessons learned from a commercial-grade dataset. IEEE Trans Netw Serv Manag 18(4):4103–4118

  • Zhang Z, Ning H, Shi F, et al (2022) Artificial intelligence in cyber security: research advances, challenges, and opportunities. Artif Intell Rev 1–25

  • Zhang H, Yu L, Xiao X et al (2023) TFE-GNN: a temporal fusion encoder using graph neural networks for fine-grained encrypted traffic classification. In: Proceedings of the ACM web conference, vol 2023, pp 2066–2075

  • Zhou Y, Cheng G, Jiang S et al (2020) Building an efficient intrusion detection system based on feature selection and ensemble classifier. Comput Netw 174:107247

Acknowledgements

The authors would like to thank the reviewers for their constructive comments that have greatly improved the paper.

Funding

Not applicable.

Author information

Contributions

YL contributed to the conception and design of the study, data collection, analysis, and manuscript preparation. JT is the corresponding author. YZ gave some advice in experiments. YX helped to polish the writing.

Corresponding author

Correspondence to Jun Tao.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Luo, Y., Tao, J., Zhu, Y. et al. HSS: enhancing IoT malicious traffic classification leveraging hybrid sampling strategy. Cybersecurity 7, 11 (2024). https://doi.org/10.1186/s42400-023-00201-9

