PUMD: a PU learning-based malicious domain detection framework

Fan, Zhaoshan; Wang, Qing; Jiao, Haoran; Liu, Junrong; Cui, Zelin; Liu, Song; Liu, Yuling

doi:10.1186/s42400-022-00124-x

Research
Open access
Published: 01 October 2022



Cybersecurity volume 5, Article number: 19 (2022)

Abstract

The domain name system (DNS), one of the most critical pieces of internet infrastructure, has been abused by various cyber attacks. Current malicious domain detection capabilities are limited by insufficient credible label information, severe class imbalance, and the incompact distribution of domain samples across different malicious activities. This paper proposes a malicious domain detection framework named PUMD, which introduces Positive and Unlabeled (PU) learning to address the insufficient label information, adopts customized sample weights to mitigate the impact of class imbalance, and constructs evidence features based on resource overlapping to reduce the intra-class distance of malicious samples. In addition, a feature selection strategy based on permutation importance and binning is proposed to screen the most informative detection features. Finally, we conduct experiments on the open-source real DNS traffic dataset provided by QI-ANXIN Technology Group to evaluate the PUMD framework’s ability to capture potential command and control (C&C) domains for malicious activities. The experimental results show that PUMD achieves the best detection performance under different label frequencies and class imbalance ratios.

Introduction

As an important technical support of the modern internet, DNS provides the service of mapping domain names to the IP address space. While providing users with convenient network services, domains have also been widely abused in malicious attacks, such as malware distribution, C&C communication, botnet control, phishing and spam.

Cyber attack activities often require malicious domains as core resources to carry the communication between the infected host and the attacker. Detecting malicious domains therefore helps to block malicious activities promptly and trace the source of attacks. To make the entire malicious infrastructure more robust, attackers often adopt resilient communication technologies such as Domain-flux or IP-flux. As a result, malicious domains differ from benign domains in their character-level composition and communication traffic. In addition, because attackers have limited budgets, they often register malicious domain names in batches or reuse malicious resources, so malicious domains can also be captured through domain registration information and the associations among domains. In this paper, we focus on detecting active malicious domains that carry communication in cyber attack activities, and capture them by analyzing character-level, traffic and registration characteristics, as well as the resource overlap of domains.

Most malicious domain detection solutions fall into three categories: black/white list-based methods (Sato et al. 2012; Kang and Lee 2007; Cao et al. 2008), knowledge-based methods (Choi et al. 2007; Morales et al. 2009; Prieto et al. 2011; Villamarín-Salomón and Brustoloni 2009), and machine learning-based methods (Schiavoni et al. 2014; Yan et al. 2019; Sun et al. 2019; Shi et al. 2018; Liu et al. 2018; Wang et al. 2020; Tran et al. 2018; Tong et al. 2016; Du Peng 2020; Ma et al. 2014). Because malicious infrastructure adopts highly dynamic domain-to-IP mapping to provide flexible and resilient communication, static black/white list protection strategies are no longer applicable. Constructing malicious domain detection rules based on empirical knowledge is labor-intensive, and such rules are easily bypassed. The mainstream detection methods therefore use machine learning to automatically learn complex detection models. The main problems faced by machine learning solutions include:

Insufficient credible label information: Due to the high cost of manual labeling, labeled samples mostly come from public blacklists (The Spamhaus Project Ltd 2021; Phishtank 2021; Andre Correa 2021; SURBL.ORG 2021) and popular domain lists (ALEXA-INTERNET 2021), which are insufficient and unreliable. Specifically, Stevanovic et al. (2015) cross-checked the Alexa Top 1M sites, which are often used as benign domains, and confirmed that about 15% of these popular domains had appeared in at least one blacklist, which makes the labeling of benign domains more challenging.
Class imbalance: Malicious attack activities are hidden in voluminous normal DNS communications, so a data set collected from real DNS traffic has a severe imbalance between malicious and benign domains. Training directly on imbalanced data makes the classifier pay more attention to the majority class and weakens its ability to describe the minority malicious domains.
Incompact distribution of domains adopted by different malicious activities: The abnormal characteristics of malicious domains, such as special communication modes or abnormal character-level combinations, may differ from activity to activity, because different malicious activities may adopt different communication technologies and domain name composition techniques. Therefore, malicious domain samples of different activities are loosely distributed in feature space, which affects classification performance.

Considering the above problems, we propose a malicious domain detection framework named PUMD. The specific solutions and contributions of this paper are summarized as follows:

Propose a malicious domain detection framework based on PU learning. This framework alleviates the problem of insufficient credible label information by using only a small number of labeled malicious domains to train the classifier, and mitigates the impact of class imbalance by customizing sample weights to construct a cost-sensitive objective function.
Construct detection features from two perspectives: the single domain and the domain association. In particular, we propose novel evidence features based on resource overlapping association to make the distribution of malicious domain samples more compact. We also introduce a feature selection strategy based on permutation importance and binning to enhance the characterization capabilities of the feature set.
Compare PUMD with common machine learning methods and existing works on an open-source realistic imbalanced data set. Experiments show that PUMD has the best detection performance and remains robust under different label frequencies and class imbalance ratios.

The rest of this paper is organized as follows. The “Background and motivation” section introduces related work and explores the suitability of PU learning for malicious domain detection, the “Proposed method” section describes PUMD’s framework and technical details, and the “Experiments” section presents the experiments and results. We discuss the advantages of PUMD over existing works and future work in the “Discussion and future work” section, and summarize our work in the “Conclusion” section.

Background and motivation

Malicious domain detection

This section briefly describes related malicious domain detection works and clarifies how PUMD improves on existing solutions. A comprehensive comparison is made in the “Discussion and future work” section.

Insufficient credible label information

Some solutions are designed to reduce the required label information. Schiavoni et al. (2014) and Yan et al. (2019) both filter first and then cluster to reduce the required label information. In addition, HinDom (Sun et al. 2019) adopts metapath-based transductive classification on a heterogeneous information network of malicious domain communication associations, which reduces the proportion of initially labeled samples. However, these works do not address the unreliability of using popular domains as benign domains. Other solutions set rules to filter out benign domains with low confidence. For example, Shi et al. (2018) require a labeled benign domain to persist in the Alexa Top 10K for three months. Such solutions limit the benign training samples of the detection model to popular domains, and the model therefore lacks the ability to characterize unpopular benign domains. PUMD adopts a PU learning model to address the insufficient credible label information: only a small number of labeled malicious domains is required, and credible benign domains are screened from a wide range of unlabeled domain names. While reducing the required label information, PUMD also enhances the ability to characterize benign domains.

Class imbalance

The main solutions can be divided into two categories: (1) Data-level solutions, typically based on data resampling. Liu et al. (2018) undersample the majority benign domains, which loses part of the label information, while KSDom (Wang et al. 2020) oversamples the labeled malicious domains, which increases the risk of overfitting because the oversampled instances are similarly constructed. (2) Algorithm-level solutions, typically based on cost-sensitive methods. Tran et al. (2018) adopt a cost-sensitive LSTM to alleviate the influence of class imbalance. However, considering the high computational cost of the cost matrix, they set a fixed class weight, which requires prior knowledge of the class imbalance ratio and is therefore not universal. PUMD adopts a customized sample weighting method, which sets a weight for each unlabeled sample according to the confidence with which it is labeled as benign, and constructs a cost-sensitive objective function to address class imbalance.
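To make the idea of a cost-sensitive objective with per-sample weights concrete, the following is a minimal sketch of a weighted binary log loss. It is not PUMD's actual objective: the weighting rule (labeled malicious samples get weight 1.0, unlabeled samples get a hypothetical benign-confidence value) and the function names `sample_weights` and `weighted_log_loss` are assumptions made for illustration.

```python
import math

def sample_weights(labels, benign_confidence):
    """Assumed scheme: labeled malicious samples (y == 1) keep weight 1.0;
    unlabeled samples treated as benign (y == 0) use a confidence in [0, 1]."""
    return [1.0 if y == 1 else c for y, c in zip(labels, benign_confidence)]

def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Weighted binary cross-entropy, normalized by the total weight."""
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)

# One labeled malicious domain and two unlabeled domains treated as benign.
labels = [1, 0, 0]
probs = [0.9, 0.1, 0.9]          # predicted probability of "malicious"
confidence = [1.0, 0.95, 0.2]    # hypothetical benign-confidence values

uniform_loss = weighted_log_loss(labels, probs, [1.0, 1.0, 1.0])
weighted_loss = weighted_log_loss(labels, probs, sample_weights(labels, confidence))
```

In this toy example the third sample is predicted malicious but carries low benign confidence, so it contributes less to the weighted loss than to the uniform one. The classifier is penalized less for disagreeing with an uncertain "benign" label.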

Incompact distribution of domain samples in different malicious activities

A cluster of malicious domains with similar abnormal characteristics can be called a “family”, and there are three division schemes: DGA, malware, and malicious activity. DGA families are identified by their DGA algorithms and are mostly divided by character similarity (Schiavoni et al. 2014; Tong et al. 2016; Du Peng 2020). Domain families determined by malware type are detected through the similarity of malicious resources, such as the IP address or NS server (Ma et al. 2014). Finally, a malicious activity requires the coordination of different malicious domains; for example, a spam distribution activity involves botnet C&C domains, malicious resource downloading domains, and spam content providing domains. Such an activity can be detected based on high co-occurrence probability and strong communication correlation (Sun et al. 2019). Existing binary classification works have not yet considered the incompact distribution of malicious domain samples. PUMD adopts novel evidence features based on resource overlapping association to increase inter-class distance and reduce intra-class distance.

PU learning

PU learning is a semi-supervised learning technique that builds a binary classifier from positive samples and unlabeled samples in order to predict the labels of unlabeled samples. It is suitable for binary classification problems in which one category of data is impure or only one category of label is available. Early works (Lee and Liu 2003; Elkan and Noto 2008) have shown that PU learning can reach the performance of standard supervised learning. Because PU learning requires only one category of label information and performs well, it has aroused widespread interest in the field of machine learning and has been applied in practical scenarios such as knowledge base completion, medical diagnosis, and financial risk control.

In recent years, the cyber security field has also introduced PU learning to solve a series of security threat discovery problems (Zhang et al. 2017; Sun et al. 2017; Luo et al. 2018; Wu et al. 2019; Dhamnani et al. 2021). A typical application is the malicious URL attack detection system POSTER (Zhang et al. 2017). Because the composition of URLs is highly dynamic, it is difficult to manually label a large number of URL requests. POSTER therefore combines two-step and cost-sensitive PU learning strategies, and uses a small number of malicious URLs and a large number of unlabeled URLs to train binary classifiers that help network security engineers discover potential attack patterns. PU learning has also been applied to malware detection, where researchers usually take apps downloaded from trusted sources as benign samples. Sun et al. (2017) focus on attackers who publish malware on trusted sources, and propose PUDROID, an Android malware detection system based on PU learning. Their experiments show that PUDROID can find nearly 100% of the mixed-in malware.

Motivation

In this paper, we try to address an anomaly detection problem where the given dataset has the following three characteristics:

The dataset has only one class of credible labels.
The number of credibly annotated samples is rather small.
The dataset has severe class imbalance.

Current malicious domain detection works mostly adopt supervised, unsupervised and traditional semi-supervised learning schemes. However, since both supervised and traditional semi-supervised learning schemes require label information for two classes, popular domains must still be used as benign domains, which introduces noise into the labeled benign samples. Even if the malicious domains mixed into the popular domains can be eliminated by filtering rules, these schemes still limit the benign training samples to popular domains, resulting in poor modeling ability for non-popular benign domains. Besides, unsupervised learning schemes waste the guidance provided by the one class of credible labels. We therefore conclude that supervised, traditional semi-supervised and unsupervised learning schemes are not suitable for the scenario of this paper. In recent years, PU learning schemes have proven effective when one class of data is impure or unavailable, which matches the characteristics of the dataset described above. Therefore, we adopt a PU learning scheme to detect malicious domains.

We further discuss why PU learning is effective for malicious domain detection. As stated in the “Introduction” section, constructing a benign domain ground truth is difficult, because we cannot confirm that a domain is benign even if it never appears in any known blacklist. In contrast, a batch of high-quality malicious domain labels can be obtained from public blacklists or from analysis by security researchers. This matches the scenario for which PU learning is intended, in which one category of label information is hard to obtain. Therefore, we can use known malicious domains as labeled positive samples and the domains to be detected as unlabeled samples, and obtain a malicious domain detection model through PU learning. This addresses the problem of insufficient credible label information from the perspective of how the labeled dataset is composed.
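The positive/unlabeled setup described above can be illustrated with a minimal two-step sketch. It is a toy under stated assumptions, not the PUMD pipeline: step one here uses distance from the positive centroid as a stand-in for a real anomaly detector, step two uses a nearest-centroid rule as a stand-in classifier, and all function names and feature vectors are hypothetical.

```python
import math

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reliable_negatives(positives, unlabeled, fraction=0.5):
    """Step one (stand-in): keep the unlabeled rows farthest from the
    centroid of the labeled positives as reliable negatives."""
    c_pos = centroid(positives)
    ranked = sorted(unlabeled, key=lambda row: distance(row, c_pos), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

def train_nearest_centroid(positives, negatives):
    """Step two (stand-in): classify by the closer of the two class centroids."""
    c_pos, c_neg = centroid(positives), centroid(negatives)
    def predict(row):
        return 1 if distance(row, c_pos) < distance(row, c_neg) else 0
    return predict

# Hypothetical 2-D feature vectors: known malicious domains and unlabeled domains.
positives = [[0.9, 0.8], [1.0, 0.9], [0.8, 1.0]]
unlabeled = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.1],
             [0.15, 0.05], [0.95, 0.85], [0.85, 0.9]]

rn = reliable_negatives(positives, unlabeled, fraction=0.5)
predict = train_nearest_centroid(positives, rn)
flagged = [row for row in unlabeled if predict(row) == 1]
```

No benign label is ever supplied: the negatives used for training are inferred from the unlabeled pool, and the two unlabeled rows that sit next to the known malicious samples end up in `flagged`.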

Proposed method

This section introduces the framework and technical details of PUMD. As shown in Fig. 1, there are three main modules:

“Feature Extraction” module: Extracts four types of typical features from two perspectives, the single domain and the associations between domains, and proposes a novel evidence feature to alleviate the influence of incompactly distributed malicious domains on detection performance.
“Feature Selection” module: Performs a feature selection strategy based on permutation importance and binning to find the most discriminative and informative feature subset among the many original features, which enhances the characterization capabilities of the feature set.
“Two-Step PU Learning” module: Performs two-step PU learning to address insufficient credible label information and class imbalance. Step one obtains reliable negative (RN) samples based on the iForest algorithm, and step two customizes sample weights and trains the classifier on a cost-sensitive objective function.
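As a rough illustration of the permutation importance idea used by the “Feature Selection” module, the sketch below shuffles one feature column at a time and measures the drop in accuracy. It covers only the generic permutation importance computation, not the binning-based search, and the toy model, data, and function names are assumptions made for this example.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model that only looks at feature 0; feature 1 is pure noise.
def model(row):
    return 1 if row[0] > 0.5 else 0

X = [[0.9, 0.3], [0.8, 0.7], [0.7, 0.1], [0.95, 0.6],
     [0.1, 0.4], [0.2, 0.9], [0.3, 0.2], [0.05, 0.8]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

importances = permutation_importance(model, X, y)
```

Because the toy model ignores feature 1, shuffling it never changes the predictions and its importance is exactly zero. A feature selection step could then rank features by these scores and drop the least informative ones.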

Feature extraction

Existing works either extract local detection features from the character composition, the traffic behavior or the auxiliary information of a single domain (Shi et al. 2018; Liu et al. 2018; Schüppen et al. 2018; Almashhadani et al. 2020; Antonakakis et al. 2010; Huang et al. 2010), or extract global detection features from the perspective of correlation and resource overlapping among domain names (Sun et al. 2019; He et al. 2019). We combine the two ideas: we extract character, traffic, and whois features from the single domain, and propose a novel evidence feature based on resource overlapping association that captures the association between the domains to be detected and the known malicious domain families. This section introduces these four types of malicious domain identification features and explains why they can characterize malicious nature.
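One simple way to picture a resource-overlap association is sketched below. It is purely illustrative: the paper's actual evidence feature is defined in its “Evidence feature” section, and the Jaccard measure, the resource sets, and the names `jaccard` and `evidence_scores` are assumptions introduced here.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def evidence_scores(domain_resources: set, family_resources: dict) -> dict:
    """Overlap between one domain's resources and each known malicious family."""
    return {family: jaccard(domain_resources, resources)
            for family, resources in family_resources.items()}

# Hypothetical resources (IP addresses, name servers) seen for known families.
families = {
    "family_a": {"198.51.100.7", "198.51.100.8", "ns1.bad.example"},
    "family_b": {"203.0.113.5"},
}

# Resources observed for the domain under test.
observed = {"198.51.100.7", "ns1.bad.example", "192.0.2.1"}
scores = evidence_scores(observed, families)
```

A domain that shares an IP address and a name server with a known family receives a high overlap score for that family, which is the kind of signal that could pull related malicious domains closer together in feature space.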

Character feature

Character features distinguish benign and malicious domains from the perspective of the character-level composition of the domain name; see Table 1. Structural features focus on the structural attributes of the domain name. Since most short domain names have already been registered, attackers usually increase the domain name length and the number of subdomains to obtain a larger domain name structure space (Shi et al. 2018; Liu et al. 2018), so we extract two structural features: domain name length (F1) and number of subdomains (F2). Linguistic features capture deviations in the language pattern of domain names. Considering that malicious domains are often neither memorable nor readable, we extract five commonly used statistics (Schüppen et al. 2018; Almashhadani et al. 2020) to analyze the randomness of domain name characters: the number of special characters (F3), the number of numeric characters (F4), the conversion frequency between numeric and alphabetic characters (F5), the number of dictionary words (F6), and the number of unique length dictionary words (F7).
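The seven character features can be sketched roughly as follows. The exact definitions used in the paper are not reproduced here, so several choices are assumptions: subdomains are taken as the labels left of the last two, "special" characters are anything other than letters, digits, and dots, dictionary words are matched as substrings against a tiny stand-in word list, and F7 is read as the number of distinct lengths among the matched words.

```python
import re

# Tiny stand-in word list; a real system would load a full dictionary.
DICTIONARY = {"mail", "shop", "news", "bank", "cloud", "secure", "login"}

def character_features(domain: str) -> dict:
    """Illustrative versions of F1-F7 for a single domain name."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    # Assumption: labels left of the registered domain (last two labels).
    subdomains = labels[:-2]
    # Assumption: special = not a letter, digit, or dot.
    special = sum(1 for ch in name if not (ch.isalnum() or ch == "."))
    digits = sum(1 for ch in name if ch.isdigit())
    # Count adjacent letter/digit or digit/letter pairs.
    switches = sum(
        1 for a, b in zip(name, name[1:])
        if (a.isdigit() and b.isalpha()) or (a.isalpha() and b.isdigit())
    )
    # Alphabetic runs, ignoring the top-level label.
    runs = re.findall(r"[a-z]+", ".".join(labels[:-1]))
    words = [w for run in runs for w in DICTIONARY if w in run]
    return {
        "F1_length": len(name),
        "F2_subdomains": len(subdomains),
        "F3_special_chars": special,
        "F4_digits": digits,
        "F5_digit_letter_switches": switches,
        "F6_dictionary_words": len(words),
        "F7_unique_word_lengths": len({len(w) for w in words}),
    }

features = character_features("secure-login1.mail.example.com")
```

For this example domain the sketch counts two subdomains, one hyphen, one digit, and one letter-to-digit switch. It finds three dictionary words ("secure", "login", "mail") of three distinct lengths.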

Table 1 Summary of character features
