Sensitive system calls based packed malware variants detection using principal component initialized MultiLayers neural networks
© The Author(s) 2018
- Received: 26 April 2018
- Accepted: 13 August 2018
- Published: 10 September 2018
Malware detection has become mission-critical as malware threats spread from computer systems to Internet of Things systems. Modern malware variants are generally equipped with sophisticated packers, which allow them to bypass modern machine learning based detection systems. To detect packed malware variants, there are two choices: unpacking techniques and dynamic malware analysis. However, unpacking techniques are not always useful, since some packers, such as private packers, are hard to unpack. Although dynamic malware analysis can capture the running behaviours of executables, the unpacking behaviours of packers add noise to the real behaviours of executables, which has an adverse effect on accuracy. To overcome these challenges, in this paper we propose a new method which first extracts a series of system calls that are sensitive to malicious behaviours, then uses principal component analysis to extract features of these sensitive system calls, and finally adopts multi-layers neural networks to classify the features of malware variants and legitimate ones. Theoretical analysis and real-life experimental results show that our packed malware variants detection technique is comparable with the state-of-the-art methods in terms of accuracy. Our approach achieves a detection accuracy of more than 95.6% and a classification time cost of 0.048 s.
- Malware variants
- Multi-layers neural networks
- Principal component analysis
- Sensitive system calls
- Sophisticated packers
Malware is one of the major Internet security threats today. Anti-detection mechanisms such as code morphism cause malware to evolve into many variants, which makes signature based detection schemes perform poorly. Detecting malware variants improves signature based detection methods. In recent years, researchers have focused on detecting malware variants by using machine learning methods, which transform the malware variants detection problem into a program similarity searching problem. When a new program is sufficiently similar to any signatured malicious program in a training data set, the program is flagged as malicious.
Malware analysis includes two kinds of approaches: static analysis and dynamic analysis. Some studies (Santos et al. 2011; Cesare et al. 2014; Nataraj et al. 2011; Zhang et al. 2016a; Zhang et al. 2016b; Yang et al. 2015; Raman et al. 2012) propose to use static analysis, which extracts features such as operation codes and control flow graphs from binaries without actually executing programs, to detect malware variants. However, when malware variants have been packed, the packing prevents further analysis by disassembly tools, synthesis tools and other static analysis tools.
Modern malware variants are generally equipped with sophisticated packers such as ASPack (2017), ASProtect (2017), UPX (2017), VMProtect (2017), ZProtect (2017), etc., which allow the malware variants to bypass traditional and modern detection systems. These packers fall into two kinds: encryption packers and compression packers. Both work by taking an existing application, packing it, and then wrapping an unpacking utility around it; the unpacking utility unpacks the inner executable in memory and transfers execution to it. The problem lies in the fact that there is nothing inherently malicious about a packer or unpacking code (Treadwell et al. 2009). When the packers are ignored, it is hard to determine whether an executable is malicious, because the encryption or compression of the executable prevents detection systems from obtaining the original features, especially for static analysis.
This situation forces researchers to adopt unpacking techniques or dynamic malware analysis to detect packed malware variants. However, some challenges remain. On the one hand, some researchers prefer to unpack packed programs and then detect the unpacked ones. But unpacking techniques are not always useful, since crackers can write private packers which are hard to unpack. On the other hand, other studies (Zhang et al. 2016c; Huang et al. 2014; Xu et al. 2016; Kumar et al. 2012; Konrad et al. 2011; Bai et al. 2014; Santos et al. 2013) prefer to use dynamic analysis, which monitors the runtime interactions between the operating system and programs in sandboxes or virtual machines to collect features such as system calls and network traffic. Although dynamic analysis can obtain the running behaviours of a packed executable, these behaviours include not only the original behaviours but also the behaviours of the packers, which obfuscate the original behaviours. The existing methods do not take the obfuscation caused by the behaviours of packers into consideration.
To overcome these challenges, in this paper we propose a novel approach which can detect packed malware variants without an unpacking process. Since dynamic analysis can capture running behaviours, we obtain a sequence of system calls by monitoring system interactions in a sandbox.
Several recent works address system call based analysis. Some of them use n-grams to represent the temporal sequential relationships of system calls and adopt classifiers to distinguish malicious executables from legitimate ones (Konrad et al. 2011; Canzanese et al. 2015).
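As an illustration of the n-gram representation mentioned above, the following minimal sketch counts sliding windows over a system call trace and normalizes them to relative frequencies. The function name `ngram_frequencies` and the call names in the toy trace are illustrative assumptions and are not taken from the cited works.

```python
from collections import Counter

def ngram_frequencies(calls, n=2):
    """Relative frequency of every length-n window in a system call trace."""
    grams = [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}

# Toy trace; the call names are placeholders chosen for illustration only.
trace = ["NtOpenFile", "NtReadFile", "NtOpenFile", "NtReadFile", "NtClose"]
freqs = ngram_frequencies(trace, n=2)
```

The resulting dictionary can be flattened into a fixed-order vector before it is handed to a classifier.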
However, to detect packed malware variants with these system calls, we have to address several challenging problems. One challenge is that the system calls of packers obfuscate the original distribution and hide the real malicious intention. In addition, as a high level representation of executables, system calls are coarse-grained and sparse, which leads to poor generalization of features. What’s more, this aggravates the obfuscation problem caused by packers.
Since the system calls of malware variants in the same family share similar distributions, and there is a significant difference between the distributions of malware and benign executables (Jang et al. 2015), some system calls are used more often in malware variants. We propose to extract a series of sensitive system calls, embed their frequencies into a vector and adopt a deep learning method to solve these problems. Some recent studies have also used deep learning for vulnerability or malware detection and achieved better accuracy (Li et al. 2018; Kolosnjaji et al. 2016). We first extract a series of system calls which are more sensitive to malicious behaviours, based on information entropy theory. We call these sensitive system calls; using them reduces the degree of obfuscation. We then embed the sensitive system calls into a vector by using their occurrence frequencies, and this vector is later sent to a neural network for training or classification. Next, we use multi-layers neural networks to train a model. Finally, we use the model to detect and classify malware variants.
However, since such multi-layers neural networks suffer from problems such as vanishing gradients and distributed representation, it is necessary to improve their convergence ability to achieve better performance. We propose a principal component initialized multi-layers neural networks method to accelerate the convergence rate and improve the accuracy rate. The principal component initialization transforms the sensitive system calls into a few new column vectors which are linear combinations of the system calls. These new column vectors are linearly independent, which reduces the computational complexity and accelerates the convergence rate.
To reduce the obfuscation caused by packers, we use information gain to learn, from unpacked instances, a series of system calls which are more sensitive to malicious behaviours; this requires no knowledge of unpacking.
To cope with the sparse representation of sensitive system calls, we propose our principal component initialized multi-layers neural networks as an efficient and effective classifier to distinguish packed malicious variants from packed legitimate ones.
The experimental results demonstrate that our approach achieves a detection accuracy of 95.6% and a classification time cost of 0.048 s. What’s more, the evaluation results show that our approach achieves a very low false positive rate, which means that it seldom misclassifies packed benign instances.
The remainder of this paper is organized as follows. Section “Methodology” presents our packed malware variants detection technique. Section “Experiments” shows the experimental results and Section “Related Works” introduces the related works. Section “Limitations” and Section “Conclusions” present the limitations and conclusions.
In this paper, we transform the packed malware variants detection problem into a system call classification problem. To reduce the obfuscation caused by packers, we first extract sensitive system calls and discard obfuscated system calls. We then organize these sensitive system calls as a vector which is later sent to our neural networks. As system calls are a coarse-grained and sparse representation of executables, they lead to poor training approximation and feature generalization. We therefore propose our principal component initialized multi-layers neural networks to efficiently and effectively train on these sparse vectors and detect malicious instances.
Information gain based sensitive system calls extraction
We obtain the system calls of executables by monitoring their running behaviours in the Cuckoo sandbox. As modern malicious executables are generally equipped with sophisticated packers, the system calls we obtain contain not only the system calls of the original executables but also the system calls of packers, which obfuscate the distribution of the original system calls. This limits the detection accuracy. To retain detection accuracy, we first reduce the obfuscation from packers by extracting sensitive system calls. We begin with a definition of sensitive system calls.
The sensitive system calls are the subset of system calls which occur with high frequency in unpacked malicious executables but not in unpacked legitimate ones.
This insight is based on an important observation: the average distribution of sensitive system calls in unpacked malicious executables is nearly the same as that in packed ones, which implies that the sensitive system calls occur with low frequency in packers. Based on this deduction, we use the sensitive system calls as the representation of malicious executables.
The values p(xj|fk) are sent as inputs to our principal component initialized multi-layers neural networks to detect malicious executables.
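To make the extraction step more concrete, the sketch below ranks system calls by a standard information gain score and embeds a trace as a vector of occurrence frequencies. It is only an illustration under stated assumptions: the binary feature "the call occurs in the trace", the extra filter that keeps calls seen more often in malware than in benign traces, the function names and the toy traces are all assumptions, and the sketch does not reproduce the exact formulation of p(xj|fk) used in this paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(traces, labels, call):
    """Gain of the binary feature 'call occurs in the trace' w.r.t. the label."""
    present = [y for t, y in zip(traces, labels) if call in t]
    absent = [y for t, y in zip(traces, labels) if call not in t]
    n = len(labels)
    conditional = len(present) / n * entropy(present) + len(absent) / n * entropy(absent)
    return entropy(labels) - conditional

def select_sensitive_calls(traces, labels, top_k, positive="malware"):
    """Keep calls seen more often in malware, ranked by information gain."""
    def rate(call, in_positive):
        group = [t for t, y in zip(traces, labels) if (y == positive) == in_positive]
        return sum(call in t for t in group) / len(group) if group else 0.0
    vocabulary = sorted({c for t in traces for c in t})
    candidates = [c for c in vocabulary if rate(c, True) > rate(c, False)]
    ranked = sorted(candidates, key=lambda c: information_gain(traces, labels, c), reverse=True)
    return ranked[:top_k]

def frequency_vector(trace, sensitive_calls):
    """Occurrence frequency of each sensitive call within one trace."""
    counts = Counter(trace)
    total = len(trace)
    return [counts[c] / total if total else 0.0 for c in sensitive_calls]

# Toy unpacked traces; call names and labels are hypothetical.
traces = [
    ["WriteProcessMemory", "CreateRemoteThread", "NtClose"],
    ["WriteProcessMemory", "NtClose"],
    ["NtReadFile", "NtClose"],
    ["NtReadFile", "NtClose", "NtClose"],
]
labels = ["malware", "malware", "benign", "benign"]
sensitive = select_sensitive_calls(traces, labels, top_k=2)
vector = frequency_vector(traces[0], sensitive)
```

In this toy data, a call that appears in every trace carries no information gain and is dropped, which mirrors the intent of discarding calls that do not separate malicious from legitimate executables.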
Principal component initialized multi-layers neural networks for malware detection
Having extracted the sensitive system calls, we now discuss how to detect packed malware variants by using our principal component initialized multi-layers neural networks.
As efficient classifiers, neural networks are widely used for classification in many fields, such as image recognition and natural language processing. In this paper, we use neural networks to classify malicious and legitimate executables. Multi-layers neural networks (Fernándezcaballero et al. 2003; Esmaily et al. 2015; Salai Selvam et al. 2011; Salcedo Parra et al. 2014), as one family of deep learning methods, achieve a faster convergence rate and a higher accuracy rate than single hidden layer neural networks, but also bring some drawbacks, such as vanishing gradients and over-fitting. To overcome these drawbacks and further improve the convergence rate and accuracy rate, we propose our principal component initialized multi-layers neural networks.
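The principal component step can be illustrated with a small, dependency-free sketch of principal component analysis: it centres the frequency vectors, builds the covariance matrix, finds leading components by power iteration with deflation, and projects a vector onto them. How the components are then used to initialize the network is not shown, and the function names, the power iteration method and the toy data are assumptions made for illustration only.

```python
import math

def covariance_matrix(rows):
    """Column means and sample covariance matrix of a list of feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return means, cov

def top_component(cov, iterations=200):
    """Leading eigenvalue and eigenvector of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

def principal_components(rows, k):
    """First k principal directions, removing each found direction in turn."""
    means, cov = covariance_matrix(rows)
    d = len(means)
    components = []
    for _ in range(k):
        eigenvalue, v = top_component(cov)
        components.append(v)
        # Deflation: subtract the variance explained by the found direction.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    return means, components

def project(row, means, components):
    """Coordinates of one row in the principal component basis."""
    return [sum((row[j] - means[j]) * comp[j] for j in range(len(row)))
            for comp in components]

# Toy data lying on the line y = 2x, so one component explains all variance.
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means, components = principal_components(rows, k=1)
reduced = project([3.0, 6.0], means, components)
```

Each projected coordinate is a linear combination of the original features, which matches the description of the new column vectors as linear combinations of the system calls.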
After the training phase, we obtain a set of parameters with which the neural networks are equipped to classify packed malware variants. The output is a vector of two confidence values, which represent the probability of malware and of benign respectively. When the confidence value of malware is large enough, we deem the detection sufficiently reliable and treat the target instance as a malware variant for the next retraining. We use a copy of our neural networks to retrain and generate new parameters, with which the current neural networks will be equipped. To avoid poisonous data attacks from crackers, we first prepare a set of already known test cases and then use these cases to test the retrained neural networks. We equip the current neural networks with the retrained parameters only if the testing accuracy does not suddenly drop.
In this section, we present several real-life experiments to show the performance of our approach. We first present the experiment setup, the data set and the cross validation of our approach. Then we present several state-of-the-art methods for comparison. Finally, we give the differential analysis, convergence process analysis, accuracy evaluation and time cost evaluation of our approach.
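The retraining safeguard described above can be sketched as a simple acceptance test. This is a minimal illustration, not the implementation used in this paper: the confidence threshold of 0.9, the tolerated accuracy drop of 0.02, the function names and the toy models are all hypothetical values chosen for the example.

```python
def select_for_retraining(samples, confidences, threshold=0.9):
    """Keep samples whose malware confidence is high enough (threshold is assumed)."""
    return [s for s, (p_malware, _p_benign) in zip(samples, confidences)
            if p_malware >= threshold]

def accuracy(predict, cases):
    """Fraction of (features, label) cases that predict() labels correctly."""
    return sum(1 for x, y in cases if predict(x) == y) / len(cases)

def accept_retrained(current_predict, retrained_predict, known_cases, max_drop=0.02):
    """Accept retrained parameters only if accuracy on known cases does not drop sharply."""
    baseline = accuracy(current_predict, known_cases)
    candidate = accuracy(retrained_predict, known_cases)
    return candidate >= baseline - max_drop

# Toy stand-ins for the current model and for a model retrained on poisoned data.
known_cases = [([0.9], "malware"), ([0.1], "benign"), ([0.8], "malware"), ([0.2], "benign")]
current = lambda x: "malware" if x[0] > 0.5 else "benign"
poisoned = lambda x: "benign"

chosen = select_for_retraining(["a.exe", "b.exe"], [(0.95, 0.05), (0.60, 0.40)])
keep_current = not accept_retrained(current, poisoned, known_cases)
```

Keeping the known test cases separate from the newly detected samples is what allows a poisoned retraining run to be rejected before its parameters replace the current ones.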
Experiment setup, data set and validation
We implement our approach on one computer. The CPU is an i5-6500 @ 3.20 GHz, the RAM is 16.0 GB, and the operating system is Windows 7. Our approach is developed in the Java programming language with JRE 1.6.
To validate our approach, we use two different data sets to test our malware variant detection methodology: one for training and the other for detection. The training data set includes 3167 unpacked malware executables and 2894 unpacked benign executables. The detection data set includes 2083 packed malware variants and 1986 packed benign instances. We use several packers, such as ASPack, ASProtect, UPX, VMProtect, Armadillo and ZProtect, to pack the malware variants and benign instances.
The malware data set
The benign data set
To evaluate the performance of our approach, we use k-fold cross validation in our experiments. For each group of experiments, the training data set is split into 10 groups. For each group, we randomly select 2000 unpacked malware executables and 2000 benign executables for training, and the 2083 packed malware variants and 1986 packed benign instances are used for classification and detection. The metrics we use for evaluation include classification accuracy, true positive rate (TPR), false positive rate (FPR), precision, recall, training iterations and detection time cost.
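For reference, the rate-based metrics listed above follow the standard confusion matrix definitions, sketched below with malware as the positive class. The function name and the counts in the example are hypothetical and are not results from our experiments; the sketch also assumes that both classes are non-empty.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion matrix counts (malware = positive class)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),        # true positive rate, identical to recall
        "fpr": fp / (fp + tn),        # false positive rate
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

A low FPR corresponds to few packed benign instances being reported as malware.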
Differential analysis of distributions of system calls between malware and benign
Convergence process analysis of different layers of our neural networks
In this subsection, we analyze the convergence process of our principal component initialized multi-layers neural networks and show the efficiency of the neural networks.
When we append more hidden layers (more than 4 layers) to our networks, the detection accuracy increases more slowly while the detection time cost grows rapidly. Considering the trade-off between accuracy and time consumption, we choose 4 hidden layers for our approach, which improves the detection accuracy rate while keeping the training and detection time consumption acceptable.
State-of-the-art approaches for comparison
To demonstrate that our approach is efficient and effective, we compare it with other state-of-the-art methods (Konrad et al. 2011; Canzanese et al. 2015). Konrad Rieck et al. (2011) proposed to embed system calls into a vector and adopt clustering and classification methods to detect malware. Raymond Canzanese et al. (2015) proposed to use a vector of system call n-gram frequencies and several classifiers, such as support vector machines and logistic regression, to detect malware.
Time cost analysis
Table: The comparisons of training iterations and detection time cost
Methods compared: Konrad Rieck et al. (clustering); Raymond Canzanese et al. (SVM); our approach (1, 2, 3 and 4 hidden layers). Column: detection time cost.
From the comparisons of detection time cost, we find that our approach significantly improves the detection speed compared with the other state-of-the-art methods. In the detection phase, our neural networks only need to forward pass the inputs through the already trained parameters, which costs little time, while the other methods need to search for similarities in a serial manner, which incurs a longer delay.
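The forward pass mentioned above amounts to a few matrix-vector products, which the following minimal sketch makes explicit. The sigmoid hidden activation, the softmax output and the toy weights are assumptions made for illustration; the activation functions and parameters of our networks are not specified here.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, hidden_layers, output_layer):
    """Return two confidence values, ordered as (malware, benign) by assumption."""
    activation = features
    for weights, biases in hidden_layers:
        activation = sigmoid(dense(activation, weights, biases))
    weights, biases = output_layer
    return softmax(dense(activation, weights, biases))

# Toy parameters standing in for already trained ones.
hidden_layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
output_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
confidences = forward([0.2, 0.1], hidden_layers, output_layer)
```

Because the cost depends only on the layer sizes and not on the number of training samples, classifying one instance takes a fixed amount of work.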
Malware is a pervasive problem in distributed computer and network systems today. Many approaches have been proposed to detect malware by using machine learning. Some of them use static analysis, and the rest use dynamic analysis. However, when facing varied packers, these methods do not always perform well. In this section, we review some of them.
I. Santos et al. (2011) proposed a data mining technique to mine the relevance of each opcode and assess the frequency of each opcode sequence, and then used Euclidean space (Euclidean Space 2017) to measure the distance between software instances. S. Cesare et al. (2014) proposed a technique that performs similarity searching of sets of control flow graphs. L. Nataraj et al. (2011) proposed a method for visualizing and classifying malware binaries as gray-scale images. J. Zhang et al. (2016a, 2016b) proposed to convert opcodes into a 2-D matrix and used an image processing method to recognize malware executables. W. Yang et al. (2015) proposed a static analysis approach that extracts the contexts of security-sensitive behaviors to assist app analysis in differentiating between malicious and benign behaviors. However, a number of malware authors use packing techniques to compress and encrypt the malicious code, which prevents these approaches from working if the packers cannot be identified or unpacked.
R. Konrad et al. (2011) proposed to automatically identify novel classes of malware with similar sequential system calls and assign unknown malware to these discovered classes. H. Bai et al. (2014) proposed to identify malware variants by using a support vector machine with malicious behaviours which are triggered with their resulting outcomes. C. Kumar et al. (2012) check whether a target’s system call dependency follows the same dependency as signatured malware. H. Zhang et al. (2016c) proposed to discover the underlying triggering relations of a number of network events in order to detect malware activities on a host. J. Huang et al. (2014) analyzed the user interface component associated with the top level function and looked for mismatches between the two to detect stealthy behaviour. L. Xu et al. (2016) implemented a graph-based representation for system calls, then used graph kernels to compute pair-wise similarities and fed these similarity measures into a support vector machine for classification. I. Santos et al. (2013) proposed a hybrid malware variant detector called OPEM, which utilizes a set of features obtained from both static and dynamic analysis of malicious code. However, these approaches do not consider packers’ behaviours, which obfuscate the original behaviours of executables.
G. Suarez-Tangil et al. (2016) propose to analyze the behavioral differences between the original app and some automatically repackaged versions of it. However, when a new variant is packed by an unknown tool, their approach no longer works because it has not yet analyzed the differences. Z. Shehu et al. (2016) propose to compute an execution fingerprint of an obfuscated app and compare it with an available database of fingerprints of known malware to discover possible matches. However, these matches can easily be confused by varied packers and benign apps. J. Calvet et al. (2012) propose a method for identifying cryptographic functions. K. Coogan et al. (2009) propose to identify the transition points in the code where execution transitions from unpacker code to the unpacked code. P. Royal et al. (2006) propose a tool named PolyUnpack, based on the observation that sequences of packed or hidden code in malware can be made self-identifying when its runtime execution is checked against its static code model. However, these unpacking techniques are not always helpful, since not all packers can be unpacked.
As our approach is based on a deep learning method, it might itself be attacked by adversaries, which raises another security problem. Although we design a retraining and testing process to avoid poisonous data attacks from crackers and retain the detection performance (in Section “Methodology”), a persistent attack could still prevent our neural networks from being further updated through retraining with newly detected samples.
In this paper, we propose a novel approach which can detect packed malware variants without unpacking. To this end, we propose sensitive system call based principal component initialized multi-layers neural networks, which perform well in terms of classification accuracy and speed. Theoretical analysis and real-life experimental results show that our packed malware variants detection technique is comparable with the state-of-the-art methods.
As future work, besides system calls, we will take more running behaviours, such as connections and user operations, into consideration to strengthen our detection. In addition, we will focus on protecting our malware detection system from poisonous data attacks.
This work is partially supported by the National Science Foundation of China under Grant No. 61772191, No. 61472131.
National Science Foundation of China under Grant No. 61772191, No. 61472131.
Availability of data and materials
Our malware instances were downloaded from the VX Heaven (2017) website.
I would like to declare on behalf of my co-authors that the work described is original research and that none of the data and material in the paper has been published or is under consideration for publication elsewhere. All the authors listed have approved the enclosed manuscript.
In this work, Jixin Zhang and Kehuan Zhang conceived the paper, verified and conducted the analysis and the results. Jixin Zhang and Hui Yin designed and developed the prototype. Jixin Zhang and Qixin Wu wrote the text presented here. Hui Yin and Qixin Wu collected the data and prepared the data. Kehuan Zhang and Zheng Qin supervised the whole process. All authors provided input and approved the manuscript.
No conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- ASPack, http://www.aspack.com (2017)
- Bai H et al (2014) Approach for malware identification using dynamic behaviour and outcome triggering. IET Inf Secur 8(2):140–151
- Calvet J et al (2012) Aligot: Cryptographic Function Identification in Obfuscated Binary Programs. In: Proc. of ACM Conference on Computer and Communications Security, pp 169–182
- Canzanese R et al (2015) System call-based detection of malicious processes. In: Proc. of 2015 IEEE International Conference on Software Quality, Reliability and Security, pp 119–124
- Cesare S et al (2014) Control flow-based malware variant detection. IEEE Trans Dependable and Secure Comput 11(4):307–317
- Coogan K et al (2009) Automatic Static Unpacking of Malware Binaries. In: Proc. of Working Conference on Reverse Engineering, pp 167–176
- Esmaily J et al (2015) Intrusion detection system based on Multi-Layer Perceptron Neural Networks and Decision Tree. In: Proc. of IEEE Conference on Information and Knowledge Technology, pp 1–5
- Euclidean Space, https://en.wikipedia.org/wiki/Euclidean_space (2017)
- Fernándezcaballero A et al (2003) On motion detection through a multi-layer neural network architecture. Neural Netw 16(2):205–222
- Gradient descent, https://en.wikipedia.org/wiki/Gradient_descent (2017)
- Huang J et al (2014) AsDroid: detecting stealthy behaviors in Android applications by user interface and program behavior contradiction. In: Proc. of ACM/IEEE International Conference on Software Engineering, pp 1036–1046
- Jang J et al (2015) Mal-Netminer: Malware Classification Approach Based on Social Network Analysis of System Call Graph. In: Proc. of the 23rd International Conference on World Wide Web Companion, pp 731–734
- Kolosnjaji B et al (2016) Deep Learning for Classification of Malware System Call Sequences. In: Proc. of Australasian Joint Conference on Artificial Intelligence, pp 137–149
- Konrad R et al (2011) Automatic analysis of malware behavior using machine learning. J Comput Secur 19:639–668
- Kullback-Leibler divergence, https://en.wikipedia.org/wiki/Kullback-Leibler_divergence (2018)
- Kumar C et al (2012) Obfuscated Malware Detection Using API Call Dependency. In: Proc. of ACM International Conference on Security of Internet of Things, pp 289–300
- Li Z et al (2018) VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. arXiv:1801.01681v1 [cs.CR]
- Malwr, https://malwr.com/ (2018)
- Nataraj L et al (2011) A Comparative Assessment of Malware Classification using Binary Texture Analysis and Dynamic Analysis. In: Proc. of ACM Workshop on Security & Artificial Intelligence, pp 21–30
- PCA, https://en.wikipedia.org/wiki/Principal_component_analysis (2017)
- Peng H et al (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
- Raman K et al (2012) Selecting features to classify malware. In: InfoSec Southwest
- Receiver Operating Characteristic, https://en.wikipedia.org/wiki/Receiver_operating_characteristic (2018)
- Royal P et al (2006) PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware. In: Proc. of 22nd Annual Computer Security Applications Conference, pp 289–300
- Salai Selvam V et al (2011) Brain tumor detection using scalp EEG with modified Wavelet-ICA and multi layer feed forward neural network. In: Proc. of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp 6104–6109
- Salcedo Parra O et al (2014) Traffic forecasting using a multi layer perceptron model. In: Proc. of ACM Symposium on QoS and Security for Wireless and Mobile Networks, pp 133–136
- Santos I et al (2011) Opcode sequences as representation of executables for data mining based malware variant detection. Inf Sci 231(9):64–82
- Santos I et al (2013) OPEM: A Static-Dynamic Approach for Machine Learning Based Malware Detection. In: Proc. of International Conference CISIS’12, pp 271–280
- Shehu Z et al (2016) Towards the Usage of Invariant-Based App Behavioral Fingerprinting for the Detection of Obfuscated Versions of Known Malware. In: Proc. of IEEE International Conference on Next Generation Mobile Applications, Security and Technologies, pp 289–300
- Suarez-Tangil G et al (2016) ALTERDROID: differential fault analysis of obfuscated smart-phone malware. IEEE Trans Mob Comput 15(4):789–802
- Treadwell S et al (2009) A Heuristic Approach for Detection of Obfuscated Malware. In: Proc. of IEEE International Conference on Intelligence & Security Informatics, pp 291–299
- UPX, https://upx.github.io (2017)
- VMProtect, https://vmpsoft.com/products/vmprotect/ (2017)
- VX Heaven, https://hypestat.com/info/vxheaven.org (2017)
- Xu L et al (2016) Dynamic Android Malware Classification Using Graph-Based Representations. In: Proc. of IEEE International Conference on Cyber Security and Cloud Computing, pp 220–231
- Yang W et al (2015) AppContext: differentiating malicious and benign mobile app behaviors using context. In: Proc. of IEEE/ACM International Conference on Software Engineering, Firenze, Italy, pp 303–313
- Zhang J et al (2016a) Malware Variant Detection Using Opcode Image Recognition with Small Training Sets. In: Proc. of IEEE International Conference on Computer Communication and Networks, pp 1–9
- Zhang J et al (2016b) IRMD: Malware Variant Detection Using Opcode Image Recognition. In: Proc. of IEEE International Conference on Parallel and Distributed Systems, pp 1175–1180
- Zhang H et al (2016c) Detection of stealthy malware activities with traffic causality and scalable triggering relation discovery. ACM Transactions on Privacy and Security 19(2):article 4
- ZProtect, https://tuts4you.com/download.php?view.3017 (2017)