In this paper, we transform the packed malware variant detection problem into a system call classification problem. To reduce the obfuscation introduced by packers, we first extract sensitive system calls and discard the obfuscated ones. We then organize these sensitive system calls into a vector that is later fed to our neural networks. Because system calls are a coarse-grained and sparse representation of executables, they lead to poor training approximation and feature generalization. We therefore propose principal component initialized multi-layer neural networks to efficiently and effectively train on these sparse vectors and detect malicious instances.

Our approach consists of two phases: a training phase and a detection phase. The workflow is shown in Fig. 1. In the training phase, we monitor the system interactions of executables in the Cuckoo sandbox (Malwr 2018) to obtain their system calls. Each executable profile obtained from the Cuckoo sandbox contains several fields: timestamp, system call, base address, file name, execution count, etc. We consider only the system calls, since they provide enough information to characterize the behaviour of malware while reducing noise and redundancy. Then, based on information gain (Peng et al. 2005), a selector extracts sensitive system calls: it selects a series of system calls that occur with high frequency in malicious executables and discards the other system calls that are commonly used everywhere. The selector outputs a vector organized from these sensitive system calls. Finally, our principal component initialized multi-layer neural networks are trained on these sensitive system calls to obtain the parameters used for classification in the detection phase. In the detection phase, the neural networks, equipped with these parameters, classify packed malware variants and packed benign executables.

### Information gain based sensitive system calls extraction

We obtain the system calls of executables by monitoring their runtime behaviour in the Cuckoo sandbox. Since modern malicious executables are usually equipped with sophisticated packers, the recorded system calls contain not only the system calls of the original program but also those of the packer, which obfuscate the distribution of the original system calls and limit detection accuracy. To preserve detection accuracy, in this paper we first reduce the obfuscation introduced by packers by extracting sensitive system calls. We begin with a definition of our sensitive system calls.

### Definition 1

The sensitive system calls are the subset of system calls that occur with high frequency in unpacked malicious executables but not in unpacked legitimate ones.

This insight is based on an important observation: the average distribution of sensitive system calls in unpacked malicious executables is nearly the same as in packed ones, which implies, as a deduction of our approach, that the sensitive system calls also occur with low frequency in packers. Based on this deduction, we use the sensitive system calls as the representation of malicious executables.

In this paper, we use information gain, which has been widely used for feature selection. Let Y be the training data set, where y_{1} is the malware data set and y_{2} is the benign data set. Let S be the set of all system calls, where s_{i} is the i^{th} system call in S. Let X be the set of sensitive system calls extracted from S, where x_{j} is the j^{th} sensitive system call in X. To extract the sensitive system calls, we use the information gain gain(s_{i}) as the weight of each system call s_{i} according to Eq. (1), where p(s_{i}) is the probability of each s_{i}, p(y_{1}) is the probability of malware variants, p(s_{i}|y_{1}) is the probability of each s_{i} in y_{1}, and t is a constant threshold used to select the sensitive system calls. The larger gain(s_{i}) is, the more relevant s_{i} is to malicious executables.

$$ \mathrm{gain}\left({s}_{i}\right)=p\left({s}_{i}|{y}_1\right)\cdot \log \frac{p\left({s}_{i}|{y}_1\right)}{p\left({s}_{i}\right)\cdot p\left({y}_1\right)} $$

(1)
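As a concrete illustration, the gain of Eq. (1) and the threshold-based selector can be sketched as below. This is a minimal sketch, not the authors' implementation; the function names, the dictionary-based probability estimates, and the threshold parameter are all illustrative assumptions.

```python
import math

def information_gain(p_si_given_y1, p_si, p_y1):
    """Weight of a system call s_i per Eq. (1):
    gain(s_i) = p(s_i|y1) * log(p(s_i|y1) / (p(s_i) * p(y1)))."""
    if p_si_given_y1 == 0:
        return 0.0  # s_i never occurs in malware, so it carries no gain
    return p_si_given_y1 * math.log(p_si_given_y1 / (p_si * p_y1))

def select_sensitive_calls(calls, probs_malware, probs_all, p_malware, threshold):
    """Keep the system calls whose gain exceeds a constant threshold t."""
    return [s for s in calls
            if information_gain(probs_malware.get(s, 0.0),
                                probs_all[s], p_malware) > threshold]
```

For example, a call seen in half of the malware traces but only a quarter of all traces receives a high gain and survives the selection, while a call absent from malware is dropped.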

Let f_{k} be the k^{th} executable in Y. We calculate the probability p(x_{j}|f_{k}) of each x_{j} in f_{k} according to Eq. (2), where N(f_{k}) is the total count of all sensitive system calls in f_{k} and N(x_{j}|f_{k}) is the count of x_{j} in f_{k}.

$$ \mathrm{p}\left({\mathrm{x}}_{\mathrm{j}}\left|{\mathrm{f}}_{\mathrm{k}}\right.\right)=\frac{\mathrm{N}\left({\mathrm{x}}_{\mathrm{j}}\left|{\mathrm{f}}_{\mathrm{k}}\right.\right)}{\mathrm{N}\left({\mathrm{f}}_{\mathrm{k}}\right)} $$

(2)

These p(x_{j}|f_{k}) values are then sent as inputs to our principal component initialized multi-layer neural networks to detect malicious executables.
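Building the per-executable input vector of Eq. (2) amounts to normalized counting over the sensitive-call vocabulary. A minimal sketch, assuming a trace is a list of system call names (the example call names are hypothetical):

```python
from collections import Counter

def call_probability_vector(trace, sensitive_calls):
    """Eq. (2): p(x_j|f_k) = N(x_j|f_k) / N(f_k), where N(f_k) counts
    only the sensitive system calls observed in executable f_k."""
    keep = set(sensitive_calls)
    counts = Counter(c for c in trace if c in keep)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(sensitive_calls)  # no sensitive calls observed
    return [counts[x] / total for x in sensitive_calls]
```

The vector has one fixed slot per sensitive system call, so executables with different trace lengths map to inputs of the same dimension.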

### Principal component initialized multi-layers neural networks for malware detection

Having extracted the sensitive system calls, we now discuss how to detect packed malware variants using our principal component initialized multi-layer neural networks.

As efficient classifiers, neural networks are widely used in many fields, such as image recognition and natural language processing. In this paper, we use neural networks to classify malicious and legitimate executables. Multi-layer neural networks (Fernándezcaballero et al. 2003; Esmaily et al. 2015; Salai Selvam et al. 2011; Salcedo Parra et al. 2014), as a deep learning method, achieve a faster convergence rate and higher accuracy than single-hidden-layer neural networks, but they also bring drawbacks such as vanishing gradients and over-fitting. To overcome these drawbacks and further improve the convergence and accuracy rates, we propose our principal component initialized multi-layer neural networks.

The architecture of our neural networks is presented in Fig. 2. It consists of one input layer (a sensitive system call probability vector), a principal component initialized feature layer, four hidden layers (considering the trade-off between accuracy and time cost, we choose four hidden layers to improve accuracy while keeping training and detection time acceptable), and one output layer. During the forward pass, the networks first use an orthogonal transformation to convert the inputs into a set of linearly uncorrelated features called principal components (PCA 2017). These principal components allow our networks to converge quickly. Each principal component initialized feature is fully connected to the units in the first hidden layer, and each unit in a hidden layer is fully connected to the next layer. The output is a vector of 1s and 0s that represents the label malware or benign. During back propagation, the networks use the gradient descent method (Gradient descent 2017) to propagate the variance from the output layer back to the principal component initialized feature layer and update the weight matrices between layers.

We first assign the weight matrices a set of random values and calculate the average probability S_{avg} of the sensitive system calls over the training data set. For each executable, we push its p(x_{j}|f_{k}) values into the networks as inputs. In the principal component initialized feature layer, the networks first calculate the variance vector S_{var}(f_{k}) according to Eq. (3). Then, the networks calculate the covariance matrix and its eigenvectors. Let cvmat be the covariance matrix, computed according to Eq. (4), where n_{training} is the number of training samples.

$$ {\mathrm{S}}_{\mathrm{var}}{\left({\mathrm{f}}_{\mathrm{k}}\right)}_{\mathrm{j}}=\frac{\mathrm{p}\left({\mathrm{x}}_{\mathrm{j}}\left|{\mathrm{f}}_{\mathrm{k}}\right.\right)-{\mathrm{S}}_{{\mathrm{avg}}_{\mathrm{j}}}}{n_{training}} $$

(3)

$$ \mathrm{cvmat}=\frac{\sum {\mathrm{S}}_{\mathrm{var}}{\left({\mathrm{f}}_{\mathrm{k}}\right)}^{\mathrm{T}}\cdot {\mathrm{S}}_{\mathrm{var}}\left({\mathrm{f}}_{\mathrm{k}}\right)}{n_{training}} $$

(4)

Let eigenV be the matrix of column eigenvectors of cvmat, where eigenV_{i} is the i^{th} eigenvector in eigenV, ordered by eigenvalue a_{i} from maximum to minimum; the eigenvalues are obtained from the characteristic Eq. (5), where E is the identity matrix.

$$ \left|\mathrm{cvmat}-a\cdot E\right|=0 $$

(5)

We take the top t eigenvectors (column vectors; t is 50, our principal component initialized feature dimension) to form a new matrix eigenM and calculate the principal component initialized features pc_{j} according to Eq. (6). These features are the inputs of the following hidden layers, and they enlarge the contrast between the average distributions of packed malicious executables and packed legitimate executables.

$$ {\mathrm{pc}}_j={\mathrm{S}}_{\mathrm{var}}{\left({f}_k\right)}_j\cdot \mathrm{eigenM} $$

(6)
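The principal component initialized feature layer of Eqs. (3)-(6) can be sketched with a standard eigendecomposition. This is an illustrative sketch, not the authors' code: it mean-centres the data once and folds the per-sample 1/n_training scaling of Eq. (3) into the covariance normalisation of Eq. (4), which rescales the components without changing the eigenvector directions.

```python
import numpy as np

def pca_features(P, t=50):
    """Principal component initialized feature layer (Eqs. 3-6).
    P: (n_training, d) matrix of p(x_j|f_k) vectors, one row per executable.
    Returns the projected features and the projection matrix eigenM."""
    n = P.shape[0]
    S_var = P - P.mean(axis=0)                # centred variance vectors, Eq. (3)
    cvmat = S_var.T @ S_var / n               # covariance matrix, Eq. (4)
    eigvals, eigvecs = np.linalg.eigh(cvmat)  # eigen-decomposition, Eq. (5)
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues descending
    eigenM = eigvecs[:, order[:t]]            # top-t column eigenvectors
    return S_var @ eigenM, eigenM             # projected features, Eq. (6)
```

At detection time, a new executable's centred vector is projected with the same stored eigenM, so training and detection share one feature space.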

Let u_{i}^{(1)} be the i^{th} unit in the first hidden layer. We calculate u_{i}^{(1)} according to Eq. (7), where w_{j,i}^{(1)} is the weight between the j^{th} principal component initialized feature and the i^{th} unit in the first hidden layer.

$$ {{\mathrm{u}}_{\mathrm{i}}}^{(1)}=\frac{1}{1+{e}^{-\sum {pc}_j\cdot {w_{j,\mathrm{i}}}^{(1)}}} $$

(7)

Let u_{i}^{(m + 1)} be the i^{th} unit in the (m + 1)^{th} hidden layer. We calculate u_{i}^{(m + 1)} according to Eq. (8), where w_{j,i}^{(m + 1)} is the weight between the j^{th} unit in the m^{th} hidden layer and the i^{th} unit in the (m + 1)^{th} hidden layer.

$$ {{\mathrm{u}}_{\mathrm{i}}}^{\left(\mathrm{m}+1\right)}=\frac{1}{1+{e}^{-\sum {u_j}^{(m)}\cdot {w_{j,i}}^{\left(m+1\right)}}} $$

(8)

Let h_{i} be the i^{th} unit in the output layer. We calculate h_{i} according to Eq. (9), where w_{j,i}^{(n + 1)} is the weight between the j^{th} unit in the n^{th} (last) hidden layer and the i^{th} unit in the output layer.

$$ {\mathrm{h}}_{\mathrm{i}}=\frac{1}{1+{e}^{-\sum {u_j}^{(n)}\cdot {w_{j,i}}^{\left(n+1\right)}}} $$

(9)
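Since Eqs. (7)-(9) all apply the same sigmoid to a weighted sum of the previous layer's activations, the full forward pass can be written as one loop. A minimal sketch under the assumption that the weight matrices (PCA features through output) are stored as a plain list:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(pc, weights):
    """Forward pass of Eqs. (7)-(9): each layer is the sigmoid of the
    weighted sum of the previous layer's activations. `weights` is the
    list [w^(1), ..., w^(n+1)]; the returned list holds every layer's
    activations, ending with the output vector h."""
    activations = [np.asarray(pc)]
    for w in weights:
        activations.append(sigmoid(activations[-1] @ w))
    return activations
```

Keeping every intermediate activation is deliberate: back propagation needs the per-layer values u^(m) to compute the sigmoid derivatives u(1 - u).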

To approach the target, the networks are trained on the inputs, and the weight matrices are corrected through back propagation using the gradient descent method (Gradient descent 2017). The loss function we use is the square loss E(x) of Eq. (10), where H(x), consisting of the units h_{i}, is the output of the networks and V is the true label value.

$$ \mathrm{E}(x)=\sum {\left(\mathrm{V}-\mathrm{H}(x)\right)}^2 $$

(10)

We update the weight matrix w_{j,i}^{(n + 1)} between the last hidden layer u_{i}^{(n)} and the output layer h_{i} according to Eq. (11), where v_{i} is the true label value of an executable in the training set and α is a constant learning rate.

$$ {{\mathrm{w}}_{j,i}}^{\left(n+1\right)}={{\mathrm{w}}_{j,i}}^{\left(n+1\right)}+\alpha \cdot {u_j}^{(n)}\cdot {h}_i\left(1-{h}_i\right)\left({v}_i-{h}_i\right) $$

(11)

Let w_{j,i}^{(n)} be the weight matrix between the (n-1)^{th} hidden layer and the n^{th} hidden layer; it is updated according to Eqs. (12) and (13), where var^{(n)} is the variance between the (n-1)^{th} hidden layer and the n^{th} hidden layer.

$$ {{\mathrm{w}}_{j,i}}^{(n)}={{\mathrm{w}}_{j,i}}^{(n)}+\alpha \cdot {u_j}^{\left(n-1\right)}\cdot {u_i}^{(n)}\cdot \left(1-{u_i}^{(n)}\right)\cdot {\operatorname{var}}^{(n)} $$

(12)

$$ {\operatorname{var}}^{(n)}=\sum \left({\mathrm{v}}_{\mathrm{i}}-{\mathrm{h}}_{\mathrm{i}}\right)\cdot {w_{j,i}}^{\left(n+1\right)} $$

(13)

Let w_{j,i}^{(m + 1)} be the weight matrix between the m^{th} hidden layer and the (m + 1)^{th} hidden layer; it is updated according to Eqs. (14) and (15), where var^{(m + 1)} is the variance between the m^{th} hidden layer and the (m + 1)^{th} hidden layer.

$$ {{\mathrm{w}}_{j,i}}^{\left(m+1\right)}={{\mathrm{w}}_{j,i}}^{\left(m+1\right)}+\alpha \cdot {u_j}^{(m)}\cdot {u_i}^{\left(m+1\right)}\cdot \left(1-{u_i}^{\left(m+1\right)}\right)\cdot {\operatorname{var}}^{\left(m+1\right)} $$

(14)

$$ {\operatorname{var}}^{\left(m+1\right)}={\operatorname{var}}^{\left(m+2\right)}\cdot {w_{j,i}}^{\left(m+2\right)} $$

(15)

Let w_{j,i}^{(1)} be the weight matrix between the principal component initialized feature layer and the first hidden layer; it is updated according to Eqs. (16) and (17), where var^{(1)} is the variance between the principal component initialized layer and the first hidden layer.

$$ {{\mathrm{w}}_{\mathrm{j},\mathrm{i}}}^{(1)}={{\mathrm{w}}_{\mathrm{j},\mathrm{i}}}^{(1)}+\alpha \cdot {pc}_j\cdot {u_i}^{(1)}\cdot \left(1-{u_i}^{(1)}\right)\cdot {\operatorname{var}}^{(1)} $$

(16)

$$ {\operatorname{var}}^{(1)}={\operatorname{var}}^{(2)}\cdot {w_{j,i}}^{(2)} $$

(17)
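The update rules of Eqs. (11)-(17) can be sketched as a single backward sweep. This is an illustrative sketch of the equations as stated, not the authors' implementation; it assumes the activation list produced by a forward pass (PCA features first, output h last), and it reads each weight matrix before updating it, since the equations do not specify the update order.

```python
import numpy as np

def backprop_update(activations, v, weights, alpha=0.1):
    """One gradient-descent step mirroring Eqs. (11)-(17).
    activations: per-layer values [pc, u^(1), ..., u^(n), h].
    v: true label vector; weights: [w^(1), ..., w^(n+1)], updated in place."""
    h = activations[-1]
    err = v - h                                  # (v_i - h_i)
    var = weights[-1] @ err                      # Eq. (13), pre-update weights
    # Eq. (11): output-layer update with the sigmoid derivative h(1 - h)
    weights[-1] += alpha * np.outer(activations[-2], h * (1.0 - h) * err)
    # Eqs. (12), (14), (16): hidden-layer updates back to the PCA features
    for m in range(len(weights) - 2, -1, -1):
        u = activations[m + 1]
        next_var = weights[m] @ var if m > 0 else None   # Eqs. (15), (17)
        weights[m] += alpha * np.outer(activations[m], u * (1.0 - u) * var)
        var = next_var
    return weights
```

Note that each update multiplies the propagated variance by the local sigmoid derivative u(1 - u), matching the pattern shared by Eqs. (11), (12), (14), and (16).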

After the training phase, we obtain a set of parameters with which the neural networks are equipped to classify packed malware variants. The output is a vector of two confidence values, which represent the probabilities of malware and benign, respectively. When the malware confidence value is high enough, we deem the detection sufficiently reliable and use the target instance as a malware variant for the next round of retraining. We retrain a copy of our neural networks to generate new parameters, with which the current networks are then equipped. To guard against data-poisoning attacks by crackers, we first prepare a set of known test cases and use them to test the retrained networks. We equip the current networks with the retrained parameters only if the testing accuracy does not suddenly drop.
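The retrain-then-validate gate described above can be sketched as follows. All names are illustrative placeholders: `retrain` and `evaluate` stand in for the paper's training and testing routines, and `min_accuracy` for the accuracy level below which the retrained parameters are rejected.

```python
import copy

def gated_retrain(model, new_samples, holdout, evaluate, retrain, min_accuracy):
    """Retrain a copy of the networks on newly detected samples and adopt
    the new parameters only if accuracy on a held-out set of known test
    cases stays at or above min_accuracy (guards against poisoned data)."""
    candidate = copy.deepcopy(model)      # retrain a copy, not the live model
    retrain(candidate, new_samples)
    if evaluate(candidate, holdout) >= min_accuracy:
        return candidate                  # adopt the retrained parameters
    return model                          # keep the current parameters
```

Retraining a deep copy ensures the deployed networks keep serving detections unchanged while the candidate parameters are being validated.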