
Table 2 Solutions using Deep Learning for eight security problems. The metrics in the Evaluation column include accuracy (ACC), precision (PRE), recall (REC), F1 score (F1), false positive rate (FPR), and false negative rate (FNR)

From: Using deep learning to solve computer security challenges: a survey
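As a reading aid for the Evaluation column, the following minimal sketch computes the six metrics named in the caption from confusion-matrix counts. The function name, the dictionary keys, and the example counts are illustrative assumptions, not values taken from any of the surveyed works.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, PRE, REC, F1, FPR and FNR from confusion-matrix counts."""
    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "ACC": ratio(tp + tn, tp + fp + tn + fn),
        "PRE": precision,
        "REC": recall,
        "F1": ratio(2 * precision * recall, precision + recall),
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```

Note that REC and FNR always sum to 1 when both are defined, so a row reporting one of them implicitly reports the other.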

Security Problem

Works

Summary

Security Oriented Program Analysis (Shin et al. 2015; Chua et al. 2017; Guo et al. 2019; Xu et al. 2017)

RFBNN (Shin et al. 2015)

Phase I

Phase II

  
  

The dataset comes from a previous paper (Bao et al. 2014) and consists of 2200 separate binaries. Of these, 2064 are Linux binaries obtained from the coreutils, binutils, and findutils packages. The remaining 136 are Windows binaries from popular open-source projects. Half of the binaries are for x86 and the other half for x86-64.

They extract fixed-length subsequences (1000-byte chunks) from the code section of each binary, then apply one-hot encoding, which converts each byte into a \(\mathbb{Z}^{256}\) vector.

  
  

Phase III

Phase IV

Evaluation

  

N/A

Bi-directional RNN

ACC:98.4%

PRE:N/A

    

REC:0.97

F1:0.98

    

FPR:N/A

FNR:N/A
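The RFBNN preprocessing described above (1000-byte chunks, each byte one-hot encoded into a 256-dimensional vector) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the input bytes are placeholders rather than a real code section, and dropping a short trailing chunk is an assumption.

```python
def chunk_bytes(code, chunk_len=1000):
    """Split a byte string into fixed-length chunks.

    Assumption: a trailing chunk shorter than chunk_len is dropped.
    """
    last_start = len(code) - chunk_len
    return [code[i:i + chunk_len] for i in range(0, last_start + 1, chunk_len)]


def one_hot_byte(value):
    """Map a byte value (0..255) to a 256-dimensional 0/1 vector."""
    vec = [0] * 256
    vec[value] = 1
    return vec


def encode_chunk(chunk):
    """One-hot encode every byte of a chunk."""
    return [one_hot_byte(b) for b in chunk]


# Placeholder data: 2048 bytes standing in for a code section.
code_section = bytes(range(256)) * 8
chunks = chunk_bytes(code_section)
encoded = encode_chunk(chunks[0])
```

Each encoded chunk is a 1000 x 256 matrix, which is the kind of per-byte sequence a recurrent network can consume one step at a time.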

 

EKLAVYA (Chua et al. 2017)

Phase I

Phase II

  
  

They adopted the source code from previous work (Shin et al. 2015) as their raw data, then obtained two datasets by compiling it with two commonly used compilers, gcc and clang, at optimization levels ranging from O0 to O3 for both x86 and x64. They obtained the ground truth for the function arguments by parsing the DWARF debug information. Next, they extracted functions from the binaries and removed functions that duplicate other functions in the dataset. Finally, they matched each caller snippet with its callee body.

Tokenizing the hexadecimal value of each instruction.

  

Phase III

Phase IV

Evaluation

  

Word2vec technique to compute word embeddings.

RNN

ACC:81.0%

PRE:N/A

    

REC:N/A

F1:N/A

    

FPR:N/A

FNR:N/A

Defending Return Oriented Programming Attacks (Li et al. 2018; Chen et al. 2018; Zhang et al. 2019)

ROPNN (Li et al. 2018)

Phase I

Phase II

  

The data is a set of gadget chains obtained from existing programs. A gadget-searching tool, ROPGadget, is used to find available gadgets. Gadgets are chained based on whether the resulting gadget chain is executable on a CPU emulator. The raw data is represented as instruction sequences in hexadecimal form.

Form a one-hot vector for each byte.

  

Phase III

Phase IV

Evaluation

  

N/A

1-D CNN

ACC:99.9%

PRE:0.99

    

REC:N/A

F1:0.01

    

FPR:N/A

FNR:N/A

 

HeNet (Chen et al. 2018)

Phase I

Phase II

  
  

Data is acquired from Intel PT, a processor trace tool that can log control-flow data. The Taken Not-Taken (TNT) packet and the Target IP (TIP) packet are the two packets of interest. Both are logged as binary numbers: information about executed branches can be obtained from TNT, and the binary executed can be obtained from TIP. The binary sequences are then converted, byte by byte, into sequences of values between 0 and 255, called pixels.

Given the pixel sequences, slice the whole sequence and reshape the slices into sequences of images for neural network training.

  
  

Phase III

Phase IV

Evaluation

  

Word2vec technique to compute word embeddings.

DNN

ACC:98.1%

PRE:0.99

    

REC:0.96

F1:0.97

    

FPR:0.01

FNR:0.04
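The HeNet preprocessing above (trace bytes become pixel values 0-255, and the pixel sequence is sliced and reshaped into images) can be illustrated with a short sketch. The image height and width, the function names, and the input bytes are assumptions for illustration only; the table does not state the image dimensions, and dropping an incomplete final slice is also an assumption.

```python
def to_pixels(trace):
    """Interpret each byte of a trace as a pixel value in 0..255."""
    return list(trace)


def to_images(pixels, height, width):
    """Slice a pixel sequence into consecutive height x width images.

    Assumption: pixels left over after the last full image are dropped.
    """
    size = height * width
    images = []
    for start in range(0, len(pixels) - size + 1, size):
        flat = pixels[start:start + size]
        rows = [flat[r * width:(r + 1) * width] for r in range(height)]
        images.append(rows)
    return images


# Placeholder bytes standing in for logged TNT/TIP packet data.
trace = bytes(i % 256 for i in range(40))
images = to_images(to_pixels(trace), height=4, width=4)
```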

Achieving Control Flow Integrity (Yagemann et al. 2019; Phan et al. 2017; Zhang et al. 2019)

Barnum (Yagemann et al. 2019)

Phase I

Phase II

  

The raw data, the exact sequence of instructions executed, is generated by combining the program binary, obtained immediately before the program opens a document, with the Intel® PT trace. The Intel® PT built-in filtering options are set to CR3 and current privilege level (CPL), so only program activity in user space is traced.

The raw instruction sequences are summarized into basic blocks with assigned IDs (BBIDs) and are then sliced into manageable subsequences with a fixed window size of 32, found experimentally. Only sequences ending on indirect calls, jumps, and returns are analyzed, since control-flow hijacking attacks always occur there. The label is the next BBID in the sequence.

  
  

Phase III

Phase IV

Evaluation

  

N/A

LSTM

ACC:N/A

PRE:0.98

    

REC:1.00

F1:0.98

    

FPR:0.98

FNR:0.02
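The Barnum slicing step above can be read as building (history, label) pairs: a window of 32 basic-block IDs that ends on an indirect control transfer, labeled with the block ID that follows. The sketch below follows that reading. The function name, the set of block IDs ending in an indirect transfer, and the synthetic trace are all hypothetical.

```python
def make_samples(bbids, indirect_blocks, window=32):
    """Build (history, label) pairs from a basic-block ID trace.

    history: `window` consecutive block IDs whose last block ends in an
             indirect call, jump, or return (membership in indirect_blocks).
    label:   the block ID executed immediately after the history.
    """
    samples = []
    for end in range(window - 1, len(bbids) - 1):
        if bbids[end] in indirect_blocks:
            history = bbids[end - window + 1:end + 1]
            samples.append((history, bbids[end + 1]))
    return samples


# Synthetic trace; block 3 is assumed to end in an indirect transfer.
trace = [i % 7 for i in range(100)]
samples = make_samples(trace, indirect_blocks={3})
```

Restricting samples to windows that end on indirect transfers keeps the training set focused on the points where, according to the row above, hijacking occurs.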

 

CFG-CNN (Phan et al. 2017)

Phase I

Phase II

  

The raw data is an instruction-level control-flow graph (CFG) constructed from program assembly code by an algorithm proposed by the authors. In the CFG, each vertex corresponds to one instruction and each directed edge corresponds to an execution path from one instruction to another. The program sets for the experiments are obtained from the popular programming contest site CodeChef.

Since each vertex of the CFG represents an instruction with complex information that can be viewed from different aspects, including instruction name, type, operands, etc., a vertex is represented as the sum of a set of real-valued vectors, one per view (e.g. addq 32,%rsp is converted to a linear combination of the randomly assigned vectors of addq, value, and reg). The CFG is then sliced by a set of fixed-size windows sliding through the entire graph to extract local features at different levels.

  
  

Phase III

Phase IV

Evaluation

  

N/A

DGCNN with different numbers of views and with or without operands

ACC:84.1%

PRE:N/A

    

REC:N/A

F1:N/A

    

FPR:N/A

FNR:N/A
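The multi-view vertex representation described for CFG-CNN (a vertex is the sum of randomly assigned vectors, one per view) can be sketched as below. The vector dimension, the seed, the uniform initialization range, and the function names are assumptions; the view tokens follow the addq example in the row above.

```python
import random


def make_embedder(dim=8, seed=0):
    """Return a function that assigns each token a fixed random vector."""
    rng = random.Random(seed)
    table = {}

    def embed(token):
        # Assign a vector on first use and reuse it afterwards.
        if token not in table:
            table[token] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        return table[token]

    return embed


def vertex_vector(views, embed):
    """Represent a vertex as the element-wise sum of its view vectors."""
    vectors = [embed(view) for view in views]
    return [sum(column) for column in zip(*vectors)]


embed = make_embedder()
# Views of `addq 32,%rsp`: instruction name plus two operand types.
vector = vertex_vector(["addq", "value", "reg"], embed)
```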

Defending Network Attacks (Millar et al. 2018; Zhang et al. 2019; Yuan et al. 2017; Varenne et al. 2019; Yin et al. 2017; Ustebay et al. 2019; Faker and Dogdu 2019)

50b(yte)-CNN (Millar et al. 2018)

Phase I

Phase II

  

The open dataset UNSW-NB15 is used. First, the tcpdump tool is used to capture 100 GB of raw traffic (i.e. PCAP files) containing benign activities and 9 types of attacks. The Argus and Bro-IDS (now called Zeek) analysis tools are then used, and twelve algorithms are developed, to generate a total of 49 features with the class label. In the end, the dataset contains 2,540,044 samples, which are stored in CSV files.

The first 50 bytes of each network traffic flow are extracted, and each byte is used directly as one input feature to the neural network.

  
  

Phase III

Phase IV

Evaluation

  

N/A

CNN with 2 hidden fully connected layers

ACC:N/A

PRE:N/A

    

REC:N/A

F1:0.93

    

FPR:N/A

FNR:N/A
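The 50-byte feature extraction described for 50b(yte)-CNN can be sketched in a few lines. The function name and the example payloads are hypothetical, and zero-padding flows shorter than 50 bytes is an assumption that the table does not confirm.

```python
def first_bytes_features(flow_payload, n=50):
    """Use the first n bytes of a flow as n integer features.

    Assumption: flows shorter than n bytes are right-padded with zeros.
    """
    head = flow_payload[:n]
    return list(head) + [0] * (n - len(head))


# Hypothetical flows: one shorter and one longer than 50 bytes.
short_flow = b"GET / HTTP/1.1\r\n"
long_flow = bytes(range(200))
short_features = first_bytes_features(short_flow)
long_features = first_bytes_features(long_flow)
```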

 

PCCN (Zhang et al. 2019)

Phase I

Phase II

  

The open dataset CICIDS2017, which contains benign traffic and 14 types of attacks, is used. Background benign network traffic is generated by profiling the abstract behavior of human interactions. Raw data are provided as PCAP files, and the results of the network traffic analysis using CICFlowMeter are provided as CSV files. In the end, the dataset contains 3,119,345 data samples and 83 features categorized into 15 classes (1 normal + 14 attacks).

A total of 1,168,671 flow records, covering 12 types of attack activities, are extracted from the original dataset. The flow data are then processed and visualized as grey-scale 2D graphs. The visualization method is not specified.

  
  

Phase III

Phase IV

Evaluation

  

N/A

Parallel cross CNN.

ACC:N/A

PRE:0.99

    

REC:N/A

F1:0.99

    

FPR:N/A

FNR:N/A

Malware Classification (De La Rosa et al. 2018; Saxe and Berlin 2015; Kolosnjaji et al. 2017; McLaughlin et al. 2017; Tobiyama et al. 2016; Dahl et al. 2013; Nix and Zhang 2017; Kalash et al. 2018; Cui et al. 2018; David and Netanyahu 2015; Rosenberg et al. 2018; Xu et al. 2018)

Rosenberg (Rosenberg et al. 2018)

Phase I

Phase II

  

The Android dataset contains the latest malware families and their variants, each with the same number of samples. The samples are labeled by VirusTotal. Cuckoo Sandbox is then used to extract dynamic features (API calls) and static features (strings). To filter out anti-forensic samples, they applied YARA rules and removed sequences with fewer than 15 API calls. After preprocessing and balancing the number of benign samples, the dataset has 400,000 valid samples.

Long sequences cause out-of-memory errors when training the LSTM model, so they use a sliding window of fixed size and pad shorter sequences with zeros. One-hot encoding is applied to API calls. For the static string features, they define a vector of 20,000 Boolean values indicating the most frequent strings in the entire dataset. If a sample contains a given string, the corresponding value in the vector is set to 1; otherwise it is 0.

  
  

Phase III

Phase IV

Evaluation

  

N/A

They used RNN, BRNN, LSTM, Deep LSTM, BLSTM, Deep BLSTM, GRU, bi-directional GRU, Fully-connected DNN, 1D CNN in their experiments

ACC:98.3%

PRE:N/A

    

REC:N/A

F1:N/A

    

FPR:N/A

FNR:N/A
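Two of the preprocessing steps in the Rosenberg row, fixed-size windows with zero padding and the Boolean vector over the most frequent strings, can be sketched as follows. The function names, the window size, the tiny vocabulary size, and the sample strings are illustrative assumptions (the row states 20,000 strings but no window size), and non-overlapping windows are one possible reading of "sliding window".

```python
from collections import Counter


def windows(sequence, size, pad=0):
    """Cut a sequence into consecutive windows, zero-padding the last one."""
    out = []
    for start in range(0, len(sequence), size):
        window = sequence[start:start + size]
        out.append(window + [pad] * (size - len(window)))
    return out


def build_vocabulary(samples, top_k):
    """Pick the top_k strings that occur in the most samples."""
    counts = Counter(s for strings in samples for s in set(strings))
    return [s for s, _ in counts.most_common(top_k)]


def string_indicator(sample_strings, vocabulary):
    """Boolean vector: 1 if the sample contains the vocabulary string."""
    present = set(sample_strings)
    return [1 if s in present else 0 for s in vocabulary]


# Hypothetical API-call IDs (0 is reserved for padding) and strings.
api_windows = windows([1, 2, 3, 4, 5], size=3)
corpus = [["kernel32.dll", "http://"], ["kernel32.dll"],
          ["kernel32.dll", "cmd.exe", "http://"]]
vocabulary = build_vocabulary(corpus, top_k=2)
indicator = string_indicator(["cmd.exe", "http://"], vocabulary)
```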

 

DeLaRosa (De La Rosa et al. 2018)

Phase I

Phase II

  
  

The Windows dataset is from Reversing Labs, covers XP, 7, 8, and 10 for both 32-bit and 64-bit architectures, and was gathered over a span of twelve years (2006-2018). They selected nine malware families from the dataset and extracted static features in three groups: bytes, basic, and assembly features.

For byte-level features, they used a sliding window to get the histogram of the bytes and compute the associated entropy in each window. For basic features, they created a fixed-size feature vector given either a list of ASCII strings or import and metadata information extracted from the PE header (strings are hashed, and a histogram of these hashes is calculated by counting the occurrences of each value). For assembly features, the disassembled code generated by Radare2 can be parsed and transformed into graph-like data structures such as call graphs, control flow graphs, and instruction flow graphs.

  
  

Phase III

Phase IV

Evaluation

  

N/A

N/A

ACC:90.1%

PRE:N/A

    

REC:N/A

F1:N/A

    

FPR:N/A

FNR:N/A
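The byte-level features in the DeLaRosa row (a per-window byte histogram and its entropy) can be illustrated with the sketch below. The window and step sizes, the function name, and the input bytes are assumptions; the row does not state the window parameters, and Shannon entropy in bits is assumed as the entropy measure.

```python
import math
from collections import Counter


def window_entropy(data, window=256, step=256):
    """Return the Shannon entropy (in bits) of each byte window."""
    entropies = []
    for start in range(0, len(data) - window + 1, step):
        histogram = Counter(data[start:start + window])
        entropy = -sum((count / window) * math.log2(count / window)
                       for count in histogram.values())
        entropies.append(entropy)
    return entropies


# Placeholder data: a constant window followed by a uniform window.
data = bytes(256) + bytes(range(256))
entropies = window_entropy(data)
```

A constant window gives the minimum entropy and a window where every byte value appears equally often gives the maximum of 8 bits, which is why entropy is a common signal for packed or encrypted regions.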

System Event Based Anomaly Detection (Du et al. 2017; Meng et al. 2019; Das et al. 2018; Brown et al. 2018; Zhang et al. 2019; Bertero et al. 2017)

DeepLog (Du et al. 2017)

Phase I

Phase II

  

More than 24 million raw log entries, 2412 MB in size, are recorded from the 203-node HDFS. Over 11 million log entries with 29 types are parsed and further grouped into 575,061 sessions according to block identifier. These sessions are manually labeled as normal or abnormal by HDFS experts. The resulting HDFS dataset contains 575,061 sessions of logs, among which 16,838 sessions were labeled as anomalous.

The raw log entries are parsed into different log types using Spell (Du and Li 2016), which is based on the longest common subsequence. There are 29 log types in total in the HDFS dataset.

  
  

Phase III

Phase IV

Evaluation

  

DeepLog directly uses one-hot vectors to represent the 29 log keys, without representation learning.

A stacked LSTM with two hidden LSTM layers.

ACC:N/A

PRE:0.95

    

REC:0.96

F1:0.96

    

FPR:N/A

FNR:N/A
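The DeepLog row describes grouping parsed log entries into sessions by block identifier and representing each of the 29 log keys as a one-hot vector. A minimal sketch of those two steps follows. The log messages and log-key numbers are made up for illustration, the function names are hypothetical, and the regular expression assumes block identifiers of the form blk_ followed by an optionally negative integer.

```python
import re
from collections import defaultdict

# Assumed identifier format: "blk_" plus an optionally negative integer.
BLOCK_ID = re.compile(r"blk_-?\d+")


def group_by_block(entries):
    """Group (log_key, message) pairs into per-block sessions of log keys."""
    sessions = defaultdict(list)
    for log_key, message in entries:
        for block in set(BLOCK_ID.findall(message)):
            sessions[block].append(log_key)
    return dict(sessions)


def one_hot(log_key, num_keys=29):
    """Represent a log key (0..num_keys-1) as a one-hot vector."""
    vec = [0] * num_keys
    vec[log_key] = 1
    return vec


# Hypothetical parsed entries: (log key, raw message).
entries = [
    (5, "Receiving block blk_1 src: /10.0.0.1"),
    (22, "Deleting block blk_-2 file /tmp/x"),
    (11, "PacketResponder 1 for block blk_1 terminating"),
]
sessions = group_by_block(entries)
encoded = [one_hot(key) for key in sessions["blk_1"]]
```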

 

LogAnom (Meng et al. 2019)

Phase I

Phase II

  

LogAnom also used the HDFS dataset, the same one as DeepLog.

The raw log entries are parsed into different log templates using FT-Tree (Zhang et al. 2017) according to the frequent combinations of log words. There are 29 log templates in total in the HDFS dataset.

  
  

Phase III

Phase IV

Evaluation

  

LogAnom employed Word2Vec to represent the extracted log templates with more semantic information

Two LSTM layers with 128 neurons

ACC:N/A

PRE:0.97

    

REC:0.94

F1:0.96

    

FPR:N/A

FNR:N/A

Memory Forensics (Song et al. 2018; Petrik et al. 2018; Michalas and Murray 2017; Dai et al. 2018)

DeepMem (Song et al. 2018)

Phase I

Phase II

  

400 memory dumps are collected on a Windows 7 x86 SP1 virtual machine while simulating various random user actions and forcing the OS to randomly allocate objects. The size of each dump is 1 GB.

Construct a memory graph from the memory dumps, where each node represents a segment between two pointers and an edge is created if two nodes are neighbors.

  
  

Phase III

Phase IV

Evaluation

  

Each node is represented by a latent numeric vector from the embedding network.

Fully Connected Network (FCN) with ReLU layer.

ACC:N/A

PRE:0.99

    

REC:0.99

F1:0.99

    

FPR:0.01

FNR:0.01

 

MDMF (Petrik et al. 2018)

Phase I

Phase II

  

Create a dataset of benign host memory snapshots running normal, non-compromised software, including software that executes in many of the malicious snapshots. Each benign snapshot is extracted from memory after ample time has passed for the chosen programs to open. The benign memory snapshot dataset is created by generating samples in parallel with the separate malicious environment.

Various representations of the memory snapshots, including byte sequences and images, that do not rely on domain knowledge of the OS.

  
  

Phase III

Phase IV

Evaluation

  

N/A

A recurrent neural network with LSTM cells and, for image data, a convolutional neural network composed of multiple layers, including pooling and fully connected layers.

ACC:98.0%

PRE:N/A

    

REC:N/A

F1:N/A

    

FPR:N/A

FNR:N/A

Fuzzing (Wang et al. 2019; Shi and Pei 2019; Böttinger et al. 2018; Godefroid et al. 2017; Rajpal et al. 2017)

Learn&Fuzz (Godefroid et al. 2017)

Phase I

Phase II

  

The raw data are about 63,000 non-binary PDF objects, sliced to a fixed size, extracted from 534 PDF files that were provided by the Windows fuzzing team and previously used for extended fuzzing of the Edge PDF parser.

N/A

  
  

Phase III

Phase IV

Evaluation

  

N/A

Char-RNN

ACC:N/A

PRE:N/A

    

REC:N/A

F1:0.93

    

FPR:N/A

FNR:N/A

 

NEUZZ (Shi and Pei 2019)

Phase I

Phase II

  

For each program tested, the raw data is collected by running AFL-2.52b on a single-core machine for one hour. The training data are byte-level input files generated by AFL, and the labels are the bitmaps corresponding to those input files. For the experiments, NEUZZ is evaluated on 10 real-world programs, the LAVA-M bug dataset, and the CGC dataset.

N/A

  
  

Phase III

Phase IV

Evaluation

  

N/A

NN

ACC:N/A

PRE:N/A

    

REC:N/A

F1:0.93

    

FPR:N/A

FNR:N/A

  1. Deep learning metrics are often not available in fuzzing papers. Typical fuzzing metrics used for evaluation are code coverage, pass rate, and bugs.