Bin2vec: learning representations of binary executable programs for security tasks

Tackling binary program analysis problems has traditionally implied manually defining rules and heuristics, a tedious and time consuming task for human analysts. In order to improve automation and scalability, we propose an alternative direction based on distributed representations of binary programs with applicability to a number of downstream tasks. We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs in order to learn a high dimensional representation of binary executable programs. We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks – functional algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as published results, and demonstrate improvement over state-of-the-art methods for both tasks. We evaluated Bin2vec on 49191 binaries for the functional algorithm classification task, and on 30 different CWE-IDs including at least 100 CVE entries each for the vulnerability discovery task. We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code based inst2vec approach, while working on binary code. For almost every vulnerability class in our dataset, our prediction accuracy is over 80% (and over 90% in multiple classes).


Introduction
For many security problems, researchers rely on binary code analysis, as they need to inspect binary executable program files without access to any source code. This is often needed when analyzing commercial code that is protected as intellectual property and whose source code is not available, but it can also be useful in other scenarios. Those include dealing with unsupported or legacy executables, where the information about the exact version of the source code is lost, or even the original source code itself may be lost. Additionally, it is frequently used for testing in order to improve the security of the system, as in black-box penetration testing, where the goal is to check the binary for any weaknesses or vulnerabilities that can potentially be abused. Finally, it is an important part of investigating how hard recovering key parts of the algorithm is, e.g., for the sake of preventing intellectual property theft.
*Correspondence: shushana@usc.edu, Information Sciences Institute, 4676 Admiralty Way, Marina Del Rey, CA, USA
The translation process of going from source code to binary executable programs, also called compilation, is a lossy process in the sense that only basic low-level instructions and data representations understood by the target CPU are preserved. Because of this, it is impossible in the general case to reconstruct the original source code from compiled binary code. The task is even more complicated with commercial binaries as they are often stripped. Stripping a binary removes any debug information and its symbol tables, which contain semantics of variables in the program. When dealing with stripped binaries, even reconstructing function entry points can be challenging.
With a constantly growing number of computing devices in consumer, commercial, industrial and infrastructure applications, as well as with the growing complexity of software applications, the scope of binary code analysis becomes increasingly large. Fast, automated analysis would allow preventing the spreading of bugs and vulnerabilities in all those complex software systems through shared and reused code.
Analyzing binary executable code is difficult because of two related challenges: the size of binary executable programs and the absence of high-level semantic structure in binary code. Indeed, when dealing with a compiled executable, a security engineer is often looking at a file containing up to megabytes of binary code. A precise analysis of such files with existing tools requires large amounts of computational power, and it is particularly difficult or even impossible to do manually. Instead, state-of-the-art tools often rely on a combination of formal models and heuristics to reason about binary programs. Replacing these heuristics with more advanced statistical learning and machine learning models has a high potential for improving performance while keeping the analysis fast.
In recent years we have seen a big surge in applications of machine learning (ML) to the field of security, where researchers routinely turn to ML algorithms for smarter automated solutions. For example, due to rapidly evolving modifications of malware, ML algorithms are frequently applied to malware detection problems. Similarly, ML algorithms allow detecting and reacting to network attacks faster.
Having ML algorithms operate on binary executable programs is a promising direction to bridge the large semantic gap between human abstractions and machine code representations, and to recover high-level semantics which was lost during compilation. Using ML requires obtaining a good, vectorized representation of the data. In the field of security, this problem is usually solved by hand-selecting useful features and feeding those into an ML algorithm for a prediction or a score. Approaches range from defining code complexity metrics and legacy metrics (Theisen et al. 2015), to using a sequence of system calls (Grieco et al. 2016) and many more. Besides being non-trivial and laborious, hand-selecting features raises other issues as well. First, for every task researchers come up with a new set of features. For example, what indicates memory safety violations is unlikely to also signal race conditions. Additionally, some features get outdated and will need to be replaced with future versions of the programming language, compiler, operating system or computer architecture.
The state-of-the-art in machine learning, however, no longer relies on hand-designed features. Instead, researchers use learned features, or what is called distributed representations. These are high-dimensional vectors, modeling some of the desired properties of the data. The famous word2vec model (Mikolov et al. 2013a; Mikolov et al. 2013b), for example, represents words in a high-dimensional space, such that similar words are clustered together. This property of word2vec has made it a long-time go-to model for representing words in a number of natural language processing tasks. We can take another example from computer vision, where it was discovered that outputs of particular layers of the VGG network (Simonyan and Zisserman 2015) are useful for a range of new tasks.
We see an important argument for trying to learn distributed representations: a good representation can be used for new tasks without significant modifications. Unfortunately, for some types of data such a representation is more challenging to obtain than for others. For instance, finding methods for representing longer sentences or paragraphs is still an ongoing effort in natural language processing (Lin et al. 2017b). Representing graphs and incorporating structure and topology into distributed representations is not fully solved either. Binary executable programs are a "hard" case to represent, as they have traits of both longer texts and structured, graph-like data, with important properties of binaries best represented as control or data flow graphs.
Distributed representations for compiled C/C++ binaries, the kind that engineers in the security field deal with the most, have not received much attention, and with this work, we hope to start filling that gap. In fact, current approaches leveraging deep learning models to reason about binary code focus on code clone detection, and therefore, their application to algorithm classification and vulnerability detection is limited to syntactically similar patterns. In contrast, our approach aims to generalize code semantics based on new insights by introducing a graph embedding model which encompasses notions of local control-flow and data-flow in a novel way. We propose a graph-based representation for binary programs that, when used with a Graph Convolutional Network (GCN) (Kipf and Welling 2017), captures semantic properties of the program.
Our main contributions are: (i) To the best of our knowledge, we are the first to suggest a distributed representation learning approach for binary executable programs that is demonstrated to work for different downstream tasks; (ii) To this end, we present a deep learning model for modelling binary executable programs' structure and computations, and for learning their representations; (iii) To prove the concept that distributed representations for binary executable programs can be applied to downstream program analysis tasks, we evaluate our approach on two distinct problems: functional algorithm classification (i.e., the task of recognizing functional aspects of algorithmic properties, as opposed to their syntactic aspects) and vulnerability discovery across multiple vulnerability classes, and show improvement over current state-of-the-art approaches on both.

Related work
Many tasks that rely on the analysis of binary executables are frequently approached by rule-based systems and manually defined heuristics (Aafer et al. 2013; Santos et al. 2009; Karbab et al. 2018; Yamaguchi et al. 2014; Rawat and Mounier 2012; Cha et al. 2012). Machine learning has a proven reputation for boosting performance compared to heuristics and there has been a lot of interest in applications of machine learning to security tasks. We briefly discuss previous work in binary program analysis that relies on machine learning. We structure the literature based on the types of features extracted and by the type of the embedding model applied.

Hand designed features
Designing and extracting features can be considered equivalent to manually crafting representations of binaries. We can classify such approaches based on which form of the compiled binary program was used to extract the features.

Code-based features
The simplest approach to representing a binary is by extracting some numerical or textual features directly from the assembly code. This can be done by using n-grams of tokens, assembly instructions, lines of code, etc. N-grams are widely used in the literature for malware discovery and analysis (Li et al. 2019a; Lee et al. 2018; Kang et al. 2016), as well as vulnerability discovery (Pang et al. 2015; Murtaza et al. 2016). Additionally, there have been efforts focusing on extracting relevant API calls or using traces of system calls to detect malware (Wu et al. 2016; Kolosnjaji et al. 2016).

Graph-based features
Many solutions rely on extracting some numerical features of Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) and/or Data Flow Graphs (DFGs). We group these as models with graph-based features. discovRE (Eschweiler et al. 2016), among other features, uses closeness of control flow graphs to compute similarity between functions. Genius converts CFGs into numeric feature vectors to perform cross-architecture bug search. Yet other works have used quantitative data flow graph metrics to discover malware (Wüchner et al. 2015).

Learned features
Besides manually crafting the representations it is also possible to employ neural models for that purpose. This allows expressing and capturing more complicated relations of characteristics of code. Here we can classify the approaches based on whether they use sequential neural networks or graph neural networks.

Sequence embeddings
The body of work on the naturalness of software (Hindle et al. 2016; Ray et al. 2016; Allamanis et al. 2018) has inspired researchers to try applying NLP models for security applications in general, and binary analysis in particular. Researchers have suggested serializing ASTs into text and using them with LSTMs for vulnerability discovery (Lin et al. 2017a). Some previous vulnerability discovery efforts also use RNNs on lines of source code (Li et al. 2018). More recently, INNEREYE proposed to use LSTMs in a Siamese architecture for binary code similarity detection (Zuo et al. 2019). The closest to our work is that of Ding et al. (2019), which starts by constructing a graph that is enriched with selective callee expansion. The authors then sample random walks from this graph to generate sequences of instructions and train a paragraph-to-vector model on these sequences. This approach is similar in spirit to earlier graph embedding approaches, such as DeepWalk (Perozzi et al. 2014), which sampled random walks of nodes and used word embedding models on sequences of adjacent nodes for representation learning. However, today these approaches for graph embedding are no longer popular, as graph neural networks based on message passing and neighbourhood aggregation have been shown to perform much better.

Graph embeddings
Graph embedding neural models are a popular choice for tackling binary code-related tasks because the construction of Control Flow or Data Flow Graphs is frequently an intuitive and well-understood first step in binary code analysis. For instance, graph embedding models have successfully been used on top of Control Flow Graphs for tackling the task of code clone detection in source code and binary programs (White et al. 2016; Xu et al. 2017; Li et al. 2019b; Zhou et al. 2019). Among these, Gemini uses a Siamese architecture on top of a graph embedding model for the binary code clone detection task. The graphs they use, attributed control flow graphs or ACFGs, are CFGs that are enriched with a few manually defined features. In our work, instead of enhancing the basic blocks in the CFG with a few attributes, we suggest enriching them by expanding the computations in each basic block into a computational tree, and rely on the fact that the graph embedding model will be able to capture attributes like the number of instructions if necessary. Graph Matching Networks (GMNs) (Li et al. 2019b), on the other hand, are based on the idea that instead of computing an embedding and then using either a distance function or a Siamese network for the comparison, it might be beneficial to directly compare two graphs. So, as opposed to Gemini, where representations for known vulnerable or benign programs are pre-computed, GMN needs to compute similarity for every pair of programs individually, starting from scratch. They demonstrate that this approach has better performance compared to Siamese architectures, but it is clearly slower, and more importantly for us, it does not produce program embeddings.
Other research using the graph structure of binary programs includes using Conditional Random Fields on an enhanced Control Flow Graph to attempt to recover the debug information of the binary program (He et al. 2018).

Model
We start by converting the binary executable to a program graph that is designed to allow mathematically capturing the semantics of the computations in the program. Next, we use a graph convolutional neural network to learn a distributed representation of the graph. Below we describe the process of constructing the program graph, followed by a brief introduction to how graph convolutional neural networks work. We also describe the baseline model that we use for evaluation and comparison, alongside previous existing approaches.

Program graphs
We start by disassembling the binary program and constructing a control flow graph (CFG). We use static interprocedural CFGs, which we construct using the angr library (Shoshitaishvili et al. 2016).
The fact that each basic block in CFG is executed linearly allows us to unfold the instructions within each basic block and represent them as a directed, computational tree, similar to an Abstract Syntax Tree (AST). The result of this process is schematically depicted in Fig. 1a.
Within each basic block, computations do not necessarily all depend on each other. There may be chunks of code that can be reordered inside the basic block without affecting the final result. In this case the approach described so far yields a forest of computations. To connect the trees in the forest we add Source and Sink nodes at the beginning and at the end of each basic block as a parent, or correspondingly a child, for all the trees generated from that basic block, which is demonstrated in Fig. 1b. The resulting graphs are then connected following the same topology that basic blocks originally had in the CFG, as shown in Fig. 1c.
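The Source/Sink construction can be illustrated with a minimal Python sketch. It is not the authors' implementation: the edge-list encoding, the function names `wrap_block` and `build_program_graph`, the toy node labels, and the choice to link the Sink of a predecessor block to the Source of its successor are all assumptions made for illustration.

```python
def wrap_block(block_id, edges):
    """Add Source and Sink nodes around the computation forest of one block.

    `edges` is a list of (argument, instruction) pairs. Node names are
    assumed to be unique within the block (hypothetical encoding).
    """
    nodes = {n for edge in edges for n in edge}
    has_incoming = {dst for _, dst in edges}
    has_outgoing = {src for src, _ in edges}
    source, sink = (block_id, "Source"), (block_id, "Sink")
    wrapped = list(edges)
    # Source becomes the parent of every node that nothing feeds into.
    wrapped += [(source, n) for n in sorted(nodes - has_incoming)]
    # Sink becomes the child of every node that feeds into nothing.
    wrapped += [(n, sink) for n in sorted(nodes - has_outgoing)]
    return wrapped


def build_program_graph(block_forests, cfg_edges):
    """Wrap every block, then connect blocks following the CFG topology."""
    graph = []
    for block_id, edges in block_forests.items():
        graph += wrap_block(block_id, edges)
    # Assumption: a CFG edge u -> v becomes Sink(u) -> Source(v).
    graph += [((u, "Sink"), (v, "Source")) for u, v in cfg_edges]
    return graph


# Toy input: block b0 holds two independent computations (a forest).
block_forests = {
    "b0": [("t1", "Add32"), ("0x4", "Add32"), ("t2", "Not32")],
    "b1": [("eax", "Sub32"), ("0x1", "Sub32")],
}
cfg_edges = [("b0", "b1")]
program_graph = build_program_graph(block_forests, cfg_edges)
```

In this toy input, the two unrelated computations in `b0` end up joined through the shared Source and Sink nodes, so the block is a single connected component before the blocks are linked.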
We construct the above-mentioned computational trees from VEX intermediate representation (IR) of the binary. Figures 2 and 3 provide demonstration of the process.
Every node of the resulting tree is thus labelled with a constant, a register, a temporary or an operation. The edges of the tree are directed from the argument to the instruction. Within each basic block we reuse nodes that correspond to addresses, temporaries, constants and registers to tie together related computations. VEX IR provides Static Single Assignment form (SSA). This means that each assembly instruction in a basic block is lifted and "spilled" into multiple IR statements operating on temporary variables that are each used only once (the goal being to make all side effects of an instruction explicit). However, VEX does not track instances of different definitions and uses of the same register across instructions within the basic block, which we implemented to ensure we do not introduce fake data-dependence edges. In our implementation, if an instruction overrides or redefines the content of a register, its subscript is incremented. For example, for the eax register, we start from eax_0 and increment it to eax_1. This is necessary so that we do not reuse the same node for eax_0 and eax_1.
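The register subscripting described above can be sketched as follows. This is an illustrative simplification under stated assumptions: statements are reduced to hypothetical (destination, sources) pairs, the register set `REGISTERS` is a made-up subset, and reads are resolved before the write of the same statement.

```python
REGISTERS = {"eax", "ebx", "ecx"}  # hypothetical subset of register names


def version_registers(statements):
    """Attach a subscript to every register use and definition.

    `statements` is a list of (destination, [sources]) pairs in program
    order. Each register starts at subscript 0 and the subscript is
    incremented whenever a statement redefines the register.
    """
    version = {reg: 0 for reg in REGISTERS}
    renamed = []
    for dest, sources in statements:
        # Reads refer to the version that is live before this statement.
        reads = [f"{s}_{version[s]}" if s in version else s for s in sources]
        if dest in version:
            version[dest] += 1
            dest = f"{dest}_{version[dest]}"
        renamed.append((dest, reads))
    return renamed


block = [
    ("eax", ["eax", "0x4"]),  # eax = eax + 4
    ("ebx", ["eax"]),         # ebx = eax
    ("eax", ["ebx", "eax"]),  # eax = ebx + eax
]
renamed = version_registers(block)
```

Because `eax_0`, `eax_1` and `eax_2` are distinct labels, a graph builder that reuses nodes by label would not tie the first and last statements together through a single `eax` node.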
As a last step, we remove redundant edges and nodes, particularly the Iex_Const node that follows every constant, and chains of Iex_WrtTmp → 't%' → Iex_RdTmp. This is demonstrated in Fig. 3.
Fig. 2 An example of the program graph. Parts of the graph are highlighted in the same color as the instructions on lines 1 and 2, to demonstrate where those instructions were mapped to in the graph
After the graph construction is complete, we remove SSA indices for temporary variables and registers to reduce the number of distinct labels. Each node is then described by a feature vector of zeros, with the exception of the position that corresponds to the label of the node in our fixed ordering of labels, which is set to one. This representation is known as a one-hot representation. We will further refer to the feature matrix as X. Note that we use the words "feature" or "features", "embeddings" and "representation" interchangeably.
Fig. 3 Redundant instructions are removed to contract the graph. The graph is shown before the contraction on the left, and after the contraction on the right
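As a small illustration of the one-hot feature matrix X, the sketch below builds one row per node from a fixed ordering of labels. The vocabulary and node labels are toy values chosen for the example, and `one_hot_features` is a hypothetical helper name.

```python
def one_hot_features(node_labels, vocabulary):
    """Build the feature matrix X with one one-hot row per node."""
    index = {label: i for i, label in enumerate(vocabulary)}
    X = []
    for label in node_labels:
        row = [0] * len(vocabulary)
        row[index[label]] = 1  # the only non-zero position in the row
        X.append(row)
    return X


# Toy fixed ordering of labels and the labels of five graph nodes.
vocabulary = ["Add32", "Sink", "Source", "eax", "t"]
node_labels = ["Source", "eax", "t", "Add32", "Sink"]
X = one_hot_features(node_labels, vocabulary)
```

X has as many rows as the graph has nodes and as many columns as there are distinct labels, which is why reducing the number of labels directly reduces the input dimensionality.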

Graph convolutional networks
The model we used for learning representations is a Graph Convolutional Neural Network (GCN) (Kipf and Welling 2017). Graph neural embedding is a fast developing field, and alternative graph representation learning models include GraphSAGE (Hamilton et al. 2017) and Gated Graph Neural Networks, as well as a number of others. In the literature, GCNs consistently perform on par with or better than more recent variants of graph neural networks (Monti et al. 2017; Liu et al. 2019; Velickovic et al. 2017; Chen et al. 2018), while being simpler and oftentimes faster. We chose GCN because it provides a good trade-off between simplicity, performance, and speed. The latter is important due to the low-level nature of the binary code; it is reasonable to expect the program graphs to grow quite large, which forces us to favour a model with weight updates that can be efficiently computed in batches.
GCN consists of a few stacked graph convolutional layers. A graph convolutional layer is applied simultaneously to all nodes in the graph. For each node, it averages the features of that node with features of its neighbours. Features of different nodes are scaled differently in the process of averaging and these weights are learned, i.e. they are the parameters of the graph convolutional layer. After the averaging, each node is assigned the resulting vector as its new feature vector and we proceed to either apply a different graph convolutional layer, or compute the loss and perform backpropagation to update the parameters.
Formally, this process of computing new feature vectors, known as forward pass or propagation, for the (l + 1)-st graph convolutional layer can be described as follows:
$$H^{(l+1)} = \mathrm{ReLU}\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
where $\tilde{A}$ is the adjacency matrix of the graph with added self-loops, $\tilde{D}$ is its diagonal out-degree matrix, $\mathrm{ReLU}(x) = \max(0, x)$ is the non-linearity or activation function, $H^{(l)}$ is the result of propagation through the previous layer, $H^{(0)}$ being X, and $W^{(l)}$ is a layer-specific trainable weight matrix. Since one graph convolutional layer averages representations of the immediate neighborhood of the node, after performing k graph convolutions we incorporate the information from the k-hop neighborhood of the node.
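A single propagation step can be sketched in plain Python. The sketch assumes the symmetric normalization of Kipf and Welling (2017), that is ReLU(D^{-1/2} Ã D^{-1/2} H W), with degrees taken as row sums of Ã; the helper names, the toy graph and the weight values are illustrative and not taken from the paper.

```python
import math


def matmul(A, B):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(A)
    # Add self-loops so that a node keeps part of its own features.
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    # Row sums of A_tilde, i.e. out-degrees including the self-loop.
    deg = [sum(row) for row in A_tilde]
    A_hat = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    Z = matmul(matmul(A_hat, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


# Toy 3-node path graph, one-hot input features and fixed weights.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W0 = [[1.0, -1.0], [0.0, 1.0], [1.0, -1.0]]
H1 = gcn_layer(A, H0, W0)
```

Stacking k such calls, each with its own weight matrix, lets information travel k hops; in practice the weights would be learned and the adjacency matrix stored sparsely.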
From our description, it follows that after the forward pass, the graph convolutional network outputs features for each node in the graph. We will refer to this new feature matrix as Z. Note that Z still has n rows -one row per node, but it can have a different number of columns.
To get the representation of the entire graph, we can aggregate the features of all nodes in the graph. Here it is possible to use any aggregation function -summation, averaging, or even a neural attention mechanism, but in our experiments we went for a simple sum aggregate. A schematic illustration of this entire process is available in Fig. 4.
The aggregated representation is fed into a two-layer perceptron and passed through a softmax, which is defined as $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$. We frame our tasks as classification and use cross-entropy error as the objective function for the optimization. We cover our procedure for selecting hyperparameters for the GCN model in more detail in the "Task 1. experimental setup" section.
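The readout and classification head can be sketched as below. For brevity the sketch omits the two-layer perceptron and feeds the summed node features straight into the softmax, which is a simplification rather than the described model; the feature values and function names are illustrative.

```python
import math


def sum_readout(Z):
    """Aggregate node features into one graph vector by summing rows."""
    return [sum(col) for col in zip(*Z)]


def softmax(logits):
    """Numerically stable softmax: exp(x_i) / sum_j exp(x_j)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


# Toy node feature matrix Z: three nodes, two output dimensions.
Z = [[0.5, 0.0], [1.0, 0.2], [0.5, 0.3]]
graph_vector = sum_readout(Z)
probs = softmax(graph_vector)
loss = cross_entropy(probs, target=0)
```

A sum readout makes the graph vector independent of node ordering, and unlike an average it retains information about graph size.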

Baselines
We wanted to compare our proposed representation to another task-independent representation, in particular, to one that used code-based features or embeddings. We experimented with Long Short Term Memory (LSTM) neural networks and Support Vector Machine (SVM) classifiers for that purpose. We interpreted instructions as words, and a sequence of instructions as a sentence, following a number of similar approaches in the field, e.g. (Zuo et al. 2019). We experimented using both SVM and LSTM with the assembly instructions directly, as well as with the code lifted to VEX IR. In our experiments, an SVM classifier with a Gaussian kernel and a bag-of-words representation of VEX IR gave us the best performance, so that is the setup we chose as a baseline. Each line of IR is tokenized to be a single "word". The vocabulary for the bag-of-words was obtained from the training part of the dataset. We used frequency thresholding to remove infrequent entries and reduce data sparsity. The frequency thresholds were found empirically on the validation part of the dataset.
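The bag-of-words featurization with frequency thresholding can be sketched as follows. The IR lines are made-up strings that only imitate the look of VEX IR, the threshold `min_count=2` is arbitrary, and the SVM itself is left out; the sketch covers only the feature extraction step.

```python
from collections import Counter


def build_vocabulary(train_programs, min_count):
    """Keep IR lines that occur at least `min_count` times in training."""
    counts = Counter(line for program in train_programs for line in program)
    return sorted(line for line, c in counts.items() if c >= min_count)


def bag_of_words(program, vocabulary):
    """Count how often each vocabulary entry occurs in one program."""
    counts = Counter(program)
    return [counts[line] for line in vocabulary]


# Each program is a list of IR lines; every whole line is one "word".
train_programs = [
    ["t1 = GET:I32(eax)", "t2 = Add32(t1,0x4)", "PUT(eax) = t2"],
    ["t1 = GET:I32(eax)", "PUT(eax) = t1"],
    ["t1 = GET:I32(eax)", "t2 = Add32(t1,0x4)", "t2 = Add32(t1,0x4)"],
]
vocabulary = build_vocabulary(train_programs, min_count=2)
vector = bag_of_words(train_programs[2], vocabulary)
```

Lines that fall below the threshold simply do not contribute a dimension, which is what reduces sparsity in the resulting vectors.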

Task description
We evaluate the performance of our proposed representations on two independent tasks. In the first, we test the proposed representations for functional algorithm classification in binary executable programs through classifying coding challenges. In our second task, we want to demonstrate the performance of learned representations on a common security problem: discovery of vulnerable compiled C/C++ files. The two tasks are semantically different and we demonstrate in the later sections that both can be successfully tackled with representations constructed and learned in the same way.

Task 1: Functional algorithm classification
Algorithm classification is crucial for semantic analysis of code. We qualify it as "functional" by opposition to "syntactic", i.e., we aim to capture the semantics of functional properties of algorithms. It can be used for creating assisting tools for security researchers to understand and analyze binary programs, or discover inefficient or buggy algorithms, etc.
In this task, we are looking at real-world programs submitted by students to solve programming competition problems. We chose such a dataset because the programs in it, being written by different students, naturally encompass more implementation variability than it would be possible to get by using, for instance, standard library implementations. Our goal is to classify solutions by the problem prompts that the solution was written for.
We present a typical example of a programming competition problem prompt in Table 1. The provided example is for illustrative purposes only, as it is taken from ACM Timus (http://acm.timus.ru) and is not part of our dataset.
From our definition and the dataset, it follows that we define the equivalence of two programs as them solving the exact same problem. Hence, in this task, we test the ability of the model to capture the higher-level semantic similarity, and to take into account program behaviour, functionality and complexity, while ignoring syntactic differences wherever possible.

Task 2: vulnerability discovery
Software contains bugs, which in the worst case can lead to weaknesses that leave vulnerable systems open to attacks. Such security bugs, or vulnerabilities, are classified in a formal list of software weaknesses, the Common Weakness Enumeration (CWE). Vulnerability discovery is the process of finding parts of vulnerable code that may allow attackers to perform unauthorized actions. It is an important problem for computer security. The typical target of vulnerability discovery is programming mistakes accidentally introduced in benign commodity programs by their authors. Our work excludes software specifically crafted to behave in a malicious way, and focuses on benign programs. Due to the large variability among
Most vulnerability discovery techniques rely on dynamic analysis for program exploration, the most common one being fuzzing (Zalewski 2017). Such models offer a high level of precision, at the cost of shallow program coverage: only a subset of execution traces for a given program (along with a set of input test cases) can be observed in finite time, leaving large parts of the program unexplored. On the other hand, static analysis provides better program coverage at the cost of lower precision. In addition to these challenges come a range of fundamental problems in program analysis related to undecidability (e.g., the halting problem, i.e., "Does the program terminate on all inputs?") and implementation. These issues emerge because vulnerabilities may span very small or very large chunks of code and involve a range of different programmatic constructs. This raises the question: at what level of granularity should we inspect programs for vulnerabilities or report them to security researchers?
In this work, we are concerned with the question of learning representations for the entire binary program that will help to discover vulnerabilities statically, while leaving the questions of handling large volumes of source code and working on variable levels of granularity for future work. Our work builds on standard binary-level techniques for control-flow recovery (i.e., the reconstruction of a CFG), which is a well-studied problem where state-of-the-art models perform well with high accuracy and scalability (Andriesse et al. 2016).

Table 1 An example prompt for programming competition problems and their corresponding problem numbers and names. The example is taken from ACM Timus http://acm.timus.ru/
Prompt | Problem #
You have a number of stones with known weights w_1, ..., w_n. Write a program that will rearrange the stones into two piles such that weight difference between the piles is minimal | Stone Pile

Datasets and experimental setup
Our first dataset, introduced by Mou et al. (2016), consists of 104 online judge competition problems and 500 C or C++ solutions for each problem submitted by students. We only kept the files that could be successfully compiled on a Debian operating system, using gcc8.3, without any optimization flags. This left us with 49191 binary executable files, each belonging to one of 104 potential classes. Each class in this dataset corresponds to a different problem prompt and our goal is to classify the solutions according to their corresponding problem prompts.
The second dataset we used is the Juliet C/C++ test suite (Boland and Black 2012). This is a synthetically generated dataset that was created to facilitate research on vulnerability scanners and enable benchmarking. The files in the dataset are grouped by their vulnerability type, the CWE-ID. Each file consists of a minimal example to recreate the vulnerability and/or its fixed version. The Juliet test suite uses OMITBAD and OMITGOOD macros to guard vulnerable and non-vulnerable functions, respectively. We compiled the dataset twice, once with each macro defined, to generate binary executable files that contain vulnerabilities and those that do not. The dataset contains 90 different CWE-IDs. However, some of them consist of Windows-only examples, which we omitted. Note that even though our approach is not platform-specific, in this work we limit our experimentation to Linux only.
Most CWE-IDs had too few examples to train a classifier and/or to report any meaningful statistics on. Thus, we also omitted any CWE-ID that had fewer than 100 files in its testing set after a 70:15:15 training:validation:test split, because for those cases the reported evaluation metric would be too noisy. As a result, we experimented on vulnerabilities belonging to one of 30 different CWE-IDs, presented in Table 2. We trained a separate classifier for every individual CWE-ID, which was required because files associated with each CWE-ID may or may not contain other vulnerability types.
We trained the neural network model with early stopping, where the number of training epochs was found on the validation set.

Task 1. experimental setup
For experiments in the functional algorithm classification task, we randomly split all the binaries in the first dataset into train:test:validation sets with ratios 70:15:15. We use the training set for training and extracting some additional helper structures, such as vocabulary for the bag of words models and counting frequencies for thresholding in neural network models. We use the validation set for model selection and finding the best threshold values. After finding the best model, we evaluate its performance on the testing set. The experiments are cross-validated and averaged over 5 random runs.
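The random 70:15:15 split repeated over five runs can be sketched as below. The file names are placeholders, and the use of seeds 0 to 4 and of Python's `random` module are assumptions made only to keep the example reproducible.

```python
import random


def split_dataset(items, seed, percents=(70, 15, 15)):
    """Shuffle items and split them into train, validation and test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding of the sizes.
    n_train = len(items) * percents[0] // 100
    n_val = len(items) * percents[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


files = [f"solution_{i}.bin" for i in range(100)]  # placeholder names
train, val, test = split_dataset(files, seed=0)

# One independent split per random run; metrics are averaged afterwards.
runs = [split_dataset(files, seed) for seed in range(5)]
```

Helper structures such as vocabularies and frequency counts would be rebuilt from the training part of each run, so nothing from the validation or test parts leaks into them.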
For SVMs, in the model selection phase, we perform a grid-search over the penalty parameter C and pick a value for the vocabulary threshold to remove any entry that does not have a substantial presence in the training set to be useful for learning. After the trimming our vocabulary contains about 10-11K entries (the exact number changes from one random run to another).
For GCN-based representation, we follow similar logic and use the training set to find and remove infrequent node labels. Here too the exact threshold is decided via experimentation on the validation set. On average, we keep about 7-8K different node labels. Very infrequent terms are replaced with a placeholder UNK, or CONST if it is a hexadecimal.
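The replacement of infrequent node labels can be sketched as follows. The threshold value, the toy labels and the rule used to recognise a hexadecimal label (a `0x` prefix followed by valid hex digits) are assumptions for illustration.

```python
from collections import Counter


def is_hex(label):
    """Return True for labels such as '0x1f' (assumed constant format)."""
    if not label.lower().startswith("0x"):
        return False
    try:
        int(label, 16)
        return True
    except ValueError:
        return False


def frequent_labels(train_labels, min_count):
    """Labels seen at least `min_count` times in the training graphs."""
    counts = Counter(train_labels)
    return {label for label, c in counts.items() if c >= min_count}


def normalize_label(label, keep):
    """Map an infrequent label to CONST (hexadecimal) or UNK (other)."""
    if label in keep:
        return label
    return "CONST" if is_hex(label) else "UNK"


train_labels = ["Add32", "Add32", "0x0", "0x0", "0xdeadbeef", "Iop_Rare"]
keep = frequent_labels(train_labels, min_count=2)
unseen = ["Add32", "0x0", "0xdeadbeef", "Iop_Rare", "0x1234"]
normalized = [normalize_label(label, keep) for label in unseen]
```

Frequent constants such as `0x0` keep their own label, while rare addresses and immediates collapse into the single CONST placeholder.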
We pick hyperparameters of the GCN model by their performance on the validation set. Figure 5 demonstrates the influence of the depth (number of graph convolution layers) and width (size of each graph convolution layer) on the performance of the model for Task 1. Figure 5(A) shows the performance of models with depths from 1 to 8 layers, while the dimensionality of every layer is set to 64. As can be seen, increasing the depth of the model up to 4 layers improves performance, but additional layers after that do not always improve performance. Figure 5(C) compares the performance of four models where each model has the same number of layers (3), but different layer sizes: 32, 64, 128 or 256. From here we see that increasing the size of the layers from 64 to 128 provides a moderate improvement, but increasing the size further does not affect the performance. Figures 5(B) and (D) show the duration of training, in seconds, of each of the models discussed above on 100 examples. Based on these general findings we perform some additional experimentation and deploy a GCN with 3 layers that has 128 dimensions in its first two layers and 64 dimensions in its last layer.

Task 2. experimental setup
In the vulnerability discovery experiments, we train a separate classifier for each of 30 different CWE-IDs. Note that when training and testing the classifier for a given CWE-ID, we only include the binaries that are specifically marked as good or bad with regard to that CWE-ID. For every CWE-ID, we split its corresponding binaries into train:validation:test sets with ratios 70:15:15, and report results averaged over 5 random runs. We use the training sets for training the models and the validation sets for the grid search over the penalty parameter C in SVMs. We report the performance of the best model measured on the testing sets. Here we reuse some statistics obtained on the first dataset, in particular the frequency thresholds and bag-of-words vocabularies. We do so because we need to train a separate classifier for each CWE-ID, 30 SVM classifiers and 30 NN classifiers in total, and selecting these values anew for each one would lead to a huge search space in the model selection phase.
We are not aware of related work on vulnerability discovery that performs its evaluation on the Juliet Test Suite. Thus, to give readers a better understanding of how our proposed model would fare compared to other existing approaches, we performed an additional experiment using the Asm2Vec model (Ding et al. 2019) on the Juliet Test Suite. Asm2Vec is a clone search engine that relies on vector representations of assembly functions. In the original paper, the authors suggested its usefulness as a vulnerability detection tool that finds duplicates of known vulnerable functions. We tried to replicate that scenario as faithfully as possible by training Asm2Vec 4 on the Juliet Test Suite and comparing the resulting representations to differentiate between vulnerable and non-vulnerable instances. Since Asm2Vec poses vulnerability detection as a retrieval problem, we follow the example in their paper and report the Precision@15 metric. For each vulnerable function, we find the 15 functions most similar to it according to cosine similarity and compute the percentage of vulnerable functions among them. It is worth noting that we search for similar functions among all vulnerable and non-vulnerable functions per CWE-ID.
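The retrieval metric described above can be sketched as follows. This is an illustrative reading rather than the evaluation code used in the experiments: the function names are hypothetical, the query function is excluded from its own neighbour list (an assumption), and the embeddings are toy 2-dimensional vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def precision_at_k(embeddings, is_vulnerable, k=15):
    """Average, over vulnerable queries, of the vulnerable fraction in the top-k."""
    scores = []
    for query, query_is_vulnerable in enumerate(is_vulnerable):
        if not query_is_vulnerable:
            continue
        # Rank every other function (vulnerable or not) by similarity to the query.
        candidates = [i for i in range(len(embeddings)) if i != query]
        candidates.sort(key=lambda i: cosine(embeddings[query], embeddings[i]),
                        reverse=True)
        top = candidates[:k]
        if top:
            scores.append(sum(1 for i in top if is_vulnerable[i]) / len(top))
    return sum(scores) / len(scores) if scores else 0.0

# Toy data: three vulnerable functions clustered together, two non-vulnerable ones.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
toy_labels = [True, True, True, False, False]
p_at_2 = precision_at_k(toy_embeddings, toy_labels, k=2)
p_at_4 = precision_at_k(toy_embeddings, toy_labels, k=4)
```

In the toy data the two nearest neighbours of every vulnerable function are the other two vulnerable functions, so widening the cut-off to 4 necessarily pulls in the non-vulnerable ones and lowers the score.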
We set most of the hyperparameters of Asm2Vec following the original paper, but tune the dimensionality of the representation and the learning rate. To find the best values for these, we use grid search over the sets {50, 100, 150, 200} and {0.05, 0.025, 0.01}, respectively. The final results that we report for Asm2Vec are computed on the testing set and are the average of 5 random runs.
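A grid search over these two sets amounts to trying all 12 combinations and keeping the best one. In the sketch below, `evaluate` is a hypothetical callback standing in for "train Asm2Vec with this configuration and return its score on held-out data"; the toy scoring function exists only so the example runs and does not reflect any real result.

```python
from itertools import product

DIMENSIONS = (50, 100, 150, 200)
LEARNING_RATES = (0.05, 0.025, 0.01)

def grid_search(evaluate, dimensions=DIMENSIONS, learning_rates=LEARNING_RATES):
    """Return the (dimension, learning_rate) pair with the highest score."""
    best_score = float("-inf")
    best_config = None
    for dim, lr in product(dimensions, learning_rates):
        score = evaluate(dim, lr)  # hypothetical: train and score one configuration
        if score > best_score:
            best_score, best_config = score, (dim, lr)
    return best_config, best_score

# Toy stand-in for a real training run; it simply peaks at (150, 0.025).
def toy_evaluate(dim, lr):
    return -abs(dim - 150) - abs(lr - 0.025) * 1000

tried = []
best_config, best_score = grid_search(
    lambda dim, lr: tried.append((dim, lr)) or toy_evaluate(dim, lr)
)
```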

Evaluation and results
We evaluate performance using accuracy, following the previous work against which we compare our results. Table 3 contains the quantitative evaluation of our representation for Task 1. Our proposed representation outperforms our own SVM baseline, the TBCNN model (Mou et al. 2016), and the current state of the art for this task, inst2vec (Ben-Nun et al. 2018). We manage to reduce the error by more than 40%, thus setting a new state-of-the-art result. It should additionally be mentioned that both TBCNN and inst2vec start from the C source code of the programs to make predictions, whereas our baseline SVM and our proposed model use only the compiled executable versions. Highlighting a few important differences between our approach and inst2vec helps to clarify some of the contributions of our approach. To construct the contextual flow graphs, the authors of inst2vec compile the source code to LLVM IR, which contains richer semantic information than the VEX IR that we use in this work. Because it is more high-level, LLVM IR is a difficult target for lifting from binary executable files 5.

Task 1
Another key difference is that instead of learning the representations of individual tokens and then combining the tokens into a program using a sequential model, we learn the representations of all the tokens in the program jointly, thus learning the representation of the entire program. inst2vec, on the other hand, ignores the structural properties of the program at that step. Our results show that we can achieve better performance, despite inst2vec starting from a semantically richer LLVM IR. We believe this indicates the importance of using the structural information at all stages of learning for obtaining good program embeddings.
Figure 6 contains the evaluation of our representation for Task 2. Here, the classifier based on our proposed representation outperforms our SVM baseline in all cases except two: CWE-ID590, Free of Memory not on the Heap, and CWE-ID761, Free of Pointer not at Start of Buffer. In both of these cases the difference in accuracy is less than 5%. For the remaining CWE-IDs, our proposed representation demonstrates a significant gain in performance. In the extreme case of CWE-617, Reachable Assertion, it outperforms the baseline by about 25%, and in many other cases the gain in prediction accuracy is between 10% and 20%.
Table 4 reports the results we obtained from running Asm2Vec on the Juliet Test Suite. It is important to keep in mind that these numbers are not directly comparable to our results, as they correspond to two different metrics. Rather, this experiment demonstrates the complexity of the dataset and the capacity of Asm2Vec to capture vulnerabilities on it. While Bin2Vec achieves more than 80% accuracy for almost all CWE-IDs, Asm2Vec has a Precision@15 of 0.5 or 0.6 in many cases, which means that only about half of the retrieved similar functions were in fact vulnerable. Asm2Vec reaches its highest Precision@15 of 0.77 for CWE-ID 416, Use After Free, which corresponds to about 1 in 4 retrieved functions being incorrectly labelled as vulnerable. For comparison, for the same vulnerability type Bin2Vec achieves near-perfect performance.

Task 2
Additionally, we can indirectly compare our results for the second task with those presented in two surveys that use the Juliet Test Suite as a benchmark for evaluating commercial static analysis vulnerability discovery tools (Velicheti et al. 2014; Goseva-Popstojanova and Perhinschi 2015). It must be noted that the commercial tools in those experiments probably did not use most of the programs for each CWE-ID as a training set. Additionally, the tools considered in those surveys make their predictions based on source code and not binaries. Nevertheless, comparing the accuracies reported in those surveys with ours suggests that our proposed representation performs better for vulnerability discovery than commercial static analysis tools. For example, on CWE-IDs from 121 to 126, which are all memory buffer errors, Velicheti et al. (2014) report less than 60% accuracy, whereas our model scores higher than 80% for each of those CWE-IDs. For the tools studied in Goseva-Popstojanova and Perhinschi (2015), our model consistently outperforms three of the four static analysis tools, and outperforms the fourth by a considerable margin in all cases but two. Those two are CWE-ID122, Heap-based Buffer Overflow, where the commercial tool scores a few percent higher, and CWE-ID590, Free of Memory not on the Heap.
These results suggest that our representation has good prospects for use in vulnerability discovery tools. For almost every vulnerability type our prediction accuracy is above 80%, and for many it is above 90%.

Discussion and future work
Software in production is usually complex and large, capable of performing many different functions in different use cases. In contrast, the programs in our evaluation datasets are single-purpose, solving a single task with a relatively small number of steps. Additionally, the entirety of each program in the Juliet Test Suite is relevant to the vulnerability discovery task, unlike real software, where most of the code is not vulnerable and only a small part of it may have an issue. This can potentially be solved by introducing representations that can be computed at different levels of coarseness. This is a non-trivial task, but our findings hint that once it is completed we may be able to achieve far better results for different problems on production software than is currently possible. Additionally, we need to gain a better understanding of what properties are captured by such a representation, how best to use them, and how to add other desirable properties. Another challenge left for future work is extending this approach to cross-architecture and cross-compiler binaries.
Figure caption: Experimental results for vulnerability discovery on the Juliet test suite
There are several avenues for extending our work. First, it will be interesting to see whether using recent extensions of GCNs, such as the MixHop model (Abu-El-Haija et al. 2019), which propagates information through higher-order node neighbourhoods, will result in better performance. Additionally, to test the utility of Bin2Vec in real-world problems, we would like to apply it to more complex and larger-scale vulnerability datasets.

Conclusion
We introduced Bin2Vec, a new model for learning distributed representations of binary executable programs. Our learned representation has strong potential to be used in a wide variety of binary analysis tasks. We demonstrate this by using our learned representations for classification in two semantically different tasks: algorithm classification and vulnerability discovery. We show that for both tasks our proposed representation achieves better qualitative and quantitative performance than state-of-the-art approaches, including inst2vec and common machine learning baselines.