GLDOC: detection of implicitly malicious MS-Office documents using graph convolutional networks

Abstract

Malicious MS-Office documents have become one of the most effective attack vectors in APT attacks. Although many protection mechanisms exist, they have proved easy to bypass, and existing detection methods perform poorly on malicious documents that exploit unknown vulnerabilities or exhibit few malicious behaviors. In this paper, we first define im-documents: vulnerable documents that show implicitly malicious behaviors and evade most public antivirus engines. We then present GLDOC, a GCN-based framework that aims to detect im-documents effectively through dynamic analysis and to cover the blind spots of past detection methods. Beyond system calls, which are the sole focus of most research, we capture all dynamic behaviors in a sandbox, take the process tree into consideration, and reconstruct both into graphs. Using one channel to learn each graph, GLDOC trains a 2-channel network together with a classifier, formulating malicious document detection as a graph learning and classification problem. Experiments show that GLDOC achieves a good balance between accuracy and false alarm rate, 95.33% and 4.33% respectively, outperforming other detection methods. In a further test under a simulated 5-day attack scenario, the framework maintains stable, high detection accuracy on unknown vulnerabilities.

Introduction

APT attacks have intensified in recent years and have become a serious threat to cyberspace security. According to FireEye, more than 91% of APT attacks originate from phishing emails. Hackers use social engineering to lure users into opening malicious attachments in carefully crafted emails. Among these attachments, more than 65.4% are MS-Office documents (Global Advanced Persistent Threat APT Research Report 2020). In other words, the malicious MS-Office document has become one of the most effective attack vectors in APT attacks (Top 10 vulnerabilities used by APT organizations in recent years. xxxx; A roundup of the world's top 10 APT attacks in 2018).

Facing such threats, Microsoft Office provides several mechanisms to reduce the likelihood of attack, including macro security levels, trusted locations, and digital signatures. However, these mechanisms have proved easy to bypass (Nissim et al. 2017). In addition, they mainly defend against macro viruses and perform only moderately well against maldocs (in this paper, maldoc refers to a malicious MS-Office document). More importantly, when a malicious document arrives, victims often open the attachment directly because they place too much trust in the defensive capability of their security equipment. Since such a document seldom exposes its malicious behavior, users can remain compromised for a long time without noticing. Figure 1 shows an example of an attack via malicious documents.

Fig. 1 An example of attacking via malicious document

Many studies have focused on the detection of maldocs. Currently, the two main approaches are static detection and dynamic detection (Yu et al. 2021). Static detection extracts the static features of a document, while dynamic detection determines whether a document is malicious by monitoring the behavior it exhibits at runtime. Compared with dynamic detection, static detection is fast and easy to deploy, but it is also easy to bypass through obfuscation. Moreover, it does not perform well when the attack relies on remote loading of malicious programs (Yu et al. 2021). As for dynamic detection, the vast majority of relevant research centers on the dynamic behavior of malicious JavaScript code in PDF documents (Nissim et al. 2014). However, these studies are not applicable to MS-Office documents (Nissim et al. 2014). The few other studies that focus on system calls made while a document is opened ignore indirect attacks, in which hackers use the maldoc only as a “man” in the middle and carry out the exploit in other processes, resulting in a blind spot for detection (Scofield et al. 2017; Meng and PlatPal 2017; Jiang et al. 2021). Furthermore, both static and dynamic detection focus mainly on discovering known threats and perform only moderately well on unknown threats or vulnerabilities, let alone those that expose implicitly malicious behaviors in APT attacks. For example, we uploaded a malicious file to VirusTotal (Report for prepared file in VirusTotal xxxx), a popular website that aggregates dozens of antivirus engines, and found that only 16 of 62 engines marked it as a security threat; on Threatcn, only 5 of 25 engines identified it as malicious (Report for prepared file in Threatcn xxxx).

To characterize such documents, we define a type of maldoc called the implicitly malicious document, or im-document, as the object of study in this paper. The term describes vulnerable MS-Word documents that expose few obvious malicious behaviors. We summarize the features of im-documents as follows:

  • Im-documents show few distinct malicious behaviors externally and seldom use mature attack methods. For instance, they usually do not directly escalate system privileges, add administrator users, or execute malicious programs through remote loading;

  • The exploitation in many im-documents may be incomplete. For example, during post-penetration, some malware or trojans fail to establish a reverse connection or execute commands because the traffic is intercepted by a monitoring device or the connection domain name has expired. The document remains dangerous, however, since the internal exploitation may already have been accomplished and the vulnerability is triggered as soon as the maldoc is opened.

  • Im-documents may evade most public antivirus engines; usually fewer than 30% of the engines on VirusTotal mark them as malicious.

It is necessary to study and analyze im-documents for the following reasons:

  • A previously blocked attack in an im-document may be reactivated in some cases. The maldoc mentioned above (Report for prepared file in VirusTotal xxxx) comes from a typical spear-phishing campaign and uses CVE-2021-40444 to achieve remote command execution, but it fails to penetrate further because the domain hosting the malicious code has expired. We attempted to visit this web service and found that the malicious website had been shut down. However, if a hacker succeeds in a domain takeover (Domain takeover report in Hackerone xxxx) and registers the domain again, the "dead" im-document may be resurrected. What is more, since most engines have already recorded its hash value and marked it as a maldoc, the reactivated im-document may in fact bring a secondary attack of "MD5 poisoning" (An update on MD5 poisoning. xxxx);

  • Studying im-documents may help to find unknown vulnerabilities or threats. In recent years, hackers have tended to create maldocs that exploit logical vulnerabilities in APT attacks. Once triggered, such a vulnerability does not corrupt program memory as a buffer overflow does, and it shows barely any obvious malicious behavior, which makes it difficult for a traditional IDS to discover the attack. New threat or attack patterns obtained in this way can supplement manual vulnerability mining;

  • The study may help us better understand the essential differences between malicious and benign samples. No matter how carefully it hides its true purpose, an im-document still behaves differently from normal documents at runtime. Our study in effect draws a clearer boundary between the two types, which also supports further research on maldoc detection.

This paper presents GLDOC, a GCN-based framework aimed at detecting implicitly malicious Microsoft Word files. GLDOC adopts dynamic analysis and focuses on covering the blind spots of past detection methods. Unlike studies that consider only the system call trace (Scofield et al. 2017; Meng and PlatPal 2017; Jiang et al. 2021), we capture all behaviors that occur while the maldoc is opened in the sandbox and also take the process trees into consideration. To extract deeper features from the dynamic behaviors, we introduce a Threat Evaluation Based Pooling Method to reconstruct the system call graph of Microsoft Word, and we use a Markov chain to describe the relations among processes so as to redraw the process trees. We then design a 2-channel GCN-based network to learn these 2 graphs respectively, together with a classifier that completes the document classification.

We choose deep learning based methods for two reasons: (1) Approaches based on collective and statistical information are not effective at detecting unknown attacks, whereas deep learning based methods can handle unseen malicious data thanks to their generalization ability and can further discover unknown vulnerabilities. (2) GCN suits the malicious document detection problem because it can handle graph structures. Graph-based neural networks are widely used for graph classification, which is exactly the task we rely on to distinguish maldocs from normal documents.

In summary, the contributions of this paper are as follows:

  • We focus on im-documents and summarize their characteristics. By providing a possible attack scenario, we also explain why research on im-documents is meaningful and necessary.

  • We design a GCN-based framework for maldoc detection. As far as we know, this is the first time a graph-based neural network has been applied in this field.

  • By considering both system calls and process trees, GLDOC covers the detection blind spots of past research and gains generalization ability from neural networks. Experiments show that GLDOC achieves a good balance between accuracy and false alarm rate, 95.33% and 4.33% respectively, outperforming other detection methods.

  • Our proposed framework is able to discover unknown threats. Deployed in a simulated attack scenario, it maintains stable, high detection accuracy and updatability when facing unknown vulnerabilities.

The rest of the paper is organized as follows. Section II introduces related work. Section III details the formulated problem and our proposed framework. Section IV presents our performance evaluation. Section V concludes the paper.

Related works

In recent years, malicious document detection has attracted the interest of many researchers. As mentioned above, detection methods can be divided into 2 categories: static detection and dynamic detection.

In static detection, researchers focus on collective and statistical information about files and extract features from the byte stream (Li et al. 2007; Liu et al. 2019), the content (Li et al. 2010; Hong et al. 2022) or structural information (Maiorca et al. 2012; Kim et al. 2016) to detect malicious documents. W.J. Li et al. (2007) build a feature library from byte stream statistics, process it with n-grams, and complete detection by measuring the distance between normal and malicious documents. Liu et al. (2019) compute the variance, energy spectrum and entropy of the byte stream as global features and then apply machine learning models to detect maldocs. Other studies look at embedded malicious code or the structure of files. Li et al. (2010) reconstruct the shellcode through feature extraction and try to find threats by comparing cosine similarity. Using tokenization and vectorization, Hong et al. (2022) extract plain-text features from a corpus of documents and attempt to discover malicious documents with machine learning methods such as SVM and DNN. Mimura and Taro (2020) focus on unknown threats in VBA macros and adopt LSI to calculate the similarity among different documents. Drawing on a deep understanding of PDF structure, Maiorca et al. (2012) present a feature extractor module together with an effective classifier. ALDOCX (Nissim et al. 2017) extracts features from the XML-based files obtained by unzipping MS-Office documents, aiming at accurate detection of unknown maldocs through active and online learning.

Static detection methods extract features from statistical information but share the problem of being easy for attackers to bypass. Moreover, they usually adopt shallow learning methods, most of which lack generalization ability. Though efficient and convenient, static detection sometimes ignores sequential information and code semantics, which makes it difficult to detect maldocs with unknown vulnerabilities.

In dynamic detection, researchers try to detect malicious documents by analyzing their malicious behaviors. Li et al. (2007) define several malicious behaviors for detecting maldocs, including abnormal DLL search order and pop-up dialog windows, and check whether the observed behavior matches an existing feature library. Because it is restricted to specific scenarios, this method can only supplement static analysis. Other methods focus on the system call trace and the process tree to detect malicious samples (Kosoresow and Hofmeyer 1997). Scofield et al. (2017) capture all system calls during the runtime of Adobe Reader and define each feature as < Program Name, Action, Object >. After collecting a comprehensive set of merged exemplar features, they propose a classifier to detect malicious digital documents. Xu et al. (2016) record all dynamic behaviors and construct a system call graph from them in order to detect malware on Android OS. They represent the graph in different ways, including histograms, N-grams and Markov chains, and finally use graph kernels to compute pairwise similarities for classification. Khan et al. (2017) present a framework that uses a process tree based temporal directed graph to detect malicious behaviors in Microsoft Windows.

These dynamic methods still have shortcomings. Methods that detect by system call trace or by process tree consider only a single aspect, leaving a blind spot because they cannot discover attacks that take place in other processes. Methods that detect by malicious behaviors sometimes consider only a limited set of behaviors and ignore the relations among them. Moreover, many existing methods target Android OS or PDF documents and are not applicable to MS-Office documents.

The proposed framework and methods

GLDOC consists of the following 3 modules:

  • Dynamic Behavior Capturing Module. We build a sandbox environment and use VMware and Process Monitor to capture the dynamic behaviors that occur when an MS-Office document is opened.

  • Graph Reconstruction Module. We first build graphs from the acquired traces and then reconstruct each of them: (i) we design a Threat Evaluation Based Pooling Method (TEBP Method) to reconstruct the System Call Graph of Microsoft Word (SCGMW); (ii) we use a Markov chain to describe the relations among processes so as to redraw the Graph of Process Trees (GPT).

  • Learning and Classification Module. We implement 2 channels that learn the two graphs above respectively. A classifier then determines whether the document is benign or malicious.

The overall framework is illustrated in Fig. 2. We also give a specific example based on CVE-2017-11882 in Fig. 3, showing how we handle the raw data from Process Monitor. Figure 3a shows a fragment of the system call data, while Fig. 3b shows a fragment of the process data. Figure 3c, d show examples of the graph representation. Figure 3e, f give the original graphs of SCGMW and GPT, respectively.

Fig. 2 The overview of components in GLDOC

Fig. 3 Raw data handling example of CVE-2017-11882

Dynamic behavior capturing module

We design an environment to capture the documents’ dynamic behavior; more technical details are given in the Evaluation section. In this environment, we use Process Monitor to acquire the whole behavior sequence over a period of runtime. We define an \(action\) as one line of the behavior sequence, which includes Time Sequence, Process Name, PID, Operation, Operation Object, TID, etc. To process the behavior sequence further, we first delete the lines whose operation failed, taking care to keep the original order of all remaining actions. Then we extract PID and TID to construct GPT, and extract Operation and Operation Object to construct SCGMW. For each document, we acquire over \(10^{6}\) lines of actions on average, about \(2\times 10^{5}\) of which come from MS-Office.
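The filtering and splitting steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the trace was exported as CSV with the column names shown in `SAMPLE` (including an optional TID column), and it treats any `Result` other than `SUCCESS` as a failed operation, which is a simplifying assumption. The function names `load_actions` and `split_traces` are hypothetical.

```python
import csv
import io

def load_actions(csv_text):
    """Parse a Process Monitor style CSV export into ordered action dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    actions = []
    for row in reader:
        # Assumption: only "SUCCESS" counts as a successful operation.
        if row["Result"] != "SUCCESS":
            continue
        actions.append(row)  # appending keeps the original order
    return actions

def split_traces(actions):
    """Return (process trace for GPT, system call trace for SCGMW)."""
    proc_trace = [(a["Process Name"], a["PID"], a["TID"]) for a in actions]
    call_trace = [(a["Operation"], a["Path"]) for a in actions]
    return proc_trace, call_trace

# Invented sample rows, for illustration only.
SAMPLE = r"""Time of Day,Process Name,PID,TID,Operation,Path,Result
10:00:00.001,WINWORD.EXE,4120,4300,RegOpenKey,HKCU\Software\Microsoft\Office,SUCCESS
10:00:00.002,WINWORD.EXE,4120,4300,CreateFile,C:\Users\a\doc.rtf,NAME NOT FOUND
10:00:00.003,WINWORD.EXE,4120,4304,ReadFile,C:\Users\a\doc.rtf,SUCCESS
"""

actions = load_actions(SAMPLE)
proc_trace, call_trace = split_traces(actions)
```

Here the failed `CreateFile` line is dropped, and the two remaining actions keep their relative order in both traces.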

Graph reconstruction module

The proposed framework has 2 channels: the upper one produces SCGMW and the lower one produces GPT. In each channel, we first collect the dynamic behaviors to build the original graph and then reconstruct it as the input of the next module. Pre-processing is needed in the reconstruction because malicious behaviors make up only a tiny fraction of all captured behaviors, most of which are benign. This data processing strengthens the graph features and helps our model better understand the large volume of data.

System call graph of MS-Word

We extract the action trace of winword.exe, one action per line, and construct a directed graph \({G}_{sys}=({V}_{sys},{E}_{sys})\) from it, where \({V}_{sys}=\{{v}_{{s}_{i}}\}\) denotes the node set and \({E}_{sys}=\{ {\overrightarrow{e}}_{{v}_{{s}_{i}}{v}_{{s}_{j}}}\}\) denotes the edge set. Each node is defined as < Operation, Operation Object >, i.e., the Operation combined with its Operation Object, and \({\overrightarrow{e}}_{{v}_{{s}_{i}}{v}_{{s}_{j}}}\) corresponds to a directed connection between an adjacent node pair, from the upper (earlier) node \({v}_{{s}_{i}}\) to the lower (later) node \({v}_{{s}_{j}}\).
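As a rough sketch of this construction, the snippet below turns an ordered list of (Operation, Operation Object) pairs into a node list and a list of directed edges between consecutive actions. The function name `build_call_graph` and the toy trace are illustrative assumptions; repeated edges are deliberately kept so that they can be counted later.

```python
def build_call_graph(call_trace):
    """Build nodes and directed edges from an ordered (operation, object) trace."""
    nodes = []   # unique nodes in first-seen order
    index = {}   # node -> integer id
    edges = []   # one (src, dst) pair per adjacent action pair, repeats kept
    prev = None
    for node in call_trace:
        if node not in index:
            index[node] = len(nodes)
            nodes.append(node)
        if prev is not None:
            edges.append((index[prev], index[node]))
        prev = node
    return nodes, edges

# Toy trace: two distinct actions that alternate.
trace = [("RegOpenKey", "K"), ("ReadFile", "F"),
         ("RegOpenKey", "K"), ("ReadFile", "F")]
nodes, edges = build_call_graph(trace)
```

For the toy trace, the graph has two nodes, and the edge from node 0 to node 1 appears twice.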

The reconstruction of SCGMW has two parts: node encoding and edge weighting. We first introduce the TEBP Method to encode the nodes of SCGMW. The main idea of the TEBP Method is to accomplish graph pooling through a threat evaluation of each node. Pooling is needed because, when we measured the size of SCGMW for 500 randomly selected documents, almost all of the graphs had over 5000 nodes, whereas GCNs perform much better on graphs with a small number of nodes than on graphs with a large number of nodes (Tan et al. 2023; Kipf and Welling 2016).

To describe the risk level of each node, we first divide the Operation Object into 3 items: Registry Key, File & Directory, and Network Connection. We estimate each object’s risk level and map it to normal, low-risk or high-risk behavior. The items are described below and shown in Tables 1, 2 and 3.

Table 1 Vulnerable registry key examples
Table 2 Top 4 file & directory
Table 3 Network connection examples

Registry Key. Registry keys are the most common target for post-penetration and long-term control. We collect 26 different vulnerable registry keys and use the labels 1–4 for normal keys, low-risk keys, middle-risk keys and high-risk keys respectively (Long-lasting exploitation of backdoors xxxx; Security basis: response and feedback xxxx). For the high-risk level, we collect 16 keys, which relate to account adding, auto-start and scheduling, privilege escalation, external program loading, etc. We believe an attack is highly likely when these high-risk keys are edited or modified while the document is being opened. For the middle-risk level, we collect 5 keys, which relate to domain hijacking, DLL search order, folder sharing, etc. Finally, for the low-risk level, we collect 5 keys concerning application security, which relate to homepage tampering, disk hiding, desktop icon hiding, etc.

File & Directory. Hackers can disguise malware under any filename or extension, and in APT attacks they even replace existing DLLs or system files with malicious ones. It is therefore impossible to detect attacks through file and directory operations alone, and we believe all these operations should be assigned an equal risk level. Counting over 500 randomly selected benign documents from the dataset (Section IV provides more details of the dataset), we found that most File & Directory objects are located in fewer than 10 paths. Therefore, we label the top 9 paths and a catch-all group for the remaining paths with 1–10 to tag the File & Directory item.

Network Connection. Benign MS-Office documents seldom launch suspicious network connections. We use the labels 1–3 for normal, low-risk and high-risk connections respectively. We first assign 2 to internal network connections, as the internal network is a common target in lateral movement. We then divide the remaining connections into 2 cases: (1) The target is a bare IP address. We look up the IP address with the WHOIS service and check whether it belongs to Microsoft Corporation. If so, we assign 1, meaning a normal connection; otherwise, we assign 3, meaning a high-risk connection. (2) The target is a domain name. We check the domain against a list of legal websites, which we collected from the scope of MSRC (Microsoft bounty program xxxx). Considering both the primary domain and its subdomains, we assign 1 to connections on the legal request list and 3 to the remaining ones, which are treated as malicious.

We then complete the node encoding; Table 4 gives more details. We first construct each action as a node and represent it as “Operation_ObjectItem_RiskLevel”. To encode it, we extract and build all nodes from 800 documents randomly collected from the dataset, half benign and half malicious, and thereby build a node dictionary of 78 words. The goal is to ensure that identical actions receive the same number across different documents. We encode each node as \(({R}_{k},{R}_{f},{R}_{c})\), where \({R}_{k}\) denotes the risk level of Registry Key, \({R}_{f}\) denotes the label in File & Directory, and \({R}_{c}\) denotes the risk level of Network Connection.
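The labelling rules for network connections can be sketched as below. Several parts are assumptions made for illustration: `LEGAL_DOMAINS` is a tiny hypothetical stand-in for the MSRC-derived list, "internal" is approximated by private IP ranges, and the WHOIS ownership check is passed in as a callable (`ip_owner_is_microsoft`) rather than performed against a real service.

```python
import ipaddress

# Hypothetical stand-in for the list of legal websites collected from MSRC.
LEGAL_DOMAINS = {"microsoft.com", "office.com"}

def is_ip(target):
    """Return True if the target parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(target)
        return True
    except ValueError:
        return False

def connection_risk(target, ip_owner_is_microsoft=lambda ip: False):
    """Return 1 (normal), 2 (low-risk) or 3 (high-risk) for a connection target."""
    if is_ip(target):
        # Assumption: private address ranges approximate the internal network.
        if ipaddress.ip_address(target).is_private:
            return 2
        # The callable stands in for a WHOIS ownership lookup.
        return 1 if ip_owner_is_microsoft(target) else 3
    host = target.lower().rstrip(".")
    # Accept the primary domain itself or any of its subdomains.
    for domain in LEGAL_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return 1
    return 3
```

Matching on `"." + domain` rather than on a bare suffix keeps look-alike names such as `evil-microsoft.com` out of the normal class.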
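A small sketch of the encoding step follows. Since Table 4 is not reproduced here, the details are assumptions: the item names (`RegistryKey`, `File`, `Network`), the helper names, and in particular the choice to fill the slots of \(({R}_{k},{R}_{f},{R}_{c})\) that do not apply to a node with 0.

```python
def node_word(operation, item, risk):
    """Format an action as 'Operation_ObjectItem_RiskLevel'."""
    return f"{operation}_{item}_{risk}"

def build_dictionary(words):
    """Assign a stable integer id to each distinct word, in first-seen order."""
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return vocab

def node_feature(item, risk):
    """Return (R_k, R_f, R_c); slots that do not apply stay 0 (an assumption)."""
    r_k = risk if item == "RegistryKey" else 0
    r_f = risk if item == "File" else 0
    r_c = risk if item == "Network" else 0
    return (r_k, r_f, r_c)

# Toy words; a real dictionary would be built from many documents.
words = [
    node_word("RegSetValue", "RegistryKey", 4),
    node_word("ReadFile", "File", 1),
    node_word("RegSetValue", "RegistryKey", 4),
]
vocab = build_dictionary(words)
```

Because the dictionary is built once and reused, the same action maps to the same id in every document, which is the property the text asks for.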

Table 4 Example in SCGMW reconstruction

After a newly arrived behavior sequence has been encoded, we set the edge weights according to Eq. (1) to better describe SCGMW.

$${W}_{{s}_{i}{s}_{j}}=Nom\left\{T({\overrightarrow{e}}_{{v}_{{s}_{i}}{v}_{{s}_{j}}})\right\}$$
(1)

\(T({\overrightarrow{e}}_{{v}_{{s}_{i}}{v}_{{s}_{j}}})\) denotes the number of edges from \({v}_{{s}_{i}}\) to \({v}_{{s}_{j}}\) in the sequence. We define \({W}_{{s}_{i}{s}_{j}}\) as the normalized weight of \({\overrightarrow{e}}_{{v}_{{s}_{i}}{v}_{{s}_{j}}}\). Here, we use inverse proportion as the normalization function \(Nom\) and map the weight to the range 0–1, since the fewer links appear within a node pair, the more likely the edge is to be an attack-related connection. Finally, we delete the duplicate edges within each node pair and re-link the pair with a single weighted edge to construct the final system call graph.
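One simple reading of Eq. (1) is sketched below. The text only says that \(Nom\) is an inverse proportion mapped to 0–1; taking the weight as exactly \(1/T\) is an assumption made here for illustration, as is the function name `edge_weights`.

```python
from collections import Counter

def edge_weights(edges):
    """Collapse repeated directed edges and weight each by 1 / count."""
    counts = Counter(edges)  # T(e) for every directed node pair
    # Assumed form of Nom: weight = 1 / T, so rare edges approach 1.
    return {pair: 1.0 / t for pair, t in counts.items()}

# Toy edge list: (0, 1) occurs three times, (1, 0) once.
weights = edge_weights([(0, 1), (1, 0), (0, 1), (0, 1)])
```

With this choice the rare edge (1, 0) gets weight 1.0, while the frequent edge (0, 1) is down-weighted to one third.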

Graph of process tree

As mentioned above, most attacks happen within Microsoft Office itself, where most vulnerabilities are located. However, an attack may also take place in other processes, using MS-Office merely as a “springboard”. A typical example in Office is CVE-2017-11882, which is located in winword.exe but then jumps to MathType to conduct the subsequent attack. Therefore, in our framework, GPT is considered for malicious document detection as a supplement to the analysis of the system call graph.

We denote the process set as \(P=\{{p}_{i}\}\), representing all processes captured in the system call trace. After extracting the PID and TID trace from the dynamic behavior sequence, we construct a directed graph \({G}_{pro}=({V}_{pro},{E}_{pro})\), where \({V}_{pro}=\{{v}_{{p}_{i}} \mid {p}_{i}\in P\}\) is the node set and \({E}_{pro}=\{{\overrightarrow{e}}_{{v}_{{p}_{i}}{v}_{{p}_{j}}} \mid {v}_{{p}_{i}},{v}_{{p}_{j}}\in {V}_{pro},\ {v}_{{p}_{i}}\text{ is the parent process of }{v}_{{p}_{j}}\}\) is the edge set. To ensure that an identical process keeps the same encoding in different documents, we build a dictionary to mark processes, as we did for the system call graph.

To better describe the relations among processes, we use the transition probability of a Markov chain (Anderson et al. 2011) to set the edge weights in GPT. Unlike the previous study (Xu et al. 2016), we apply the Markov chain to process calls instead of system calls. We assume that: (1) benign documents always act similarly; (2) a child process correlates strongly with its parent but is independent of its grandparent.

Based on the above assumptions, we define the edge weight \({W}_{{p}_{i}{p}_{j}}\) as the transition probability from \({p}_{i}\) to \({p}_{j}\).

$${W}_{{p}_{i}{p}_{j}}=\frac{Num({p}_{j})}{{\sum }_{t\in C({p}_{i})}Num(t)}$$
(2)

\(C({p}_{i})\) denotes the set of child processes of \({p}_{i}\), and \(Num(t)\) is the number of times \({p}_{i}\) calls process \(t\). \({W}_{{p}_{i}{p}_{j}}\) describes the probability that \({p}_{i}\) calls \({p}_{j}\) when the document is opened.

As for the node features, we use random values with the same dimension as in SCGMW as the initial input, because GPT mainly focuses on the call relations among processes, and system calls made outside MS-Office cannot be used here as a measure for detecting malicious behaviors.
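Eq. (2) amounts to normalizing, for each parent, the number of times it calls each child. A minimal sketch, assuming the trace has already been reduced to a list of (parent, child) call events; the process names below are placeholders and the function name is hypothetical:

```python
from collections import Counter, defaultdict

def transition_weights(spawn_events):
    """Compute W[(parent, child)] = Num(child) / sum of Num over the parent's children."""
    counts = defaultdict(Counter)
    for parent, child in spawn_events:
        counts[parent][child] += 1
    weights = {}
    for parent, children in counts.items():
        total = sum(children.values())
        for child, n in children.items():
            weights[(parent, child)] = n / total
    return weights

# Placeholder events: winword.exe calls child_a twice and child_b once.
events = [
    ("winword.exe", "child_a.exe"),
    ("winword.exe", "child_b.exe"),
    ("winword.exe", "child_a.exe"),
    ("child_a.exe", "child_c.exe"),
]
w = transition_weights(events)
```

The outgoing weights of each parent sum to 1, as expected for transition probabilities.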

Learning and classification module

GCN was first proposed in 2016 to solve classification problems on graphs. Here, we use GCN to distinguish malicious documents from benign ones. Before discussing our framework in detail, we briefly introduce how GCN works.

Basically, a GCN is a type of convolutional neural network that works directly on graphs and takes advantage of their structural information. For each node, GCN gathers information from all its neighbors and from the node itself. Equations (3)–(5) describe how information aggregates from layer to layer.

$${l}_{v}^{0}={x}_{v}$$
(3)
$${l}_{v}^{k}=\sigma \left({W}_{k}{\sum }_{u\in N(v)}\frac{{l}_{u}^{k-1}}{\left|N(v)\right|}+{B}_{k}{l}_{v}^{k-1}\right), \forall k\in \{1,\dots ,K\}$$
(4)
$${z}_{v}={l}_{v}^{K}$$
(5)

Suppose we construct a K-layer GCN. The aggregation output is described by Eq. (4), where \({l}_{v}^{k}\) is the output of the kth aggregation for node \(v\). \(\sigma\) is the activation function, in most cases \(Sigmoid\) or \(Relu\). \(N(v)\) denotes the set of all neighbors of node \(v\), and \({\sum }_{u\in N(v)}\frac{{l}_{u}^{k-1}}{\left|N(v)\right|}\) is the average of the neighbors' outputs from the previous layer. Consequently, Eq. (4) shows that the kth-layer output of node \(v\) is computed from the (k-1)th-layer outputs of \(v\) itself and of all its neighbors. \({W}_{k}\) and \({B}_{k}\) are obtained through training and are shared by all nodes within a layer. Equations (3) and (5) describe the first and the final aggregation: the 0th-layer state equals \({x}_{v}\), the feature of node \(v\), and the final output \({z}_{v}\) is the Kth-layer aggregation.

In our framework, we design 2 channels to learn SCGMW and GPT respectively. Each channel implements a multi-layer GCN with \(Relu\) as the activation function. We use the Adam algorithm as the optimizer and CrossEntropyLoss as the loss function. Since the goal of our work is graph classification, we collect the outputs of all nodes given by Eq. (5) to represent the whole graph. We choose the Max function as the aggregation method instead of averaging all nodes' outputs, because the influence of risk should be magnified rather than weakened.
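To make Eqs. (3)–(5) concrete, the following dependency-free sketch applies the aggregation rule to a tiny graph with plain Python lists. It is an illustration of the equations only, not the PyTorch Geometric implementation used in the paper; the helper names are hypothetical, \(Relu\) is used as \(\sigma\), and treating an isolated node as having a zero neighbor term is an assumption.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def relu(v):
    return [max(0.0, x) for x in v]

def gcn_layer(features, neighbors, W, B):
    """One application of Eq. (4): relu(W * mean(neighbor states) + B * own state)."""
    out = {}
    for v, own in features.items():
        nbrs = neighbors.get(v, [])
        dim = len(own)
        if nbrs:
            mean = [sum(features[u][d] for u in nbrs) / len(nbrs) for d in range(dim)]
        else:
            mean = [0.0] * dim  # isolated node: no neighbor term (an assumption)
        agg = [a + b for a, b in zip(matvec(W, mean), matvec(B, own))]
        out[v] = relu(agg)
    return out

def gcn_forward(x, neighbors, layers):
    """Eqs. (3)-(5): start from x, apply each (W_k, B_k) in turn, return z."""
    state = x
    for W, B in layers:
        state = gcn_layer(state, neighbors, W, B)
    return state

# Two connected nodes, one layer, identity weights chosen for easy checking.
I2 = [[1.0, 0.0], [0.0, 1.0]]
x = {0: [1.0, 0.0], 1: [0.0, 2.0]}
z = gcn_forward(x, {0: [1], 1: [0]}, [(I2, I2)])
```

With identity weights, each node's output is simply its own feature plus its neighbor's, so both nodes end at `[1.0, 2.0]`.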
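The difference between the Max readout and an average readout can be seen on a toy example. The sketch below is illustrative only; the function names and the node embeddings are invented, with one node carrying a large value in the second dimension to mimic a single risky action among many benign ones.

```python
def max_readout(z):
    """Element-wise maximum over all node embeddings."""
    return [max(col) for col in zip(*z.values())]

def mean_readout(z):
    """Element-wise average over all node embeddings."""
    return [sum(col) / len(col) for col in zip(*z.values())]

# Invented embeddings: node 2 has one strongly activated dimension.
z = {0: [0.0, 0.0], 1: [0.5, 0.0], 2: [0.25, 6.0]}
graph_max = max_readout(z)
graph_mean = mean_readout(z)
```

The maximum keeps the full value 6.0 from the single risky node, whereas the average dilutes it to 2.0, which matches the stated motivation for preferring Max.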

Evaluation

In this section, we evaluate and discuss the effectiveness of the proposed framework. Sect. "Environment setup" describes how we create the sandbox environment, and Sect. "Dataset creation" presents the dataset creation, while Sect. "Experimental design" shows the experimental design.

Environment setup

We create a sandbox environment to capture the dynamic behavior. To expose as much malicious behavior as possible, we choose Microsoft Office Pro Plus 2013 for Windows v15.0.5441 as the affected edition. We use Process Monitor V3.5.3 as the capturing tool and VMware® Workstation Pro 15.5.6 build-16341506 to run the virtual machine. To satisfy exploitation conditions, we keep the installed software and Windows components at affected versions, such as .NET 4.5.2 (CVE-2017-8759) and Internet Explorer 11 (CVE-2020-0674). The main parameters of the environment are shown in Table 5.

Table 5 Edition and parameter of the sandbox

Since the research object of this paper is vulnerable documents, we disable macros to prevent interference with the capturing process. During capture, we record 59 kinds of actions from all processes except Process Monitor itself, including file operations, registry operations, network connections, process information, and so on.

After opening the document, we capture for a runtime of 10 s. Although some maldocs do not launch an attack immediately, the exploitation has already taken place once the document is opened.

Moreover, the deep learning module in GLDOC is developed in PyTorch 1.10.0 (Pytorch. xxxx), and the neural networks are implemented with PyTorch Geometric 2.0.3 (Pytorch geometric xxxx). The experiments are performed on a workstation with 256 GB of memory, an Intel Xeon E5-2698 CPU with 20 cores at 2.2 GHz, and 4 Tesla V100 GPUs with 128 GB of global memory.

Dataset creation

Because no public dataset is suitable for our experiments, we create a dataset containing 3 types of documents: im-documents, common maldocs and benign documents. These documents come in 3 file formats: DOC, DOCX and RTF.

It is hard to pick out im-documents accurately because of their sophisticated structure. Since they may use obfuscation, encryption or exploitation of unknown vulnerabilities, im-documents are difficult to find by manually analyzing documents in a sandbox alone. To make the dataset as realistic as possible, we collect im-documents from 2 sources: (1) We download 986 sample files from VirusTotal that are marked as malicious by 20%–30% of antivirus engines. We set the lower bound at 20% because a document flagged by fewer engines has a high probability of being a false alarm, so we do not include it in our dataset. All these documents are tagged as vulnerable and carry detailed CVE IDs. (2) We use existing exploits to produce 200 im-documents ourselves. These im-documents are built on 4 vulnerabilities and designed with implicitly malicious behaviors such as opening a calculator, connecting to a normal website, and downloading normal files from a remote server. Figure 4 shows the vulnerability composition of the im-documents.

Fig. 4 Vulnerability composition in im-documents

Besides the above im-documents, we also collect common maldocs (marked as malicious by more than 60% of engines) from VirusTotal as a contrast. This set consists of 808 vulnerable documents with detailed CVE IDs from 2017 to 2021.

Meanwhile, we crawl public search engines and pick 2632 documents as our benign samples, from Baidu (24.32%), Google (68.31%) and Bing (7.37%). To ensure that none of these documents is malicious, we scan them all in advance with 360 Total Security (Security xxxx).

Experimental design

To evaluate our proposed framework and methods, we design the following experiments.

Detection on im-documents and common maldocs

In this experiment, we show the detection ability of our proposed framework and methods. We first introduce the evaluation metrics as follows.

$$FAR=\frac{FP}{FP+TN}$$
(6)
$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(7)

We denote True Positive (TP) as the number of maldocs that are correctly detected, True Negative (TN) as the number of benign documents that are correctly classified, False Negative (FN) as the number of maldocs that are predicted as benign ones, and False Positive (FP) as the number of benign ones that are classified as maldocs. Then, we use Eq. (6) to describe the detection rate of our proposed framework, while Eq. (7) to describe false alarm rate in detection.
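Eqs. (6) and (7) translate directly into code. The confusion-matrix counts in the example are invented for illustration and are not results from the paper.

```python
def far(fp, tn):
    """False alarm rate, Eq. (6): benign documents flagged as maldocs."""
    return fp / (fp + tn)

def accuracy(tp, tn, fp, fn):
    """Accuracy, Eq. (7): share of all documents classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Invented counts: 100 maldocs (90 caught) and 100 benign documents (5 flagged).
example_far = far(fp=5, tn=95)
example_acc = accuracy(tp=90, tn=95, fp=5, fn=10)
```

Note that FAR depends only on the benign documents, so a detector can have high accuracy and still raise many false alarms when benign samples are a small share of the test set.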

This experiment aims to verify the detection performance on vulnerable documents. The data distribution is shown in Table 6. We divide the data into 3 subsets, where D0 denotes the training and validation set, and D1 and D2 are testing sets for different file types. Considering the limited size of our dataset, we adopt 10-fold cross validation during training to avoid overfitting, so the ratio of the training set to the validation set is 9:1 in each round. Furthermore, we vary the number of GCN layers in both channels simultaneously to find the best experimental results.
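The 9:1 split that results from 10-fold cross validation can be sketched as follows. This is an illustration in plain Python rather than our training code. The function name `k_fold_splits`, the shuffling step and the fixed seed are assumptions, and the sketch does not stratify by class.

```python
import random


def k_fold_splits(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, val_indices) pairs for k-fold cross validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: samples are shuffled once

    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


for round_no, (train, val) in enumerate(k_fold_splits(100), start=1):
    print(round_no, len(train), len(val))
```

Each sample appears in the validation set exactly once over the 10 rounds, which is why the scheme suits a dataset of limited size.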

Table 6 Dataset creation

Figure 5 shows the accuracy and FAR on the different subsets. In general, GLDOC performs better on im-documents as the number of GCN layers increases, but the results remain stable once we adopt 5 or more layers. Figure 5 also shows that, compared with common maldocs, our proposed methods cannot decrease the FAR on im-documents as effectively, with a best result of 4.33% on D1. After analyzing the FAR results, we find that GLDOC has a relatively high misidentification rate on im-documents. This is perhaps because the structural difference in action traces between im-documents and benign documents is much smaller than that between common maldocs and benign ones. On common maldocs, our framework achieves a higher accuracy rate across the different layer settings, namely 95.37–97.39%. In fact, for common vulnerable documents a shallow GCN is enough to capture the features, as adding layers brings no gain in accuracy rate. Finally, we choose a 5-layer GCN for GLDOC, since it offers the best overall balance between accuracy rate and FAR, namely 95.33% and 4.33%.

Fig. 5 Accuracy and FAR on different subsets

Improvements of blind spot among different methods

In this experiment, we show how GLDOC reduces the possible blind spots of maldoc detection. To measure and demonstrate our framework’s contribution, we first design 4 other methods for comparison, based on the main ideas of related works: (1) SCGMW method. Detection from the system call trace is one of the most popular approaches in related works (Scofield et al. 2017; Xu et al. 2016; Kosoresow and Hofmeyer 1997). We adopt this idea by keeping only the upper channel in the deep learning phase, and denote it as the SCGMW method, which stands for detection through the system call graph of MS-Word. (2) GPT method. Similarly, detection from the process tree is a common approach adopted in many studies (Khan et al. 2017). We keep only the lower channel in the deep learning phase and denote it as the GPT method, which stands for detection through the graph of process trees. (3) LSTM. Deep learning methods have been applied to malware detection for years (Jiang et al. 2021; Kim et al. 2016). Since we found no similar research on detecting vulnerable MS-Word documents with dynamic analysis, we design an LSTM method to compare with the graph-based GCN. An LSTM model handles input in time steps, and a system call trace is exactly this kind of temporal information. Given the feature dictionary, we first complete node pooling for each action with the TEBP method, and then feed the result into a simple LSTM to detect maldocs. The LSTM network is built with torch.nn from the PyTorch toolkit. We use the hidden states from all time steps as the output of the LSTM and connect them to an FC layer as a classifier. For the optimizer and loss function, we keep the same choices as for the GCN. (4) Text Feature based SVM (TFSVM). To further demonstrate the necessity of graph structure extraction, we choose an SVM, which focuses only on text feature extraction, as the final comparison method. TFSVM originates from statistical methods and has long been popular for maldoc detection (Li et al. 2007; Liu et al. 2019; Hong et al. 2022). Unlike the above 3 methods, it processes the whole original sequence from ProcessMonitor (including MS-Office and other processes) and uses the TEBP method to estimate the risk level of each action. Having acquired each action’s risk level, we build the final feature as \(<{N}_{k},{N}_{f},{N}_{c},{N}_{kt},{N}_{km},{N}_{kl},{N}_{ct},{N}_{cl}>\), where \({N}_{k}\), \({N}_{f}\), \({N}_{c}\) denote the numbers of behaviors in Registry Key, File & Directory and Networking Connection within the whole sequence, \({N}_{kt}\), \({N}_{km}\), \({N}_{kl}\) denote the numbers of high-, medium- and low-risk actions in Registry Key, and \({N}_{ct}\), \({N}_{cl}\) denote the numbers of high- and low-risk actions in Networking Connection.
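To make the TFSVM feature concrete, the following sketch builds the 8-dimensional count vector from a list of actions. It assumes that each action has already been reduced to a (category, risk level) pair. The category and risk labels, the function name `tfsvm_features` and the sample trace are all hypothetical, and the risk levels would in practice come from the TEBP method, which is not reproduced here.

```python
from collections import Counter

# Hypothetical labels for the three behavior categories.
REGISTRY, FILE, NETWORK = "registry", "file", "network"


def tfsvm_features(actions):
    """Build <N_k, N_f, N_c, N_kt, N_km, N_kl, N_ct, N_cl> from (category, risk) pairs."""
    actions = [tuple(action) for action in actions]
    by_category = Counter(category for category, _ in actions)
    by_pair = Counter(actions)
    return [
        by_category[REGISTRY],          # N_k:  all Registry Key behaviors
        by_category[FILE],              # N_f:  all File & Directory behaviors
        by_category[NETWORK],           # N_c:  all Networking Connection behaviors
        by_pair[(REGISTRY, "high")],    # N_kt: high-risk Registry Key actions
        by_pair[(REGISTRY, "medium")],  # N_km: medium-risk Registry Key actions
        by_pair[(REGISTRY, "low")],     # N_kl: low-risk Registry Key actions
        by_pair[(NETWORK, "high")],     # N_ct: high-risk Networking Connection actions
        by_pair[(NETWORK, "low")],      # N_cl: low-risk Networking Connection actions
    ]


# A made-up trace of seven actions, for illustration only.
trace = [
    (REGISTRY, "high"), (REGISTRY, "low"), (REGISTRY, "low"),
    (FILE, "low"), (FILE, "medium"),
    (NETWORK, "high"), (NETWORK, "low"),
]
print(tfsvm_features(trace))
```

Because the vector keeps only counts, it discards the order and call relations of the actions, which is the structural information the graph-based channels are designed to retain.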

We evaluate the above 4 methods as well as our proposed one on datasets D0-D2. During training and testing, we keep the same conditions as in experiment 4.3.1, including the ratio of the training set to the validation set, the optimizer and the loss function. Figure 6 shows the accuracy rate and FAR for the different methods on D0-D2.

Fig. 6 Accuracy and FAR for different methods on D0-D2

Figure 6 indicates that: (1) Both the SCGMW and GPT methods have blind spots in detecting im-documents, in terms of accuracy rate (SCGMW: 72%, GPT: 83%) and FAR (SCGMW: 21%, GPT: 13%). Our analysis of the results shows that the SCGMW method cannot detect most maldocs whose exploitation is triggered outside MS-Office, such as CVE-2017-11882 and CVE-2017-8759. The GPT method has the opposite weakness: it focuses only on process relations and explores nothing within MS-Office, as the analysis confirms. (2) GLDOC outperforms the other methods on both accuracy rate and FAR, while the GPT method works relatively better among the rest. A possible reason is that maldocs usually exploit through command execution in Powershell.exe or CMD.exe, which may leave call relations as a hint for detection. (3) TFSVM and LSTM do not perform well on this task. Since the TEBP method focuses more on risk-related information, it seems unsuitable for NLP or machine learning methods, which lose most of the semantic information during pre-processing. Furthermore, LSTM cannot handle long input sequences well, whereas our trace data always exceeds 5000 lines and sometimes even 10,000 lines. In addition, the input data for TFSVM is of poor quality, because the data differs little between im-documents and benign ones and is sometimes nearly identical, which is exactly the main characteristic of im-documents.

Detection on unknown vulnerabilities in a simulated 5-day attacking scenario

To further test the detection ability on unknown vulnerabilities, we simulate a 5-day attacking scenario with multiple rounds of training and testing. We first explain how we evaluate 0-day attacks in this experiment. Listing CVE IDs in chronological order, we treat a vulnerability with a larger CVE ID as a 0-day threat to a framework that has been trained only on vulnerabilities with smaller IDs. For example, if we train the framework only with CVE-2017-11882 today and it encounters CVE-2018-8174 tomorrow, the latter can be regarded as a “0-day” attack, because the trained framework has never seen this type of attack before.
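The ordering rule can be sketched as a small helper. The names `cve_sort_key` and `is_zero_day` are hypothetical. Comparing the year first and the sequence number second is our reading of "bigger CVE ID", and it is an assumption that this ordering approximates the order of disclosure.

```python
import re

_CVE_PATTERN = re.compile(r"^CVE-(\d{4})-(\d+)$")


def cve_sort_key(cve_id: str) -> tuple:
    """Turn 'CVE-YYYY-NNNN' into a (year, sequence) pair that sorts numerically."""
    match = _CVE_PATTERN.match(cve_id)
    if match is None:
        raise ValueError(f"not a CVE ID: {cve_id!r}")
    return int(match.group(1)), int(match.group(2))


def is_zero_day(candidate: str, trained_cves) -> bool:
    """True if the candidate's ID is larger than every ID used in training."""
    key = cve_sort_key(candidate)
    return all(key > cve_sort_key(seen) for seen in trained_cves)


trained = ["CVE-2017-11882"]
print(is_zero_day("CVE-2018-8174", trained))  # newer ID than the training set
print(is_zero_day("CVE-2017-8759", trained))  # older ID than the training set
```

The numeric key matters: a plain string comparison would place CVE-2017-11882 before CVE-2017-8759, although 8759 is the smaller sequence number.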

We build a 5-day case to simulate a real attack. The details of the experimental design are as follows: (1) We choose 4 vulnerabilities as the training set for the first day, and let our framework be attacked by one newly arriving vulnerability per day. Considering that new exploits may borrow from old vulnerabilities, the simulated attacks arrive strictly in the order of vulnerability disclosure, to build a more realistic attacking scenario. (2) We train at the beginning and test once a new attack arrives. At the end of the day, we retrain with the previous dataset combined with the newly arrived data, and then repeat the process the next day. Note that we do not use incremental learning here, but train and test in batches, because we believe the newly arrived maldocs should be kept in the same state and tested together to reveal the real detection ability of our proposed framework. (3) To simulate real traffic conditions and enlarge the training dataset, we add some common maldocs while maintaining at least 50% of im-documents in our dataset. (4) We first test with im-documents and 50% of the benign documents, and then test with common maldocs and the remaining 50% of the benign documents, so that we can compute the Accuracy and FAR for each group separately.
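The train, test and retrain cycle in steps (1) and (2) can be sketched as a schedule generator. The function name `rolling_schedule` and the placeholder identifiers V1 to V8 are hypothetical. The sketch also assumes that Day 1 is used for training only and that new vulnerabilities arrive from Day 2 onward.

```python
def rolling_schedule(ordered_vulns, initial_train: int = 4):
    """Yield (day, known_vulns, new_vuln) for each day a new vulnerability arrives.

    ordered_vulns must already be sorted by disclosure order.
    """
    known = list(ordered_vulns[:initial_train])  # Day 1: initial training set
    for offset, new_vuln in enumerate(ordered_vulns[initial_train:]):
        day = offset + 2  # assumption: the first attack arrives on Day 2
        # The model trained on `known` is tested against the unseen `new_vuln`.
        yield day, tuple(known), new_vuln
        # End of day: retrain in batch on the previous data plus the new data.
        known.append(new_vuln)


# Placeholder identifiers, not real CVE IDs.
vulns = [f"V{i}" for i in range(1, 9)]
for day, known, new_vuln in rolling_schedule(vulns):
    print(f"Day {day}: trained on {len(known)} vulnerabilities, tested on {new_vuln}")
```

Retraining on the whole accumulated set each evening corresponds to the batch setting described in step (2), as opposed to incremental learning, where only the new samples would be used to update the model.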

The timeline and experimental design are shown in Fig. 7.

Fig. 7 Details in 5-day attacking scenario

Analyzing the results of this experiment, we find that our proposed framework can effectively discover the unknown attacks, with an accuracy rate of 91.5–93.0% on im-documents and 96.0–97.0% on common maldocs from Day 2 to Day 5. Although our framework has not seen the newly arrived samples before, it can still detect them because a sample’s malicious behaviors are always located in the system call trace or the process relations. However, GLDOC still has a relatively high FAR on im-documents, just as in experiment 4.3.1, which means that it wrongly recognizes benign documents as malicious ones.

Discussion and conclusion

In this paper, we define im-documents to describe those maldocs that expose implicitly malicious behaviors while escaping most public antivirus engines. Based on an analysis of their characteristics and attacking scenarios, this paper proposes GLDOC, a GCN-based framework for detecting im-documents, which covers the blind spots of past research and discovers unknown attacks. We believe the proposed approach can be easily deployed and widely applied in the real world. To the best of our knowledge, this paper is the first to formulate the maldoc detection problem as a graph learning and classification problem, a direction along which we believe more solutions can be developed. The final simulation demonstrates that our proposed framework maintains stable and high detection accuracy. When tested in the 5-day attacking scenario, GLDOC is capable of discovering unknown vulnerabilities.

It is the first time that GCN has been adopted to solve the document classification task. Specifically, we extract dynamic features from the opening of a maldoc and transform the maldoc detection task into a graph set classification task. We believe this is effective because the inherent operating patterns of malicious documents and benign ones are indeed different, which enables our DL-based framework to distinguish them once all their dynamic behaviors are captured.

However, GLDOC has its drawbacks. First, it has a relatively high FAR. We analyzed the false positive samples and found that most of them modify some low-risk Registry Key. It seems that Registry Key modification is sensitive to the opening of a maldoc, which makes it hard for GLDOC to tell whether a document is malicious or not. As for the false negatives, we found that all of them are vulnerable documents with no obviously malicious behaviors. Their exploitation includes opening a calculator, renaming a directory, starting the browser, etc. We believe this essentially results from the small difference between im-documents and benign ones, especially when a vulnerability is triggered but exposes few malicious behaviors. Second, what we capture centers on I/O-level operations, while the underlying Windows API operations are needed to further improve detection accuracy.

Availability of data materials

Not applicable.

References

  • 2020 Global Advanced Persistent Threat APT Research Report. Available at https://www.freebuf.com/sectool/242507.html

  • A roundup of the world's top 10 APT attacks in 2018. Available at https://www.freebuf.com/articles/193393.html

  • An update on MD5 poisoning. Available at https://blog.silentsignal.eu/2016/11/28/an-update-on-md5-poisoning/

  • Anderson B, Quist D, Neil J, Storlie C, Lane T (2011) Graph-based malware detection using dynamic analysis. J Comput Virol 7(4):247–258

  • Domain takeover report in Hackerone. Available at https://hackerone.com/reports/1253926

  • Hong J, Jeong D, Kim SW (2022) Classifying malicious documents on the basis of plain-text features: problem, solution, and experiences. Appl Sci 12(8):4088

  • https://www.fireeye.com/content/dam/fireeye-www/solutions/pdfs/ig-email-security-gap.pdf

  • Huneault S, Talhi C (2020) P-Code based classification to detect malicious VBA macro. In: 2020 International Symposium on Networks, Computers and Communications (ISNCC), Montreal, Canada, (pp. 20–22)

  • Jiang J, Wang C, Yu M, et al. (2021) NFDD: a dynamic malicious document detection method without manual feature dictionary. Lecture Notes in Computer Science, (pp. 147–159)

  • Khan MS, Siddiqui S, Ferens K (2017) Cognitive modeling of polymorphic malware using fractal based semantic characterization. In: Technologies for Homeland Security (HST), 2017 IEEE International Symposium, Waltham, MA, USA, IEEE, (pp. 1–7)

  • Kim G, Yi H, Lee J, et al. (2016) LSTM-based system-call language modeling and robust ensemble method for designing host-based intrusion detection systems. arXiv preprint arXiv:1611.01726

  • Kim S, et al. (2018) Obfuscated VBA macro detection using machine learning. In: 2018 48th annual IEEE/IFIP international conference on dependable systems and networks (DSN), Luxembourg, (pp. 25–28)

  • Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907

  • Kosoresow AP, Hofmeyer S (1997) Intrusion detection via system call traces. IEEE Softw 14(5):35–42

  • Li W, Su PR, Shi YF (2010) A technique for detecting based on calculation malicious documents of vector spaces. J Grad Sch Chin Acad Sci 2:267–274

  • Li WJ, Stolfo S, Stavrou A, et al. (2007) A study of malcode-bearing documents. Lecture Notes in Computer Science, (pp. 231–250)

  • Liu L, He X, Liu L et al (2019) Capturing the symptoms of malicious code in electronic documents by file’s entropy signal combined with machine learning. Appl Soft Comput 82:105598

  • Long-lasting exploitation of backdoors. Available at https://paper.seebug.org/1007/

  • Maiorca D, Giacinto G, Corona (2012) A pattern recognition system for malicious PDF files detection. In: International Conference on Machine Learning and Data Mining in Pattern Recognition. (pp. 510–524)

  • Meng X, Kim T (2017) PlatPal: detecting malicious documents with platform diversity. In: USENIX Security Symposium, (pp. 271–287)

  • Microsoft bounty program. Available at https://www.microsoft.com/en-us/msrc/bounty

  • Mimura M, Taro O (2020) Using LSI to detect unknown malicious VBA macros. J Inf Process 28:493–501

  • Nissim N, Cohen A, Glezer C, Elovici Y (2014) Detection of malicious PDF files and directions for enhancements: a state-of-the art survey. Comput Secur 49:246–266

  • Nissim N, Cohen A, Elovici Y (2017) ALDOCX: detection of unknown malicious microsoft office documents using designated active learning methods based on new structural feature extraction methodology. IEEE Trans Inf Forensics Secur 12(3):631–646. https://doi.org/10.1109/TIFS.2016.2631905

  • Pytorch geometric. Available at https://pytorch-geometric.readthedocs.io/en/latest/

  • Pytorch. Available at https://pytorch.org/

  • Report for prepared file in Threatcn. Available at https://s.threatbook.cn/report/file/05c2c1cdcafcce4e9c64e900298d0bc07ebd4be9af861da74df79ed5ed36ced8

  • Report for prepared file in VirusTotal. Available at https://www.virustotal.com/gui/file/05c2c1cdcafcce4e9c64e900298d0bc07ebd4be9af861da74df79ed5ed36ced8

  • Ruaro N, Pagani F, Ortolani S (2022) SYMBEXCEL: Automated analysis and understanding of malicious excel 4.0 macros. In: IEEE Symposium on Security and Privacy (SP), San Francisco, USA, (pp. 23–25)

  • Scofield D, Miles C, Kuhn S (2017) Fast model learning for the detection of malicious digital documents. In: Proceedings of the 7th Software Security, Protection, and Reverse Engineering, Software Security and Protection Workshop

  • Security basis: response and feedback. Available at https://blog.csdn.net/wutianxu123/article/details/82940721

  • Total Security. Available at https://weishi.360.cn/?source=homepage/

  • Tan Y, Liu Y, Long G, et al. (2022) Federated learning on non-IID Graphs via structural knowledge sharing. arXiv preprint arXiv:2211.13009

  • Tan Y, et al (2023) Federated learning on non-iid graphs via structural knowledge sharing. Proc AAAI Conf Artif Intell 37(8):9953-9961

  • Top 10 vulnerabilities used by APT organizations in recent years. Available at https://www.freebuf.com/articles/network/168121.html

  • Xu L, Zhang D, Alvarez M A, et al. (2016) Dynamic android malware classification using graph-based representations. In: IEEE 3rd international conference on cyber security and cloud computing (CSCloud). IEEE, (pp. 220–231)

  • Yu M, Jianguo J, Gang L et al (2021) A survey of research on malicious document detection. J Cyber Secur. 6(3):54–76

Acknowledgements

We are very thankful for all the valuable comments given by the anonymous reviewers.

Funding

This work is supported by the National Natural Science Foundation of China (General Program, NO.62176264).

Author information

Contributions

WW: Conceptualization, Methodology, Data curation, Software; PY: Funding, Validation; TK: Writing-Reviewing and Editing; WH: Writing-Original draft; CW: Writing-Reviewing and Editing.

Corresponding authors

Correspondence to Wenbo Wang, Peng Yi or Weitao Han.

Ethics declarations

Competing interests

We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work, there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, W., Yi, P., Kou, T. et al. GLDOC: detection of implicitly malicious MS-Office documents using graph convolutional networks. Cybersecurity 7, 48 (2024). https://doi.org/10.1186/s42400-024-00243-7

  • DOI: https://doi.org/10.1186/s42400-024-00243-7

Keywords