To identify unknown applications more effectively, we propose subspace clustering via a graph autoencoder network (SCGAE), which simultaneously uses the statistical features of application flows and the structural information between application flows to analyze and identify the types of unknown applications more comprehensively.
We introduce the framework of SCGAE in this section, as illustrated in Fig. 1. SCGAE is composed of four main modules: (a) a graph autoencoder (GAE) module for mining the statistical characteristics and structural information of application flows at the same time; (b) a self-expression module for integrating latent flow features to construct the coefficient matrix; (c) a clustering module for identifying different applications; and (d) a self-supervised module for constraining the distribution consistency of clustering and pseudo labels.
Constructing the graph
Before introducing SCGAE, it is necessary to construct a suitable application flow graph so that the model can cluster more effectively. For general clustering tasks, the top \(n_k\) nearest samples to each sample can be selected based on similarity. However, in actual network traffic, traffic masquerading and different versions of application protocols change the statistical characteristics of application flows, so that different encrypted traffic types can have similar characteristics. Direct similarity measurement can easily misjudge these encrypted traffic types, which limits the recognition ability of the model. Considering that the communication between host applications tends to be close over a period of time, and that this behavior reflects the spatial distribution characteristics of traffic, we prioritize the IP addresses of each application flow when constructing the graph. In addition, in order to aggregate neighbor features efficiently during message passing, we further apply a flow similarity measure to narrow the distance between samples with similar features. It is worth mentioning that the complexity of flow graph construction increases if the attribute characteristics of the application flows are considered before the IP addresses to which the flows belong. This is because the similarity calculation would then need to consider all application flows, and in complex network environments the number of application flows often far exceeds the number of IP pairs, which greatly increases the computational overhead. Prioritizing structural attributes therefore not only gives high computational efficiency, but also gives the model a certain resistance to interference during training. Figure 2 depicts the proposed flow graph construction method based on flow similarity and IP address association.
First, according to the source and destination IP addresses, a graph \(G(V, E), v \in V, e \in E\) is established, where the vertex set V denotes the set of IP addresses and the edge set E denotes the set of application flows. As shown in Fig. 2a, nodes H, I, J and K denote IP addresses, and the application flows are represented by edges \(f_i,i=1,\ldots ,6\). Different colors indicate different applications. It can be seen that \(f_2\), \(f_4\), and \(f_6\) belong to the same application, while \(f_1\), \(f_2\), and \(f_5\) belong to the same application. This situation is very common in a real network environment. For example, in the same chat software, different users can send audio messages or transfer files. Next, we transform graph G into \(G ^{\star }=(V ^{\star }, E ^{\star })\) as follows.

1. Convert each edge e in the graph G into a vertex \(v^{\star }\) in the graph \(G^{\star }\), and convert each vertex v in the graph G into edges \(e^{\star }\) in the graph \(G^{\star }\). Thus, a vertex \(v^{\star }\) represents an application flow, and an edge \(e^{\star }\) connects two application flows that share an IP address. For example, the two edges between H and I in Fig. 2a, i.e., \(f_1\) and \(f_2\), are converted to two vertices in Fig. 2b. In addition, application flows with the same IP address have more similar location information, and the edges connecting them are recorded in the edge set \(e^{\star }\). For example, the two edges \(f_1\) and \(f_3\) in Fig. 2a both come from or go to the same vertex I, so the vertices \(f_1\) and \(f_3\) in Fig. 2b are connected by an edge.

2. Measure the feature similarity between each vertex in the graph \(G^{\star }\) and the vertices connected to it, and collect the results into a similarity vector sim for each vertex \(v^{\star }\). The similarity is measured only between application flows that share an IP address. In the current study, we choose Euclidean distance as the similarity measure, as shown in Eq. (1), where the feature of each application flow \(f_i(i=1, \ldots , N)\) is a d-dimensional vector, defined by \(f_i =[f_{i1}, f_{i2}, \ldots , f_{id}]\).
$$\begin{aligned} dis_{ij} = \sqrt{\sum _{k=1}^d (f_{ik}-f_{jk})^2}. \end{aligned}$$
(1)

3. Filter the edge set \(e^{\star }\) to obtain the final transformed graph, which is denoted by \(G^{\star }\). Specifically, \(e^{\star }\), the edge set connecting application flows with the same IP address, is filtered by nearest distance: for each vertex, only the edges to its top \(n_k\) nearest vertices are kept, that is, to the flows with the highest similarity and the same IP address.
Finally, the transformed flow graph is shown in Fig. 2c. Taking vertex \(f_1\) as an example, it forms three edges with the vertices \(f_2\), \(f_3\) and \(f_6\) because they have similar features and share an IP address. Constructing the flow graph in this way makes the relationships between application flows in Fig. 2a more directly visible in Fig. 2c.
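The three construction steps can be illustrated with a minimal Python sketch. It is not the reference implementation: the function name `build_flow_graph`, the input layout (a list of `(src_ip, dst_ip)` pairs plus an N x d feature array), and the choice to symmetrize the resulting adjacency matrix are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def build_flow_graph(flows, features, n_k):
    """Sketch of the flow graph construction (hypothetical helper).

    flows:    list of (src_ip, dst_ip) pairs, one per application flow
    features: (N, d) array of statistical flow features
    n_k:      number of nearest IP-sharing neighbours kept per flow
    Returns a symmetric (N, N) 0/1 adjacency matrix over flows.
    """
    features = np.asarray(features, dtype=float)
    n = len(flows)

    # Step 1: flows become vertices; two flows are candidate neighbours
    # if they share at least one IP address.
    by_ip = {}
    for idx, (src, dst) in enumerate(flows):
        by_ip.setdefault(src, set()).add(idx)
        by_ip.setdefault(dst, set()).add(idx)
    candidates = [set() for _ in range(n)]
    for members in by_ip.values():
        for i, j in combinations(sorted(members), 2):
            candidates[i].add(j)
            candidates[j].add(i)

    # Steps 2-3: Euclidean distance (Eq. (1)) to each candidate, then keep
    # only the n_k nearest candidates of every flow.
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        cand = np.array(sorted(candidates[i]), dtype=int)
        if cand.size == 0:
            continue
        dist = np.linalg.norm(features[cand] - features[i], axis=1)
        nearest = cand[np.argsort(dist, kind="stable")[:n_k]]
        # Assumption: the graph is treated as undirected.
        adj[i, nearest] = 1
        adj[nearest, i] = 1
    return adj
```

Note that distances are only computed between flows that already share an IP address, which is what keeps the cost below an all-pairs similarity search.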
GAE module
As mentioned above, the relationships between flows can effectively improve the clustering performance. Therefore, based on Kipf and Welling (2016a) and Hammond et al. (2011), we propose a GAE module that uses the statistical characteristics of application flows and the structural information between application flows at the same time.
Specifically, graph convolution is performed in each GAE layer, and high-order discriminative information is learned based on the feature matrix F and the adjacency matrix A:
$$\begin{aligned} O^{(l)} = \sigma \left( {\hat{D}}^{-\frac{1}{2}}(A+I){\hat{D}}^{-\frac{1}{2}}O^{(l-1)}W^{(l)} \right) , \end{aligned}$$
(2)
where \(W^{(l)}\) and \(O^{(l)}\) represent the weight matrix and output matrix of the l-th GAE layer, respectively, and \(\sigma\) is the activation function. In addition, \({\hat{D}}^{-\frac{1}{2}}(A+I){\hat{D}}^{-\frac{1}{2}}\) is the convolution kernel or filter, and \({\hat{D}}\) is the degree matrix of \(A+I\), where \({\hat{D}}_{ii}=\sum _j(A+I)_{ij}\). Furthermore, the identity matrix is added to the adjacency matrix, i.e. \(A+I\), to give each node a self-loop.
It should be noted that the first layer in the GAE module only uses the feature matrix F as the input matrix:
$$\begin{aligned} O^{(1)} = GAE(F, A) = \sigma \left( {\hat{D}}^{-\frac{1}{2}}(A+I){\hat{D}}^{-\frac{1}{2}}FW^{(1)}\right) \end{aligned}$$
(3)
Then, the output \(O^{(l-1)}\) of the \((l-1)\)-th layer is used as the input matrix of the l-th GAE layer to generate a new output matrix \(O^{(l)}\):
$$\begin{aligned} O^{(l)}= GAE(O^{(l-1)},A) = \sigma \left( {\hat{D}}^{-\frac{1}{2}}(A+I){\hat{D}}^{-\frac{1}{2}}O^{(l-1)}W^{(l)} \right) \end{aligned}$$
(4)
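A minimal NumPy sketch of the layer-wise propagation in Eqs. (2) to (4) is given here for illustration. It assumes that each row of F is one application flow, and it uses `tanh` as a stand-in for the activation \(\sigma\), which is not specified above; the function names are hypothetical.

```python
import numpy as np


def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gae_forward(F, A, weights, activation=np.tanh):
    """Apply Eq. (3) for the first layer and Eq. (4) for the later layers.

    F:       (N, d) feature matrix, one row per flow (assumed layout)
    A:       (N, N) adjacency matrix of the flow graph
    weights: list of weight matrices W^(1), ..., W^(L)
    Returns the list of layer outputs [O^(1), ..., O^(L)].
    """
    A_norm = normalized_adjacency(A)
    outputs, O = [], F
    for W in weights:
        O = activation(A_norm @ O @ W)        # one graph convolution
        outputs.append(O)
    return outputs
```

With this layout, the middle-layer embedding used by the later modules would be `outputs[len(weights) // 2 - 1]`.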
In this study, as in Kipf and Welling (2016b), we choose a simple inner product operation to reconstruct the relationship between samples from the output matrix \(O^{(L)}\) of the last GAE layer:
$$\begin{aligned} {\hat{A}} = Sigmoid\left( O^{(L)T}O^{(L)}\right) \end{aligned}$$
(5)
where \({\hat{A}}\) is the reconstructed adjacency matrix. In addition, the embedding representation in the middle layer of the GAE module, namely \(O^{(\frac{L}{2})}\), is used by the self-expression and self-supervised modules.
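The inner-product reconstruction of Eq. (5) can be sketched as follows. This is an illustration under one assumption about the layout: each row of the embedding is one flow, so the N x N matrix of pairwise inner products is computed as `O @ O.T`. If flows are stored as columns instead, the product is written as in Eq. (5).

```python
import numpy as np


def reconstruct_adjacency(O_last):
    """Inner-product decoder: sigmoid of all pairwise inner products.

    O_last: (N, d) output of the last GAE layer, one row per flow
            (assumed layout).
    Returns the (N, N) reconstructed adjacency matrix.
    """
    logits = O_last @ O_last.T
    return 1.0 / (1.0 + np.exp(-logits))
```

Each entry of the result lies in (0, 1) and can be read as the estimated probability that the two flows are connected in the flow graph.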
Self-expressive module
After creating a well-matched latent space through the GAE module, the goal of the self-expression module is to linearly represent each vertex by integrating the flow features of other vertices. Suppose the latent representation \(O^{(\frac{L}{2})}\) comes from K subspaces, i.e., \(X_i, i=1,\ldots ,K\). Then, each latent feature vector, namely \(O^{(\frac{L}{2})}_i, i=1,\ldots ,N\), can be expressed as a linear combination of other samples in the same subspace:
$$\begin{aligned} O^{(\frac{L}{2})}_i=\left( \alpha _1 O^{(\frac{L}{2})}_1+\cdots +\alpha _{i-1}O^{(\frac{L}{2})}_{i-1}+\alpha _{i+1}O^{(\frac{L}{2})}_{i+1}+\cdots +\alpha _N O^{(\frac{L}{2})}_N\right) /\alpha _i. \end{aligned}$$
(6)
This is the definition of the self-expressive property, and its matrix form is expressed as
$$\begin{aligned} O^{(\frac{L}{2})}=O^{(\frac{L}{2})}S, \end{aligned}$$
(7)
where \(S \in R^{N \times N}\) is the self-expression coefficient matrix with a block-diagonal structure.
When the subspaces are independent, the self-expression matrix S can be obtained by minimizing a norm of S, which turns the problem into the following optimization problem:
$$\begin{aligned} &\min _S \parallel S \parallel _p \quad \\&s.t.\ O^{(\frac{L}{2})}=O^{(\frac{L}{2})}S,\ \text {diag}(S)=0 , \end{aligned}$$
(8)
where \(\parallel \cdot \parallel _p\) represents any regularization norm, and the constraint \(\text {diag}(S)=0\) is introduced to avoid the trivial solution \(S=I\). In order to solve the above optimization problem, we relax Eq. (8) and transform it into:
$$\begin{aligned}&\min _S \parallel S \parallel _p + \frac{\uplambda }{2} \parallel O^{(\frac{L}{2})} - O^{(\frac{L}{2})}S \parallel _F^2 \quad \\&s.t.\ \text {diag}(S)=0 . \end{aligned}$$
(9)
As shown in Ji et al. (2017), the weights of the self-expression module correspond to S, which is further used by the clustering module. It is worth noting that, since the objects belonging to each category have inherent features that differ from those of other groups, generating the self-expression matrix from the latent space learned by GAE can make the spectral clustering in the clustering module more effective.
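To make the relaxed problem in Eq. (9) concrete, the sketch below solves it by projected gradient descent. Several choices here are assumptions for illustration only: the squared Frobenius norm stands in for the unspecified \(\parallel S \parallel _p\), the latent matrix `Z` stores one flow per column so that the matrix form \(O^{(\frac{L}{2})}=O^{(\frac{L}{2})}S\) applies directly, and S is optimized on its own rather than as network weights trained jointly with the GAE.

```python
import numpy as np


def self_expression_objective(Z, S, lam):
    """Eq. (9), with ||S||_F^2 standing in for ||S||_p (assumption)."""
    return np.sum(S ** 2) + 0.5 * lam * np.sum((Z - Z @ S) ** 2)


def fit_self_expression(Z, lam=10.0, n_iter=500):
    """Projected gradient descent on Eq. (9).

    Z: (d, N) latent representation, one column per flow (assumed layout).
    Returns the (N, N) coefficient matrix S with a zero diagonal.
    """
    n = Z.shape[1]
    gram = Z.T @ Z
    # Step size 1/L, where L bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 + lam * np.linalg.norm(gram, 2))
    S = np.zeros((n, n))
    for _ in range(n_iter):
        grad = 2.0 * S - lam * (gram - gram @ S)
        S = S - step * grad
        np.fill_diagonal(S, 0.0)              # enforce diag(S) = 0
    return S
```

When the columns of `Z` come from independent subspaces, coefficients between different subspaces stay at zero, which is the block-diagonal structure described above.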
Clustering module
As described in the self-expression module, after obtaining the self-expression matrix S, we use the clustering module to label different application flows. Before using the clustering algorithm, we first set a threshold to filter the noise in the matrix, that is, for each feature vector we retain only the other samples with high self-expression coefficients. Next, we convert the matrix S into the affinity matrix \(\Lambda\), as follows:
$$\begin{aligned} \Lambda = \frac{1}{2}\Big (\vert S \vert + \vert S \vert ^T\Big ) \end{aligned}$$
(10)
Then, the affinity matrix \(\Lambda\) is used in the spectral clustering method (Ng et al. 2002), which identifies the unknown application flows and yields the clustering result T of the SCGAE model.
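The clustering step can be sketched in NumPy as follows. The thresholding rule (a single global cutoff on the absolute coefficient) is an assumption, since the exact filtering rule is not specified above, and the spectral clustering routine is a compact rendering in the style of Ng et al. (2002) with a deterministic k-means, not the exact procedure used in the experiments.

```python
import numpy as np


def affinity_from_coefficients(S, threshold=0.0):
    """Drop coefficients below the threshold, then apply Eq. (10)."""
    S = np.where(np.abs(S) >= threshold, S, 0.0)
    return 0.5 * (np.abs(S) + np.abs(S).T)


def spectral_clustering(affinity, n_clusters, n_iter=50):
    """Normalized spectral clustering on a precomputed affinity matrix."""
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    X = vecs[:, -n_clusters:]                 # top eigenvectors
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

    # Farthest-first initialisation keeps this small k-means deterministic.
    centers = [X[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels
```

The returned label vector plays the role of the clustering result T.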
Self-supervised module
In an unsupervised task, we cannot tell whether the clustering result T is consistent with the actual labels. Moreover, the embedding representation learned in the GAE module only aims to capture more discriminative information, and has no direct connection with the clustering module. To address this issue, we design an auxiliary task that uses a cross-entropy loss function in a self-supervised module to constrain and integrate the embedding representations learned by the GAE module, making them more suitable for clustering tasks. Specifically, a GAE layer is used to cluster the latent representation \(O^{(\frac{L}{2})}\), which yields the pseudo-label P, where \(P\in R^ {N\times K}\). Then the result \(T\in R^{N\times K}\) obtained in the clustering module can be expressed as a temporary label \(Q\in R^{N\times K}\). The training objective of the self-supervised module is:
$$\begin{aligned} \min \sum _{i=1}^N\sum _{c=1}^K p_{ic} \log \frac{p_{ic}}{q_{ic}}. \end{aligned}$$
(11)
Note that if the cluster label T changes at every iteration, it may hinder the convergence of the model. Therefore, we use a training technique that sets the number of clustering iterations \(T_c\). This means that Q is updated from T only once every \(T_c\) iterations, so the loss function remains stable within each interval of \(T_c\) iterations.
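The objective in Eq. (11) and the periodic refresh of Q can be sketched as below. How T is turned into the temporary label Q is not detailed above, so the smoothed one-hot encoding used here is purely an assumption (smoothing keeps \(q_{ic}\) away from zero); all function names are hypothetical.

```python
import numpy as np


def kl_loss(P, Q, eps=1e-12):
    """Eq. (11): sum over flows i and clusters c of p_ic * log(p_ic / q_ic)."""
    P = np.clip(P, eps, 1.0)
    Q = np.clip(Q, eps, 1.0)
    return float(np.sum(P * np.log(P / Q)))


def labels_to_target(labels, n_clusters, smoothing=0.1):
    """Assumed encoding: smoothed one-hot rows built from cluster labels T."""
    labels = np.asarray(labels)
    Q = np.full((len(labels), n_clusters), smoothing / n_clusters)
    Q[np.arange(len(labels)), labels] += 1.0 - smoothing
    return Q


def should_refresh_target(iteration, T_c):
    """Q is recomputed from T only at iterations 0, T_c, 2 * T_c, ..."""
    return iteration % T_c == 0
```

Inside a training loop, one would call `should_refresh_target` at the start of each iteration and rerun the clustering module only when it returns `True`, reusing the previous Q otherwise.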
Overall loss function
We use the graph and content reconstruction errors as the loss functions of the GAE module, as shown in Eq. (12). Here, by minimizing the loss between A and \({\hat{A}}\), the GAE module can preserve more of the structural relationships between application flows in the embedding representation. This means that application flows formed by the same IP pair have a higher probability of belonging to the same application class than application flows formed by different IP pairs. In addition, we constrain the GAE module to retain enough flow features by minimizing the loss between F and its reconstruction \({\hat{F}}\), i.e., the output \(O^{(L)}\) of the last GAE layer.
$$\begin{aligned} &L_{gaeg}= \frac{1}{2N}\parallel A-{\hat{A}}\parallel _F^2,\\&L_{gaec}= \frac{1}{2N}\parallel F - O^{(L)} \parallel _F^2. \end{aligned}$$
(12)
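As a small illustration of Eq. (12), the two reconstruction losses can be computed as follows. The sketch assumes that the last GAE layer outputs a matrix with the same shape as F, which the content loss requires; the function name is hypothetical.

```python
import numpy as np


def gae_losses(A, A_hat, F, O_last):
    """Graph and content reconstruction losses of Eq. (12).

    A, A_hat:  (N, N) original and reconstructed adjacency matrices
    F, O_last: (N, d) input features and last-layer output (same shape)
    """
    n = A.shape[0]
    loss_graph = np.sum((A - A_hat) ** 2) / (2.0 * n)      # L_gaeg
    loss_content = np.sum((F - O_last) ** 2) / (2.0 * n)   # L_gaec
    return loss_graph, loss_content
```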
Next, in the self-expression module, the loss function consists of a self-expression loss and a regularization loss, as shown in Eq. (13). The self-expression loss makes the embedding representation learned by the GAE middle layer as close as possible to its reconstruction through the self-expression matrix, and the regularization loss prevents the matrix S from becoming too sparse.
$$\begin{aligned}&L_{ser}=\parallel S \parallel _p, \\&L_{se}=\frac{\uplambda }{2}\parallel O^{(\frac{L}{2})} - O^{(\frac{L}{2})}S \parallel _F^2. \end{aligned}$$
(13)
Then, in the self-supervised module, the loss function is:
$$\begin{aligned} L_{ss}=\sum _{i=1}^N \sum _{c=1}^K p_{ic} \log \frac{p_{ic}}{q_{ic}}. \end{aligned}$$
(14)
Finally, the overall loss function of SCGAE can be summarized as follows:
$$\begin{aligned} L_{overall} = \uplambda _1 L_{gaeg} + \uplambda _2 L_{gaec} + \uplambda _3 L_{ser} + \uplambda _4 L_{se} + \uplambda _5 L_{ss}, \end{aligned}$$
(15)
where \(\uplambda _i(i=1,\ldots ,5)\) are the trade-off coefficients.
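Eq. (15) is a plain weighted sum, which the following short sketch makes explicit. The dictionary keys mirror the subscripts in Eq. (15); the function name and the dictionary-based interface are assumptions for illustration.

```python
def overall_loss(losses, lambdas):
    """Eq. (15): weighted sum of the five SCGAE loss terms.

    losses, lambdas: dicts keyed by "gaeg", "gaec", "ser", "se", "ss".
    """
    names = ("gaeg", "gaec", "ser", "se", "ss")
    return sum(lambdas[name] * losses[name] for name in names)
```

Setting a coefficient to zero switches the corresponding module off, which is a convenient way to run ablations over the individual terms.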