Phishing behavior detection on different blockchains via adversarial domain adaptation

Despite the growing attention on blockchain, phishing activities have surged, particularly on newly established chains. Acknowledging the challenge of limited intelligence in the early stages of new chains, we propose ADA-Spear, an automatic phishing detection model utilizing adversarial domain adaptive learning. The model effectively identifies phishing behavior in new chains with limited reliable labels, addressing challenges such as significant distribution drift, low attribute overlap, and limited inter-chain connections. Our approach includes a subgraph construction strategy to align heterogeneous chains, a layered deep learning encoder capturing both temporal and spatial information, and integrated adversarial domain adaptive learning in end-to-end model training. Validation in Ethereum, Bitcoin, and EOSIO environments demonstrates ADA-Spear's effectiveness, achieving an average F1 score of 77


Introduction
Since the introduction of Bitcoin (Nakamoto 2008) in 2008, blockchain and cryptocurrencies have flourished. According to CoinMarketCap (CoinMarketCap), there are now 25,853 different cryptocurrencies, with a market capitalization exceeding one hundred billion dollars. Typically, a blockchain gives rise to its own cryptocurrency, and this financial characteristic has resulted in a surge of phishing activities. Statistics from Chainalysis (Chainalysis) reveal that since 2017, more than 50% of blockchain security incidents have been linked to phishing. By 2022, the proportion of phishing incidents had risen steadily to over 80%. Consequently, there is an urgent need to research methods for detecting phishing activities across different cryptocurrencies.
Traditional phishing activities generally involve the use of fake websites to induce users to provide private information. Thus, traditional phishing detection focuses on identifying these counterfeit websites and promptly warning users against interacting with them (Jain et al. 2017; Zuraiq and Alkasassbeh 2019; Orunsolu et al. 2022). Phishing activities on the blockchain, however, have developed new patterns. Criminals have shifted their focus from stealing private information to stealing cryptocurrencies, employing a combination of social engineering and technical methods. Upon successfully obtaining cryptocurrencies, they disguise their identities through multiple transactions, increasing the covert nature of their activities. Moreover, as different cryptocurrencies continually emerge, possessing reliable tagged data for each is extremely valuable. Newly emerged cryptocurrencies lack any tagged data, and accumulating a relevant intelligence database takes a significant amount of time. By then, the damage has already occurred, significantly dampening user enthusiasm and hindering the development of new chains.
This paper aims to tackle the challenge of detecting phishing activities in diverse blockchain networks (target blockchains) by utilizing labeled data from source blockchains, as illustrated in Fig. 1. We address the issue of ineffective phishing detection on newly emerged chains during their early stages, which are characterized by limited on-chain annotations and differing data distributions across chains. Current detection methods cannot be directly generalized to effectively detect phishing activities on new chains, resulting in delays in halting phishing behavior during the initial stages of chain emergence. While mature chains like Ethereum possess more abundant data and established detection methods, the lack of information on target chain samples poses a challenge. To address this, we propose an adversarial domain adaptation-based method (Pan and Yang 2009; Ganin et al. 2016; Shen et al. 2018; Goodfellow et al. 2020), named ADA-Spear, for phishing detection on small-sample public chains with limited annotations on the target chain.
The challenges are as follows. Firstly, current detection methods heavily depend on manual feature engineering, require substantial expert knowledge, and are unsuitable for mining deep patterns on the blockchain. Moreover, their generalizability is weak, making them difficult to apply to different chains. Secondly, there exists substantial distribution drift (Wiles et al. 2021) between different chains, accompanied by low attribute overlap. This implies that there are differences in both the data distribution and the feature space between the source and target blockchains. Therefore, the phishing address patterns on the source blockchain are difficult to apply directly to the target blockchain, leading to overfitting of the model to the features of the source chain and a subsequent decrease in the model's generalization ability. Additionally, the presence of coin mixing and other anonymity services prevents the cross-chain transfer of edge information from the source to the target chain, hindering knowledge transfer. Thirdly, trustworthy labels are sparse in the source chain. Even in Ethereum, which has abundant labels, phishing activity labels still account for only about 0.2% of total addresses (Etherscan). This scarcity of usable information in the source chain undermines the robustness of fully supervised learning.
Building upon this, we introduce adversarial domain adaptation techniques from transfer learning and propose a small-sample phishing detection method for public chains. Firstly, we propose a subgraph construction algorithm based on chain structure to transform heterogeneous graphs from different chains into homogeneous graphs, thereby structurally alleviating the problem of significant data distribution drift between chains. Secondly, we introduce a hierarchical representation encoder at both the node and subgraph levels to capture spatial and temporal information of node behaviors, obtaining high-dimensional representations of node features. This encoder is better suited for mining deep patterns in blockchain data. Thirdly, we apply adversarial domain adaptation networks to the node representations across different chains to mitigate the low attribute overlap and data distribution drift between the source and target chains. Simultaneously, the adversarial domain adaptation network effectively enhances knowledge transfer capability when there is no cross-chain edge for information propagation from the source to the target chain, enabling timely and effective phishing detection on new target chains with limited labels (Singh 2019; Sayadi et al. 2019; Chen et al. 2020a, b; Wu et al. 2020; Ao et al. 2021; Xu et al. 2018; Perozzi et al. 2014; Grover and Leskovec 2016; Wang et al. 2016, 2017; Ji et al. 2021; Heimann and Koutra 2017; Hamilton et al. 2017; Kipf and Welling 2016; Zheng et al. 2023; Yang et al. 2016; Fang et al. 2013; Ni et al. 2018; Xu et al. 2017; Pan and Yang 2009; Ganin et al. 2016; Shen et al. 2018; Goodfellow et al. 2020).
Specifically, ADA-Spear aims to characterize address behavior patterns and achieve knowledge transfer of behavior patterns from the source network to the target network, thereby facilitating the detection of phishing activities across different chains. The 'ADA' in ADA-Spear stands for Adversarial Domain Adaptation, and 'Spear' symbolizes the method's ability to penetrate various heterogeneous blockchains for phishing detection. In the subgraph construction method, we treat the target address as the central node and, after obtaining its second-order neighbor nodes, propose a reduction strategy adapted to full-sample gradient descent training in neural networks. Subgraphs are established, abstracting phishing detection into a subgraph classification task (Zhang et al. 2021; Narayanan et al. 2017). The main idea of the detection model is to learn discriminative inter-class subgraph representations through a network encoder with multi-dimensional feature fusion, and invariant subgraph representations across different chains through adversarial domain adaptation. It should be noted that multi-dimensional features refer to temporal behavioral features, spatial behavioral features, and addresses' source data on blocks. Consequently, ADA-Spear consists of an encoder module that characterizes address behavior and a domain adaptation module.
On one hand, the encoder module primarily aims to better characterize address behavior and learn class-level discriminative subgraph representations. It trains an effective classifier using existing labels and considers the temporal evolution of behavior to characterize node-level feature representations and subgraph-level behavior patterns (Jiao et al. 2021). On the other hand, the adversarial domain adaptation module is designed to mitigate the distribution drift between the source and target networks so as to transfer phishing pattern knowledge between chains. It employs adversarial learning to learn invariant subgraph representations across different chains, facilitating the transfer of behavior pattern knowledge from the source network to the target network. The training process is similar to that of Generative Adversarial Networks (GANs) (Goodfellow et al. 2020; Dai et al. 2018; Pan et al. 2018; Dai et al. 2019), where the encoder learns inter-chain invariance, and the domain discriminator distinguishes whether subgraph representations originate from the source or target domain. By combining these two parts, ADA-Spear can learn class-level discriminative subgraph representations and inter-chain invariant representations, facilitating the transfer of class information across different chains.
We choose Ethereum, Bitcoin, and EOSIO as three blockchain platforms, serving respectively as the source and target domains, to research cross-blockchain-network phishing behavior detection. In summary, our contributions are: • We propose ADA-Spear, a novel neural network detection model with adversarial domain adaptation and multi-dimensional feature fusion, tackling the challenge of detecting phishing activities in diverse blockchain networks.

Related work
The detection of phishing activities on blockchain primarily employs feature engineering and graph analysis methods, where graph analysis can further be divided into single-network learning, graph-based semi-supervised learning, and cross-network learning methods.

Feature engineering based method
Feature engineering methods (Sayadi et al. 2019; Singh 2019) rely heavily on expert knowledge of blockchain phishing behavior, making them time-consuming and labor-intensive.

Graph analysis based method

Single network learning method
This is an unsupervised learning method that learns node representations based on the network's topological structure or other information for subsequent tasks such as node classification, as seen in algorithms like DeepWalk. Generally, these methods use unsupervised learning to embed network structural information, followed by training classifiers for node or subgraph classification (Wang et al. 2016, 2017; Ji et al. 2021). Yuan et al. (2020) were the first to use DeepWalk (Perozzi et al. 2014) and Node2vec (Grover and Leskovec 2016) on blockchain to learn topological representations of addresses, later using machine learning for address classification. As Ethereum phishing also exhibits its own behavioral characteristics, Wu et al. (2020) further incorporate transaction information into the embedding process. However, representations learned on a single network do not transfer directly across networks (Heimann and Koutra 2017). Additionally, their classification effectiveness is not as robust as that of graph-based semi-supervised learning methods. Therefore, these methods are also not applicable for detecting cross-chain phishing behaviors.

Graph-based semi-supervised learning method
To further improve the classification of phishing activities, later methods often use graph-based semi-supervised learning (Hamilton et al. 2017).

Preliminaries
This section mainly introduces the definition and feature construction of on-chain interaction graphs, as well as the definition of the problem studied in this paper. Table 1 shows the main symbols used in our framework.

On-chain interaction graph
This subsection defines the Ethereum, Bitcoin, and EOSIO networks modeled as graph structures and gives the graph definitions after reduction.
Ethereum Address Interaction Graph (E-AIG) A directed, weighted, homogeneous multigraph G^e = (V^e, E^e, A^e, F^e_v, F^e_e, Y^e), where V^e, E^e, and A^e respectively represent the set of nodes, the set of directed edges, and the adjacency matrix. The edge set is defined as E^e = {(v_i, v_j) | v_i, v_j ∈ V^e}. F^e_v and F^e_e respectively represent the feature matrices for Ethereum nodes and edges, as detailed in Sect. 3.2. Some nodes in the E-AIG are labeled with y_i ∈ Y^e.

EOSIO Address Interaction Graph (EO-AIG)
The definition is the same as for the E-AIG and will not be repeated.

Bitcoin Address-Transaction Interaction Graph (B-ATIG)
A directed, weighted, heterogeneous multigraph G^b = (V^b_a ∪ V^b_t, E^b, A^b, F^b_v, F^b_e, Y^b), where V^b_a and V^b_t respectively represent the sets of address nodes and transaction nodes, and E^b represents the set of edges between address and transaction nodes. F^b_v and F^b_e respectively represent the feature matrices for Bitcoin address nodes and edges, as detailed in Sect. 3.2. Some address nodes in the B-ATIG are labeled with y_i ∈ Y^b. As the original B-ATIG is a heterogeneous graph, to align with the graph structure of the E-AIG and reduce computational complexity, the strategy proposed in Sect. 4.1 is used to remove transaction nodes from the B-ATIG, converting it into a homogeneous graph, the B-AIG.
Bitcoin Address Interaction Graph (B-AIG) A directed, weighted, homogeneous multigraph G^b = (V^b, E^b, A^b, F^b_v, F^b_e, Y^b), where V^b, E^b, and A^b are the set of address nodes, the transformed set of directed edges, and the adjacency matrix, respectively. The transformed edge set is E^b = {(v_i, v_j) | v_i, v_j ∈ V^b_a}, and F^b_e is the edge feature matrix after the removal of transaction nodes.
E-AIG, EO-AIG, and B-AIG are all densely connected homogeneous multigraphs, which increases the complexity of subsequent knowledge transfer. Thus, through the feature reconstruction, interaction aggregation, and node reduction described in Sect. 4.1, the directed graphs r-AIG (E-AIG→rE-AIG, EO-AIG→rEO-AIG, B-AIG→rB-AIG) are obtained.

Reduced Address Interaction Graph (r-AIG)
A directed, weighted graph G^r = (V^r, E^r, A^r, F^r_v, F^r_e, Y^r), where V^r is the set of nodes after the TopK reduction described in Sect. 4.1, E^r = {(v_i, v_j, F^r_e) | v_i, v_j ∈ V^r} is the remaining set of edges, and A^r is the adjacency matrix. F^r_v and F^r_e are the feature matrices for the nodes V^r and edges E^r, respectively.
Some nodes in the r-AIG are labeled with y_i ∈ Y^r, where y_i = 1 indicates a phishing node and y_i = 0 a benign node.

Feature construction
Node features and edge features in Ethereum, Bitcoin, and EOSIO are categorized into three types: transaction time, transaction amount, and transaction count. Each type is differentiated by direction, i.e., further classified into incoming and outgoing transactions. In multigraphs, node features primarily include overall subgraph information such as lifespan, total amount, degree, and the number of active nodes. Edge features mainly encompass per-transaction information, including block number, timestamp, transaction amount, and transaction fees. Transaction amount-type features utilize the maximum, minimum, and average functions. Similarly, transaction count-type features, following the same approach, are combined with transaction amount-type features to derive per-transaction average features.
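As a concrete illustration, the direction-split amount and count statistics described above can be computed as follows. The tuple layout of `txs` and the field names are illustrative assumptions, not the paper's exact schema.

```python
from statistics import mean

def amount_features(txs, direction):
    """Aggregate transaction amounts for one direction ('in' or 'out').

    `txs` is a list of (direction, timestamp, amount) tuples; this layout
    is a hypothetical stand-in for the actual on-chain record format.
    """
    amounts = [a for d, _, a in txs if d == direction]
    if not amounts:
        return {"max": 0.0, "min": 0.0, "mean": 0.0, "count": 0}
    return {
        "max": max(amounts),    # maximum transaction amount
        "min": min(amounts),    # minimum transaction amount
        "mean": mean(amounts),  # per-transaction average
        "count": len(amounts),  # transaction count in this direction
    }

txs = [("in", 100, 2.0), ("in", 105, 4.0), ("out", 110, 1.0)]
feats = amount_features(txs, "in")
```

Running this on the toy list yields max 4.0, min 2.0, mean 3.0, and count 2 for incoming transactions.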

Problem definition
The primary focus of this paper is detecting cross-blockchain-network phishing behavior through domain adaptation. Specifically, it leverages knowledge of phishing behavior in the source domain to assist in recognizing phishing behavior in a new target domain. This transforms the phishing behavior recognition challenge into a graph classification problem. The source network is denoted as G^s = (V^s, E^s, A^s, F^s_v, F^s_e, Y^s), where V^s is the set of nodes with n^s = |V^s|, E^s represents the set of interaction edges, A^s ∈ R^{n^s × n^s} represents the adjacency matrix, F^s_v ∈ R^{n^s × c^s_n} represents the node feature matrix, with c^s_n being the number of features per node, F^s_e ∈ R^{m^s × c^s_e} represents the edge feature matrix, with m^s = |E^s| and c^s_e being the number of features per edge, and Y^s = {(v_i, y_i) | v_i ∈ V^s} represents the label set for the labeled address nodes, where y_i = 1 indicates a phishing node and y_i = 0 otherwise. Similarly, the target network is denoted as G^t = (V^t, E^t, A^t, F^t_v, F^t_e, Y^t). In this approach, r-AIGs are used as both the source and target networks.
A subgraph centered around v^r is defined as g_{v^r} ⊂ G^r, where r ∈ {s, t}; the source domain is D^s = (g^s_{v^r}, f(g^s_{v^r})), and the target domain is D^t = (g^t_{v^r}, f(g^t_{v^r})), with D^s ≠ D^t, and f(g) being the subgraph classification task. The problem studied in this paper is, given a distributional difference but identical label sets between the source and target domains, to learn a classification function f(g) → y_i with the aid of source domain information, in order to accurately identify the category of g^t_{v^r} in the target domain.

Cross-blockchain-network phishing behavior detection method
In this section, we offer a comprehensive overview of the structure of the ADA-Spear phishing detection method, as illustrated in Fig. 2. ADA-Spear is applicable both to phishing behavior recognition and mining within the same blockchain and to cross-blockchain-network phishing detection. When examining a node v, ADA-Spear takes the r-AIG g_v centered around v as input and outputs the node's label. A label of 1 indicates a phishing node, while 0 represents a benign node. ADA-Spear comprises the following components. Firstly, to characterize on-chain interaction behavior, an r-AIG centered around v is constructed. Secondly, to depict the complex behavior of node v, an r-AIG encoder is designed, incorporating temporal feature analysis and attention mechanisms. Thirdly, an adversarial domain adaptation transfer learning module is developed, adapting the encoder to the address behavior representation on the target chain and identifying phishing behavior on the target chain. Finally, a training module is designed, integrating domain adaptive transfer learning with the binary classification task for training.

Subgraph construction
Ethereum and Bitcoin data both reach million-node scales, resulting in excessively large graph data for their respective AIGs. This poses a challenge for full-batch training in graph neural network (GNN)-based models. Furthermore, an abundance of redundant data hinders effective transfer learning, making the migration of source domain features to the target domain difficult. This may result in issues such as overfitting, impacting the effectiveness of phishing detection. Consequently, this subsection aligns and simplifies E-AIG/EO-AIG and B-ATIG, creating a unified r-AIG. The subgraph construction process is illustrated in Fig. 3.

Sampling original data into multigraphs
In this subsection, the second-order neighbors of the target address are sampled from the original block data of Ethereum and EOSIO to create the subgraph representing the target address, resulting in the multigraph E-AIG (or EO-AIG). Given Bitcoin's organizational structure, which differs from Ethereum's, sampling from Bitcoin's original block data involves the second-order transaction nodes of the target address and the next-order addresses corresponding to those transactions. This process results in a heterogeneous graph containing transaction nodes and address nodes, forming the multigraph B-ATIG.
The design of the sampling method outlined above is driven by two key considerations: a. phishing and benign nodes exhibit distinct behavioral patterns, manifested in their interactions with neighboring nodes, i.e., local structural features on the graph; b. subgraphs, in comparison to the full graph, are significantly reduced in scale and are adaptable to the training method of the model network.
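The second-order sampling step can be sketched in plain Python. The edge-list input and the undirected neighborhood expansion are simplifying assumptions; the actual AIGs are directed multigraphs.

```python
from collections import defaultdict, deque

def two_hop_subgraph(edges, center):
    """Collect the center node plus its first- and second-order neighbors.

    `edges` is a list of (src, dst) pairs, treated as undirected for
    neighborhood expansion (an illustrative simplification).
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    keep = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, hop = frontier.popleft()
        if hop == 2:          # stop expanding at second-order neighbors
            continue
        for nb in adj[node]:
            if nb not in keep:
                keep.add(nb)
                frontier.append((nb, hop + 1))
    # induced subgraph: keep only edges whose endpoints were both sampled
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return keep, sub_edges
```

On a chain a-b-c-d centered at a, this retains {a, b, c} and drops the edge (c, d), which lies beyond the second order.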
To facilitate adaptation to the later-designed encoder and adversarial domain adaptive learning, aligning E-AIG/EO-AIG with B-ATIG is essential. This involves transforming B-ATIG into B-AIG. The transformation process is shown in Fig. 4. In the heterogeneous graph B-ATIG, all nodes of type "transaction" n_tx ∈ N_tx and their corresponding edge pairs (n_tx^pre, n_tx), (n_tx, n_tx^next) are extracted. By removing n_tx, the edge (n_tx^pre, n_tx^next) is formed. The attributes of the edge (n_tx, n_tx^next) are retained as the new edge attributes. This process concludes with the construction of the homogeneous multigraph B-AIG.
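A minimal sketch of the B-ATIG to B-AIG transformation, assuming edges are stored as a mapping from (src, dst) pairs to attribute dictionaries (a simplification of the multigraph structure):

```python
def collapse_tx_nodes(edges, tx_nodes):
    """Remove 'transaction' nodes from a B-ATIG-style edge mapping.

    For every transaction node n_tx, each predecessor address is linked
    directly to each successor address, keeping the attributes of the
    (n_tx, a_next) edge as the new edge attributes, per the construction
    described above. `edges` maps (src, dst) -> attribute dict.
    """
    preds = {}   # tx node -> list of predecessor addresses
    succs = {}   # tx node -> list of (successor address, edge attrs)
    kept = {}
    for (u, v), attrs in edges.items():
        if v in tx_nodes:
            preds.setdefault(v, []).append(u)
        elif u in tx_nodes:
            succs.setdefault(u, []).append((v, attrs))
        else:
            kept[(u, v)] = attrs   # address-to-address edge, keep as-is
    for tx in tx_nodes:
        for a_pre in preds.get(tx, []):
            for a_next, attrs in succs.get(tx, []):
                kept[(a_pre, a_next)] = attrs
    return kept
```

For a single a1 → t1 → a2 chain, this produces one direct edge (a1, a2) carrying the attributes of the outgoing (t1, a2) edge.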

Node and edge feature construction
Following the analysis of phishing behaviors, features exhibit strong correlations with amount, time, frequency, and degree. Node attributes are constructed from in and out, where in and out represent the in-degree and out-degree of the node, respectively. Edge attributes are represented as f_e = (t, a), f_e ∈ F_e, where t indicates the order of the transaction occurrence in the subgraph. Specifically, t = rank(time, g_i), i ∈ {1, 2, ..., count}, with time being the transaction timestamp, g_i being the subgraph containing the target node, and count being the total number of nodes. Additionally, a represents the transaction amount-type attribute.

Merging interactions into a directed graph
To transform multigraphs into directed graphs, edge aggregation is employed, as shown in Fig. 4b, c. For clarity, the figure only displays the aggregation method for the sum of amounts. Given the temporal characteristics of phishing behaviors, the graph preserves crucial temporal information of transactions to the fullest extent. Multiple parallel edges between two nodes are aggregated into a single edge, and the attributes on the edge (v_i, v_j) become ((Σa_{t_1}, Max(a_{t_1}), Min(a_{t_1}), Mean(a_{t_1})), ..., (Σa_{t_M}, Max(a_{t_M}), Min(a_{t_M}), Mean(a_{t_M}))), where Σa_{t_m}, Max(a_{t_m}), Min(a_{t_m}), and Mean(a_{t_m}), m ∈ {1, ..., M}, respectively represent the sum, maximum, minimum, and mean of all transaction amounts at the m-th moment, and M indicates the maximum occurrence sequence of transactions in the subgraph.
It should be noted that different target nodes form different subgraphs, and the number of transactions within different subgraphs may vary. Therefore, the dimensions of the edges in different subgraphs may differ after aggregation. The issue of dimension alignment will subsequently be addressed using a variable-length long short-term memory (LSTM) network.
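The per-time-step aggregation of parallel edges can be sketched as follows; a single amount attribute per transaction is assumed for brevity.

```python
from statistics import mean

def aggregate_parallel_edges(parallel):
    """Merge parallel edges between one node pair into a single edge.

    `parallel` is a list of (t, amount) pairs, t being the transaction's
    order index within the subgraph. For each time step we keep the
    (sum, max, min, mean) of the amounts, mirroring the aggregation
    described above.
    """
    by_t = {}
    for t, a in parallel:
        by_t.setdefault(t, []).append(a)
    return {
        t: (sum(v), max(v), min(v), mean(v))
        for t, v in sorted(by_t.items())
    }
```

Two transactions at step 1 (amounts 2.0 and 4.0) collapse to (6.0, 4.0, 2.0, 3.0); a lone transaction at step 2 repeats its own amount in every slot.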

Graph reduction with directed edges
Although sampling second-order nodes reduces the data volume compared to the full blockchain interaction graph, the data volume remains significant. Therefore, the directed AIG is further reduced, as shown in Fig. 5, while retaining information relevant to phishing detection. In this subsection, the TopK technique is utilized to sample nodes at each order. The strategy is based on the following considerations: when comparing nodes of the same order, nodes with fewer zeros in the feature vectors of their incident edges (c_zero), a larger sum of all amounts in those vectors (a_all), and higher degree (d) are more important and have a higher probability of being retained. In each order, the weight at which each node is retained, denoted w, is computed from c_zero, a_all, and d over the sum of all incoming and outgoing edges, where hop_{v_i} denotes the current order of node v_i and g_{v_i} represents the subgraph to which v_i belongs. After the final iteration and reduction, a reduced sub-directed graph is obtained. For the reduction process in this paper, we choose K = 25.
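A TopK reduction along the lines described above can be sketched as follows. The scoring formula used here (a_all + degree − c_zero) is an illustrative assumption, not the paper's exact retention weight.

```python
def topk_reduce(nodes, k=25):
    """Keep the K most important nodes at one hop order.

    `nodes` maps node id -> (c_zero, a_all, degree). The score rewards
    fewer zero entries, a larger total amount, and a higher degree, per
    the criteria stated above; the weighting itself is hypothetical.
    """
    scored = sorted(
        nodes.items(),
        key=lambda kv: kv[1][1] + kv[1][2] - kv[1][0],  # a_all + d - c_zero
        reverse=True,
    )
    return [n for n, _ in scored[:k]]
```

With K = 2, a node with no zero entries, total amount 10.0, and degree 3 outranks one with five zeros, amount 1.0, and degree 1.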

Encoder network
This section presents a comprehensive overview of the ADA-Spear model's feature extractor f_encod(A, F_v, F_e; θ_encod). It employs a hierarchical attention mechanism based on LSTM to map the behavioral features of target nodes into representation vectors for subsequent transfer learning. The schematic diagram is illustrated in Fig. 6. Additionally, the encoder, in conjunction with the prediction head described in the next section, can also be applied to phishing detection within the same blockchain domain. In the following, the specific details of the encoder are presented.
(1) Since the AIG is a graph centered around addresses, and each node has both incoming and outgoing transactions with its neighboring nodes, it is essential to preserve the temporal characteristics of transaction information. Therefore, in this subsection, the edge features between the central node v_i and its neighboring nodes are aggregated into node v_i. The resulting input vector of the central node v_i, denoted f'_{v_i}, serves as the input to a variable-length LSTM, yielding embedded vectors h^t_{v_i}. This operation is performed for all nodes on the graph, likewise yielding h^t_{v_j}, where v_j denotes a neighbor of v_i. Since the total number of time steps m may vary between subgraphs, a variable-length LSTM is employed to adapt to m. Figure 6a presents a visualization of this process.
Specifically, for any node v_k, k ∈ {i, j}, f'_{v_k} is used as the LSTM input, and the forget gate and input gate are calculated as follows:
f_t = σ(W_f · [h_{t−1}, f'^t_{v_k}] + b_f), i_t = σ(W_i · [h_{t−1}, f'^t_{v_k}] + b_i).
Then, we calculate the current candidate cell state:
C̃_t = tanh(W_C · [h_{t−1}, f'^t_{v_k}] + b_C).
Combining the forget gate and input gate, the current cell state is updated as follows:
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t.
At time step t, the current cell's output hidden layer is expressed as:
o_t = σ(W_o · [h_{t−1}, f'^t_{v_k}] + b_o), h_t = o_t ⊙ tanh(C_t).
Here, σ represents the sigmoid activation function, and t takes values from the set {t_1, t_2, ..., t_m}.
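The recurrence above is the standard LSTM cell; a minimal NumPy sketch follows. The gate packing order and dimensions are implementation choices, not taken from the paper.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of the standard LSTM recurrence.

    W has shape (4*d, d_in + d) and b shape (4*d,), packing the forget,
    input, candidate, and output transforms in that (assumed) order.
    """
    d = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    f = sig(z[:d])                      # forget gate
    i = sig(z[d:2 * d])                 # input gate
    c_tilde = np.tanh(z[2 * d:3 * d])   # candidate cell state
    o = sig(z[3 * d:])                  # output gate
    c = f * c_prev + i * c_tilde        # cell state update
    h = o * np.tanh(c)                  # hidden state output
    return h, c

def run_lstm(seq, d, W, b):
    """Run over a variable-length sequence; the final h embeds the node."""
    h = np.zeros(d)
    c = np.zeros(d)
    for x in seq:
        h, c = lstm_step(x, h, c, W, b)
    return h
```

Because `run_lstm` simply iterates over however many steps the sequence contains, subgraphs with different transaction counts m need no padding, which is the point of the variable-length design.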
Node Level Address Representation Based on Address Source Data on Blocks Subsequently, the node features f_{v_i} are fused with the temporal embeddings h^t_{v_i}, and the same process is applied to the other neighboring nodes. At this point, each node has aggregated edge features into the node and fused them with the original node features, preserving transaction information based on time sequences.
Node Level Address Representation Based on Spatial Behavior After obtaining the features of each node, this step aims to retain the most relevant neighbor interaction information between nodes. In phishing behavior, each neighboring node contributes differently to phishing detection. For example, a phishing address may engage in small transactions with benign addresses to obfuscate its malicious behavior; such transactions can interfere with the detection process and should be excluded. Leveraging this insight, this paper utilizes the graph attention (GAT) mechanism to capture the distinct contribution of each neighbor node to phishing behavior detection and to learn the hidden-layer representation between each pair of nodes. The process is shown in Fig. 6a.
Specifically, for any node v_i in the subgraph, the attention score between it and its neighbor node v_j is calculated as follows:
e_{ij} = LeakyReLU(W_a · [h_{v_i} ∥ h_{v_j}]),
where W_a represents the weight to be learned in a single-layer feedforward network, ∥ denotes concatenation, and LeakyReLU is used as a non-linear activation layer for the subsequent normalization operation.
To ensure that the attention scores between nodes are comparable, a normalization operation is performed:
α_{ij} = exp(e_{ij}) / Σ_{k∈N(i)} exp(e_{ik}),
where N(i) represents all neighboring nodes of node v_i.
Neighboring nodes are aggregated based on their attention scores to obtain the representation of node v_i:
h'_{v_i} = σ(Σ_{j∈N(i)} α_{ij} W_α h_{v_j}),
where W_α represents the weight of the linear layer to be learned, and σ is a non-linear activation function, with ReLU chosen in this case. After iteration, the node-level embedding h_{v_i} is obtained for each node.
The specific iteration process is shown in Fig. 7. Iteration is performed using the graph attention layer at the hop level, which involves propagating, transforming, and aggregating the representations between nodes in each subgraph. This process allows the interaction behavior of nodes at each level to be fully embedded into vectors, where hop represents the order of each subgraph. The initial input to the iteration layer is the node-level representations obtained from the temporal feature extraction, denoted h^0_{v_i} = h_{v_i}, where the neighboring nodes include all first-order nodes in the subgraph. After hop iterations, the final node-level representations for all nodes in a subgraph, denoted H^{hop}_g, contain all the information about second-order nodes in that subgraph.
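A single graph-attention layer matching the score, normalize, and aggregate steps above can be sketched in NumPy as follows. The dense adjacency matrix, self-loops, and a LeakyReLU slope of 0.2 are simplifying assumptions.

```python
import numpy as np

def gat_layer(H, adj, W, a):
    """One graph-attention layer (sketch).

    H: (n, d_in) node features; adj: (n, n) 0/1 adjacency, assumed to
    include self-loops so every row has a neighbor; W: (d_in, d_out)
    linear map; a: (2*d_out,) attention vector.
    """
    Wh = H @ W                                  # shared linear transform
    n = Wh.shape[0]
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = a @ np.concatenate([Wh[i], Wh[j]])
            e[i, j] = z if z > 0 else 0.2 * z   # LeakyReLU, slope 0.2
    e = np.where(adj > 0, e, -np.inf)           # mask non-neighbors
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # row-wise softmax
    return np.maximum(alpha @ Wh, 0.0)          # ReLU-activated aggregation
```

With identity features and a zero attention vector, every neighbor receives equal weight, so each output row is the plain average of the transformed neighbors.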

Subgraph-level behavior pattern representation
This module aims to characterize the behavior patterns of each target subgraph. The identity and methods of the phisher can lead to varied subgraph-level differences in phishing behavior on Ethereum. Therefore, it is necessary to design subgraph-level feature characterization. To overcome the limitations of flat GNNs, this subsection applies DiffPool technology (Ying et al. 2018) hierarchically to aggregate subgraph information, as shown in Fig. 6c. Specifically, the representation H^(0)_g = H^{hop}_g and the subgraph adjacency matrix A^(l)_g are used as inputs to the DiffPool layer, which calculates the representation matrix Z^(l)_g and the assignment matrix S^(l)_g ∈ R^{n_l × n_{l+1}}, where n_l is the number of nodes in the l-th DiffPool layer:
Z^(l)_g = GNN_{l,embed}(A^(l)_g, H^(l)_g), S^(l)_g = softmax(GNN_{l,pool}(A^(l)_g, H^(l)_g)),
where the GNNs contain the learnable parameters. The next layer's inputs are then computed as
H^(l+1)_g = (S^(l)_g)^T Z^(l)_g ∈ R^{n_{l+1} × d}, A^(l+1)_g = (S^(l)_g)^T A^(l)_g S^(l)_g,
where d is the number of columns in Z^(l)_g, which is the feature dimension of nodes in the l-th layer.
We use two layers of DiffPool to characterize subgraph behavior and finally obtain the subgraph-level behavior pattern representation H g , which serves as the output of the encoder.
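One DiffPool coarsening step can be sketched as follows. For brevity, the two GNNs are replaced by single linear maps A @ H @ W, which is a simplification of the actual message-passing networks.

```python
import numpy as np

def softmax_rows(X):
    E = np.exp(X - X.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def diffpool(H, A, W_embed, W_pool):
    """One DiffPool coarsening step (sketch with linear 'GNNs').

    H: (n_l, d) node representations; A: (n_l, n_l) adjacency;
    W_pool's output width sets n_{l+1}, the number of pooled clusters.
    """
    Z = A @ H @ W_embed                  # embedding GNN (linearized)
    S = softmax_rows(A @ H @ W_pool)     # soft assignment matrix (n_l, n_{l+1})
    H_next = S.T @ Z                     # pooled cluster representations
    A_next = S.T @ A @ S                 # coarsened adjacency
    return H_next, A_next
```

Because each row of S sums to 1, the total edge weight of the graph is conserved across pooling (sum of A_next equals sum of A), a useful sanity check on the assignment softmax.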

Label prediction
Both the source domain and the target domain use the methods described in Sect. 4.2 to represent subgraphs as H^r_g, r ∈ {s, t}. The semi-supervised learning classification module is designed to perform binary classification separately on the source and target domains, preparing for the subsequent domain adaptation module.
Specifically, we employ a multi-layer perceptron f_pred(H; θ_pred), where H is the representations of all subgraphs and θ_pred represents the trainable parameters of the classifier. The predicted labels can be expressed as:
ŷ = f_pred(H; θ_pred).
Subsequently, we use the cross-entropy loss function for the training of the classifier f_pred:
L_pred = −Σ_i [y^s_i log ŷ^s_i + (1 − y^s_i) log(1 − ŷ^s_i)].
Similarly, this loss function can also be computed on the target domain.
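The prediction head and its binary cross-entropy loss can be sketched as follows; the two-layer MLP width and activation choices are assumptions, not the paper's exact architecture.

```python
import numpy as np

def predict(H, W1, b1, W2, b2):
    """Two-layer perceptron head: subgraph representation -> phishing prob."""
    hidden = np.maximum(H @ W1 + b1, 0.0)   # ReLU hidden layer
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid for binary labels

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy over the labeled subgraphs."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(
        y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    ))
```

A maximally uncertain prediction of 0.5 on a positive label costs ln 2 ≈ 0.693, while a perfect prediction costs essentially zero.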
The specific iterative process of the address encoder is shown in Fig. 7. A subgraph's A, F_v, and F_e serve as inputs to the LSTM, resulting in the representation h. After passing through two layers of GAT and two layers of DiffPool, the address representation H_g is obtained. A multi-layer perceptron (MLP) is employed for address representation classification. The original label set Y is used as the input to the loss function, which is then compared with the labels predicted by the MLP to update the encoder's parameters.

Adversarial domain adaptation
We use an adversarial domain adaptation module to eliminate the divergence between the Ethereum, Bitcoin, and EOSIO networks, facilitating knowledge transfer between these networks. For each subgraph in each network, we obtain subgraph representations H^r_g, r ∈ {s, t}, through the encoder. To generate similar representations in the source and target domains, we use a fully connected layer as the domain discriminator, denoted f_disc(H^r_g; θ_disc), with input H^r_g = f_encod[(A^r, F^r_v, F^r_e; θ_encod)]_g representing the representation of subgraph g, and θ_encod as trainable parameters. The output is a real number indicating the similarity between the source and target domains.
Next, we let the generator f_encod and the domain discriminator f_disc play against each other, making it impossible for the domain discriminator to distinguish the domain of the samples. Specifically, we first compute the optimal transport distance, also called the Wasserstein distance (Arjovsky et al. 2017; Gulrajani et al. 2017), between the source and target domain distributions:
W_1(P_{H^s}, P_{H^t}) = sup_{‖f_disc‖_L ≤ 1} (E_{H∼P_{H^s}}[f_disc(H)] − E_{H∼P_{H^t}}[f_disc(H)]),
where ‖f_disc‖_L ≤ 1 enforces the Lipschitz continuity condition on the domain discriminator to prevent gradient explosion or vanishing between the generator and the domain discriminator, and sup denotes the supremum. The optimal transport distance is obtained by maximizing the domain discriminator loss function under this condition, where the domain discriminator loss function is given by:
L_disc = E_{H∼P_{H^s}}[f_disc(H)] − E_{H∼P_{H^t}}[f_disc(H)].
This loss encourages the domain discriminator to correctly distinguish the source and target domain representations, while the generator aims to generate domain-invariant representations. The domain discriminator is trained to maximize L_disc, yielding an estimate of W_1(P_{H^s}, P_{H^t}), while the generator is trained to minimize this estimated distance, which promotes domain adaptation and similarity between the source and target domain representations.
To enforce the Lipschitz continuity condition, a gradient penalty term L_penal is applied to θ_disc:

L_penal(Ĥ) = (||∇_Ĥ f_disc(Ĥ)||_2 - 1)^2, (18)

where the representation Ĥ is a random point on the line between a source representation and a target representation, or a point from the source or target domain itself. The subgraph representations are then aligned via the minimax objective

min_{θ_encod} max_{θ_disc} { W1(P_Hs, P_Ht) - γ L_penal(Ĥ) }, (20)

where γ is the gradient penalty coefficient, set to zero during training of the generator. In general, the parameters of the domain discriminator f_disc(·) are trained to optimality before the generator parameters of f_encod(·) are updated to minimize the optimal transport distance.
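The penalty term can be sketched concretely for a linear critic f(h) = h·w, whose input gradient is w everywhere, so the norm is analytic; the interpolation step follows the WGAN-GP recipe cited above, and all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def gradient_penalty(w, H_s, H_t):
    """Gradient penalty sketch for a linear critic f(h) = h @ w.
    H_hat shows the sampling step on the line between paired source and
    target representations; for a linear critic the gradient is w at every
    point, so the penalty reduces to (||w||_2 - 1)^2."""
    eps = rng.random((H_s.shape[0], 1))
    H_hat = eps * H_s + (1 - eps) * H_t   # random points between the domains
    grad_norm = np.linalg.norm(w)         # ||grad f(H_hat)||_2, constant here
    return (grad_norm - 1.0) ** 2

H_s = rng.normal(size=(64, 16))
H_t = rng.normal(size=(64, 16))
w_good = np.ones(16) / 4.0                # ||w|| = 1: already 1-Lipschitz
w_bad = np.ones(16)                       # ||w|| = 4: violates the constraint

print(gradient_penalty(w_good, H_s, H_t))  # 0.0
print(gradient_penalty(w_bad, H_s, H_t))   # 9.0
```

In the real model the critic is nonlinear, so the gradient must be taken at each sampled Ĥ by automatic differentiation rather than read off analytically.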

Model training
Expanding formula (22) and integrating it into the semi-supervised learning prediction loss function yields the final loss function, in which the balance coefficient weights semi-supervised learning against domain adaptation.
The training process is depicted in Algorithm 1.
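The alternating schedule of Algorithm 1 — s_disc discriminator updates toward its optimum before each generator/classifier update — can be sketched structurally as follows; update_disc and update_encoder are hypothetical stand-ins that only count calls rather than perform real gradient steps:

```python
def train(n_epochs=3, s_disc=15):
    """Skeleton of the alternating adversarial training schedule:
    per outer step, the domain discriminator is updated s_disc times
    (maximizing the W1 estimate minus the gradient penalty) before one
    update of the encoder/classifier parameters."""
    calls = {"disc": 0, "encod": 0}

    def update_disc():
        calls["disc"] += 1   # stand-in for a discriminator gradient step

    def update_encoder():
        calls["encod"] += 1  # stand-in for an encoder/classifier step

    for _ in range(n_epochs):
        for _ in range(s_disc):
            update_disc()    # train the critic toward optimality first
        update_encoder()     # then one generator/classifier step
    return calls

counts = train()
print(counts)                # {'disc': 45, 'encod': 3}
```

With the paper's setting s_disc = 15, the discriminator thus receives 15 updates for every encoder update, matching the "train the critic to optimality first" prescription.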

Experiments
In this section, the performance of the proposed method is assessed against current advanced methods through experiments. The analysis includes evaluating the efficiency and scalability of the method. It then investigates diverse cross-blockchain-network phishing behavior patterns. Additionally, it assesses the effectiveness of the different modules in the adversarial domain adaptation architecture and explores the impact of inter-chain distribution differences on the model.

Data preparation
In this subsection, the two most popular blockchain platforms and the largest Initial Coin Offering (ICO) platform are selected as experimental subjects: Ethereum (Etherscan) (ETH), Bitcoin (Nakamoto 2008) (BTC), and EOSIO (EOSIO) (EOS). Transaction data are extracted from Ethereum for 2019-2021, Bitcoin for 2019-2021, and EOSIO for 2018-2019. From each chain, 1000 phishing nodes are randomly sampled, and to balance the training samples, the same number of benign nodes is also obtained. Subgraphs are constructed for each using the method described in Sect. 4.1. Detailed statistics of the dataset are shown in Table 2.
For the three datasets, the reduced second-order subgraphs of phishing and benign nodes are used as the experimental input. The Ethereum dataset contains 1,765,559 nodes and 4,319,271 edges, with phishing nodes accounting for 0.06% of the total; the Bitcoin dataset includes 983,176 nodes and 4,859,188 edges, with phishing nodes making up 0.10%; the EOSIO dataset consists of 746,688 nodes and 9,727,013 edges, with phishing nodes comprising 0.13%. Because EOSIO records interaction information rapidly, it has the highest total number of edges and average degree despite the fewest total nodes. However, the proportion of phishing nodes in EOSIO is comparable to the other datasets, so this does not affect the subsequent experimental analysis.
To further assess the performance of ADA-Spear across these three domains, the evaluation is divided into six transfer learning tasks: ETH → BTC, EOS → BTC, ETH → EOS, BTC → EOS, BTC → ETH, and EOS → ETH, where ETH, BTC, and EOS denote the Ethereum, Bitcoin, and EOSIO datasets, respectively.
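The six tasks are exactly the ordered (source, target) pairs over the three datasets, which can be enumerated directly:

```python
from itertools import permutations

# Ordered (source, target) pairs over the three datasets; naming follows
# the text (ETH, BTC, EOS).
datasets = ["ETH", "BTC", "EOS"]
tasks = [f"{src} -> {tgt}" for src, tgt in permutations(datasets, 2)]
print(len(tasks))   # 6
print(tasks)
```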

Experiment setup
This subsection primarily describes the baseline methods and implementation details.

Baselines
We select methods from different research approaches for comparative experiments, divided mainly into feature engineering, single network learning, graph-based semi-supervised learning, and cross-network learning. The specifics are as follows.
Feature Engineering Inspired by Chen et al. (2020a), we select 219-dimensional features to characterize subgraph behavior for phishing behavior detection.
Single Network Learning We use DeepWalk (Perozzi et al. 2014), Graph2vec (Narayanan et al. 2017), Trans2vec (Wu et al. 2020), and T-Edge (Lin et al. 2020) as comparative methods for extracting subgraph representations. DeepWalk and Graph2vec extract the structural information of subgraphs, while Trans2vec and T-Edge incorporate additional information, such as transaction amounts and timing, on top of structure.
After obtaining subgraph (feature) representations using feature engineering and single network learning methods, logistic regression, random forests, and support vector machines are used for the subgraph classification task.
Graph-based Semi-supervised Learning We employ GCN (Kipf and Welling 2016), GraphSAGE (Hamilton et al. 2017), and MCGC (Zhang et al. 2021) as three deep learning methods that detect phishing behavior in an end-to-end manner. GCN integrates network structure and attribute information. GraphSAGE is a variant of the GCN aggregation function. MCGC is a deep learning method that extracts subgraph information hierarchically. The attributes required for this class of methods are consistent with those proposed in this paper.
Cross-Network Learning We use NetTr (Fang et al. 2013) and CDNE (Shen et al. 2020) as two transfer learning methods. NetTr transfers only network structural information. CDNE introduces an MMD loss function to perform domain adaptation learning in an autoencoder fashion. The attributes required for this class of methods are consistent with those proposed in this paper.

Implementation details
The experiments in this subsection are conducted on a Linux operating system with 128 GB of memory. NetworkX (2020) is used for graph data processing, PyTorch (Paszke et al. 2019) for model construction, and Scikit-learn (Pedregosa et al. 2011) for evaluation metrics.
To better compare the effectiveness of the methods, the subgraph representations all have the same dimensionality of 128. The encoder of the proposed method consists of 1 layer of LSTM, 2 layers of GAT, 2 layers of DiffPool, and 2 layers of MLP, with each hidden layer using a 128-dimensional output. The dropout rate for hidden neurons is set to 0.4. The function f_pred(·) is a multilayer perceptron with a 128-dimensional output layer for label prediction. The domain discriminator f_disc(·) has 2 layers with 128 neurons each, and the balance coefficient, the gradient penalty coefficient γ, and the number of training steps for the domain discriminator s_disc are set to 0.8, 10, and 15, respectively. The learning rates of the encoder and domain discriminator, α_encod and α_disc, are both set to 0.001. The model is trained for 30 epochs.
For all non-cross-network learning methods, we merge the source and target networks into one network for experimentation; in the merged network, there are no edges between the source and target networks. An 8:2 split is therefore used for the training and test sets within each merged network, with five-fold cross-validation. For single network learning methods, the walk length for DeepWalk, Trans2vec, and T-Edge is set to 10, and the context size to 4. The shift parameter α in Trans2vec and T-Edge is set to 0.5. In Graph2vec, the number of training epochs is set to 30, with a learning rate of 0.001. For the subsequent classifiers, the maximum number of iterations in logistic regression is set to 100. The support vector machine uses an RBF kernel with hyperparameter γ set to 1000. In the random forest, the maximum depth is set to 7, with 100 base decision trees.
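The 8:2 evaluation above corresponds directly to five-fold splits, since each fold holds out 20% as the test set; a stdlib index-level sketch (the interleaved fold assignment is an illustrative choice, not the paper's exact split):

```python
def five_fold_splits(n_samples, n_folds=5):
    """Yield (train, test) index lists for five-fold cross-validation:
    each fold holds out 20% of the samples, training on the other 80%."""
    idx = list(range(n_samples))
    folds = [idx[i::n_folds] for i in range(n_folds)]  # interleaved folds
    for k in range(n_folds):
        test = set(folds[k])
        train = [i for i in idx if i not in test]
        yield train, sorted(test)

splits = list(five_fold_splits(10))
print(len(splits))                           # 5 folds
print(len(splits[0][0]), len(splits[0][1]))  # 8 train, 2 test
```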
For graph-based semi-supervised learning methods, the GCN layers in GCN, GraphSAGE, and MCGC are set to 128 dimensions. The number of training epochs is set to 30, with a learning rate of 0.001. For MCGC, the number of aggregation layers is set to 3.
For cross-network learning methods, NetTr and CDNE adopt the hyperparameters recommended in the literature.

Experimental results
This section primarily investigates the comparative effectiveness of the proposed method against current advanced methods. To ensure that the behavior patterns of the three blockchains are sufficiently similar to warrant transfer learning, the target domains in the experiments are divided into fully unlabeled and partially labeled (with a label rate of 5%). Across these three datasets and six transfer learning tasks, ADA-Spear's detection effectiveness is verified to surpass current advanced methods. The main results are presented as F1 scores in Table 3. First, from the table for the target domain with no labels, it can be seen that the proposed method ADA-Spear outperforms all comparison methods, demonstrating its effectiveness in cross-blockchain-network phishing behavior detection. Across the six cross-blockchain-network detection tasks, ADA-Spear's F1 score is, on average, 7.1% higher than the best comparison method.
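For reference, the F1 score reported throughout is the harmonic mean of precision and recall; the counts below are toy values for illustration, not the paper's data:

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision tp/(tp+fp)
    and recall tp/(tp+fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(tp=80, fp=20, fn=20), 3))  # 0.8
```

Because phishing nodes are a tiny fraction of each dataset (0.06%-0.13%), F1 is a more informative metric here than raw accuracy, which a trivial all-benign classifier would dominate.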
The F1 scores of feature engineering and single network learning methods are comparatively low, which aligns with expectations. Feature engineering methods can only depict certain types of phishing behaviors on a specific chain and struggle to comprehensively characterize phishing behaviors across different chains.
Within single network learning methods, DeepWalk and Graph2vec, which consider only network structure, perform very poorly, merely comparable to feature engineering methods. This can be attributed to the lack of interconnectivity between the ETH, BTC, and EOS networks, which makes representation vectors trained on the source and target domains incomparable and explains the subpar detection effectiveness. Moreover, because they incorporate no semantic characterization of on-chain phishing behavior, they cannot match Trans2vec and T-Edge, which introduce semantics. Semantic-aware detection methods outperform those considering only network structure by an average of 7.25% in F1 score, suggesting that phishing behaviors across heterogeneous chains share certain semantic similarities.
Graph-based semi-supervised learning methods show significant improvement over the previous two categories, with MCGC achieving the best results among them. These methods are, on average, 10.04% higher in F1 score than single network learning methods. This improvement can be attributed to the end-to-end learning mode of semi-supervised methods, which adjusts the subgraph behavior representation while classifying, rather than classifying after representation as the previous two unsupervised approaches do. MCGC achieves the best results because it overcomes the flattening issue of GCN-based methods, considering subgraph representation from a hierarchical perspective and portraying behaviors more completely and accurately at the subgraph level.
The cross-network learning methods NetTr and CDNE are the most effective among all comparison methods but still fall well short of the detection effectiveness of the proposed method. ADA-Spear's F1 score exceeds the average of the cross-network learning methods by 7.1%. NetTr, in particular, did not achieve the expected detection performance. This may be because NetTr transfers only the topological structure between the source and target domains, ignoring the semantic information of behaviors, which also indirectly shows that phishing behaviors on the three chains exhibit significant topological drift. The substantial improvement in ADA-Spear's detection effectiveness is mainly due to the integration of more semantic information on phishing behavior, hierarchical behavior characterization, and an adversarial domain adaptation loss function that is more effective than the MMD loss.
From these six transfer learning tasks, it is observed that transfers to EOS are more challenging, which indirectly suggests that the behavior patterns of nodes on EOS differ significantly from those on other chains.However, the detection results are still at a high level, which indicates that although there are distribution differences in behavior, the method proposed in this paper can still mitigate distribution drift and achieve good detection results.
Comparing the tables for target domains with and without labels shows that introducing target domain labels improves the detection performance of all models, with many phenomena similar to those observed in the absence of labels. While detection performance improves with the introduction of labels, the increase is not substantial, which demonstrates the robustness of the model. Furthermore, it indicates that phishing behaviors across these three chains possess certain similarities and that their distributions are consistent, validating the use of domain adaptation methods.

Efficiency analysis
This section conducts an efficiency analysis. First, the relationship between time and cost across different sampling ranges is analyzed, with results shown in Fig. 8. The ratio of sampled nodes and edges to the total number of nodes and edges is used as the cost indicator. The task duration remains essentially the same when the source and target datasets of a transfer task are swapped: paired tasks such as ETH → BTC and BTC → ETH show the same trend as the numbers of nodes and edges increase, with only slight fluctuations in time cost between them. As seen in the figure, time grows with the number of sampled nodes, and the increase tends toward linear. Since the graph data must be read into memory, the node-and-edge ratio represents the memory usage rate, which, as shown in the figure, also increases linearly with the number of sampled nodes. The proposed method therefore tends toward linearity in both time and cost, indicating that ADA-Spear is suitable for large-scale networks and has good scalability.
Second, this section compares the impact of the sampling range on the F1 score for ADA-Spear against two other methods, with results shown in Fig. 9. GCN and CDNE were chosen because they achieved the best F1 scores among the comparison methods. As the figure shows, the F1 scores for all six tasks are generally optimal when K = 25; therefore, K = 25 was chosen for experimentation. Additionally, the F1 score of ADA-Spear consistently exceeds that of GCN and CDNE. The detection effect is not ideal before K reaches 20, improves significantly after K exceeds 20, and declines slightly when K reaches 30. This indicates that the neighboring nodes carry the most information when K is between 20 and 25; beyond 30, redundant information becomes detrimental to the model's learning.
In summary, the proposed method best balances time cost and detection effectiveness, surpassing the other methods.

Ablation experiment
This section investigates the impact of the adversarial domain adaptation module and the encoder module on the overall model's detection performance and conducts a visual analysis.

Impact of the adversarial domain adaptation module
To explore the effect of this module on overall detection performance, this subsection conducts experiments without it; the comparative results are presented in Table 4. As the table shows, the adversarial domain adaptation module mitigates domain discrepancies and enhances detection performance. Removing the module leads to a significant distributional shift, whereas ADA-Spear gives the subgraph representations clearer class boundaries. This demonstrates that the adversarial domain adaptation module effectively alleviates the distribution drift between chains, making a substantial contribution to detecting phishing activities on new blockchains.

Impact of the encoder network
To explore whether the encoder network proposed in this paper accurately captures the distinctive characteristics of phishing and benign behaviors, this section conducts comparative experiments, with results presented in Table 5. As the table shows, the encoder designed in this paper improves all six cross-blockchain-network phishing detection tasks. The averages of Rc, Pr, and F1 increased by 2.85%, 2.81%, and 2.83%, respectively. This demonstrates that the encoder effectively delineates the distinguishing features between phishing and benign behaviors, making it an indispensable part of the model.

Distribution difference analysis
This section discusses the differences in distributions between the source chain and the target chain. Under the six cross-blockchain-network phishing detection tasks, the impact of distribution differences on the detection results is explored by varying the feature overlap degree between the source and target domains. The feature overlap degree is defined as C = |F^s ∩ F^t| / |F^s ∪ F^t|, where F^r (with r ∈ {s, t}) denotes the features in domain r. This is achieved by randomly removing certain attributes, causing C to vary from 10% to 50%. The final results are shown in Fig. 10. The best semi-supervised learning and cross-network learning methods are selected for comparison with the proposed method. As seen in the figure, ADA-Spear consistently outperforms GCN and CDNE across all overlap degrees in all tasks. This indicates that, even with only a small portion of overlapping attributes between two domains, adversarial domain adaptation still makes a significant contribution to the model. This adequately demonstrates the robustness of the proposed method, making it capable of detecting phishing behaviors across an even wider range of new chains.
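The overlap degree C is simply the Jaccard similarity of the two attribute sets; the attribute names below are toy examples, not the paper's actual features:

```python
def feature_overlap(F_s, F_t):
    """Feature overlap degree C = |F_s ∩ F_t| / |F_s ∪ F_t|:
    the Jaccard similarity of the source and target attribute sets."""
    F_s, F_t = set(F_s), set(F_t)
    return len(F_s & F_t) / len(F_s | F_t)

# Toy attribute names (illustrative only)
source_feats = ["amount_mean", "degree_in", "degree_out", "tx_interval"]
target_feats = ["amount_mean", "degree_in", "fee_mean", "tx_interval", "lifetime"]
print(round(feature_overlap(source_feats, target_feats), 2))  # 0.5
```

In the experiment, attributes are randomly dropped from one side to drive C from 10% to 50%, simulating chains whose observable features barely overlap.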

Sensitivity analysis
This section conducts a parameter sensitivity analysis of ADA-Spear to explore the impact of hyperparameters on the model. The sensitivity of the LSTM length m, the number of training steps of the domain discriminator s_disc, the penalty factor γ, and the balance coefficient is analyzed. Since the trends in hyperparameter changes are similar across tasks, this section presents F1 values only for the ETH → BTC task to avoid repetition. Controlled variables are used: when studying a single hyperparameter, the others are kept constant as described in Sect. 5.2.
LSTM Length m The LSTM length determines how many total time steps of subgraph features are received. As shown in Fig. 11a, subgraphs with a time span of more than 30 days achieve good and stable detection results. The stability of the results when the LSTM covers more days demonstrates the robustness of the model. When the length is 10, detection performance is lower, indicating that subgraphs should ideally not be chosen with fewer than 10 steps. The stable performance in the figure also shows that using LSTMs of variable length for detection is feasible.
Domain Discriminator Training Steps s_disc As observed in Fig. 11b, the F1 value increases clearly as s_disc changes from 5 to 10 and then tends to stabilize. This confirms the optimization theory of the domain discriminator: since the parameters of the other components remain unchanged while the domain discriminator is trained, a sufficiently large number of training steps allows it to reach the optimal solution.
Penalty Factor γ The penalty factor weights the gradient penalty applied to the domain discriminator. As Fig. 11c shows, optimal detection performance is achieved when γ = 10; both excessively high and low values degrade the model. Therefore, this paper adopts a penalty factor of 10 for the experiments.
Balance Coefficient The balance coefficient adjusts the trade-off between adversarial domain adaptation learning and semi-supervised learning. As shown in Fig. 11d, detection results improve as the coefficient varies between 0.4 and 1.0, then decline sharply beyond 1.0. The coefficient should therefore be chosen between 0.4 and 1.0 to balance learning class distinguishability against inter-domain similarity.

Conclusion
This paper addresses cross-blockchain-network phishing detection by combining an encoder that effectively characterizes phishing behaviors with ADA-Spear's adversarial domain adaptation module. The encoder combines temporal and topological information to create a hierarchical subgraph representation, while the adversarial domain adaptation module facilitates knowledge transfer from the source domain to the target domain. This combination enables ADA-Spear to be applied effectively to new chains for recognizing phishing behavior. Detailed experiments on Ethereum, Bitcoin, and EOSIO further demonstrate ADA-Spear's advantages in accuracy and robustness. At the same time, ADA-Spear exhibits certain limitations. Its precision considerably improves on existing methods, but there is still room for enhancement before it can rival the detection performance achieved solely within a single chain. Additionally, ADA-Spear may encounter limitations in knowledge transfer when there is significant disparity between two chains. Research on generalizable phishing recognition methods across different chains is still in its early stages; this paper introduces one potential approach, offering a new research perspective to the community.

Fig. 1
Fig. 1 Cross-blockchain-network phishing detection diagram: known features and labels of the source network are used to detect nodes with unknown labels in the target network

Fig. 8
Fig. 8 Impact of sampling range on time and cost (training time, node-to-edge ratio): a ETH → BTC task; b BTC → ETH task; c EOS → BTC task; d BTC → EOS task; e ETH → EOS task; f EOS → ETH task

Fig. 9
Fig. 9 Impact of sampling range on detection results: a ETH → BTC task; b BTC → ETH task; c EOS → BTC task; d BTC → EOS task; e ETH → EOS task; f EOS → ETH task

Fig. 10
Fig. 10 Comparison chart of detection efficacy with different overlap rates of attributes between source and target domains: a ETH → BTC; b EOS → BTC; c ETH → EOS; d BTC → EOS; e EOS → ETH; f BTC → ETH

Table 1
Main symbols used in our framework:
G^s, G^t: graphs in the source and target domains
V^s, V^t: nodes in the source and target domains
E^s, E^t: edges in the source and target domains
A^s, A^t: adjacency matrices in the source and target domains
N^s, N^t: number of subgraphs in the source and target domains
f_encod, f_pred, f_disc: encoder, semi-supervised learning classifier, and domain discriminator
θ_encod, θ_pred, θ_disc: parameters of the encoder, classifier, and discriminator
H^s, H^t: address representations in the source and target domains
Algorithm 1 (fragment): outputs the optimal encoder f_encod, semi-supervised learning classifier f_pred, and domain discriminator f_disc. Initialize θ_encod, θ_pred, and θ_disc; while not converged, update the discriminator with the gradient penalty term λ L_penal(Ĥ), then update θ ← {θ_encod, θ_pred}.

Table 2
Dataset statistics

Table 3
Comparison of F1 scores of ADA-Spear and other methods' detection results, with optimal results within all methods indicated by bolded style and optimal results within each type indicated by wavy lines

Table 3
(continued)

Table 4
Impact of the adversarial domain adaptation module on detection performance

Table 5
Impact of the encoder network on detection performance