NBA: defensive distillation for backdoor removal via neural behavior alignment

Recently, deep neural networks have been shown to be vulnerable to backdoor attacks. In this attack paradigm, a backdoor is inserted into a neural network, compromising its integrity. As soon as an attacker presents a trigger during the testing phase, the backdoor is activated and the network makes specific wrong predictions. Because backdoor attacks are stealthy and dangerous, defending against them is extremely important. In this paper, we propose a novel defense mechanism, Neural Behavioral Alignment (NBA), for backdoor removal. NBA optimizes the distillation process in terms of knowledge form and distillation samples to improve defense performance according to the characteristics of backdoor defense. NBA builds high-level representations of neural behavior within networks in order to facilitate the transfer of knowledge. Additionally, NBA crafts pseudo samples to induce student models to exhibit backdoor neural behavior. By aligning the backdoor neural behavior of the student network with the benign neural behavior of the teacher network, NBA enables the proactive removal of backdoors. Extensive experiments show that NBA can effectively defend against six different backdoor attacks and outperform five state-of-the-art defenses.


Introduction
Recent years have seen deep learning used for a wide range of critical tasks, such as autonomous driving (Grigorescu et al. 2020; Muhammad et al. 2021), facial recognition (Hu et al. 2015; Wang and Guo 2021), and machine translation (Costa-jussà 2018; Koehn 2020). As deep learning expands its application scope, its security issues are also attracting increased attention (Berman et al. 2019; Liu et al. 2021; Guowen et al. 2019). Deep neural networks are key components of deep learning, and their security has long been emphasized in research. Training a deep neural network is expensive and time-consuming, so many users rely on Machine Learning as a Service (MLaaS) (Ribeiro et al. 2015) or directly download pre-trained networks from the Internet. In both cases, a third party handles the training of the network. An honest third party will train normally and return a clean model; a malicious third party, however, may manipulate the training process and return a tainted model. Due to the black-box nature of neural networks (Rudin 2019), users cannot determine whether the model has been maliciously modified. The service model of MLaaS and the black-box nature of the models thus make backdoor attacks possible.
The backdoor attack (Gao et al. 2020) consists of two phases: the implanting phase and the activating phase. A backdoor is implanted during the training of the neural network, for example by tampering with the training data, and is then activated during testing. The main characteristic of a backdoor attack is that the network makes specific incorrect predictions only when a trigger is present in the input; otherwise it behaves normally. As shown in Fig. 1, an image recognition system predicts a "STOP" sign stamped with a trigger as "LIMIT 50". When such a system is deployed for autonomous driving, this kind of backdoor behavior could cause serious traffic accidents. As mentioned earlier, a malicious third party is well positioned to implant a backdoor and return the backdoored network to the user. Users typically also receive a portion of the clean dataset along with the network in order to test whether it performs as expected. Nevertheless, the backdoor cannot be activated by clean data, so a user cannot determine whether the network contains a backdoor.
Defenses based on knowledge distillation are currently considered the most effective methods for mitigating backdoor attacks. NAD (Li et al. 2021) was the first to introduce knowledge distillation into backdoor defense. It uses attention features to represent the neuron activation information inside the network and achieves backdoor defense by aligning the intermediate-layer attention features of the student network with those of the teacher network. The limitation of NAD is that it only involves same-order attention features during knowledge distillation, while the correlation among attention features of different orders is ignored. On this basis, ARGD (Xia et al. 2022) proposes the attention relation graph, which fully considers and uses the relationship between attention features of different orders. As a result, the defense performance is further improved. Both methods share a limitation: they only optimize the knowledge representation, and this representation takes a single form. Knowledge distillation was originally proposed to compress networks, so we argue that simply optimizing the knowledge representation is far from enough to defend against backdoors.
In this work, we propose a new defense mechanism called NBA, which simultaneously optimizes knowledge representation and training samples according to the characteristics of backdoor defense. In terms of knowledge representation, NBA defines and extracts three types of neural behavior from within the neural network to fully represent the knowledge of the network. By optimizing the corresponding loss functions, the student network is encouraged to align its neural behavior with that of the teacher network, resulting in better training results. By comparison, the knowledge representation used by NAD and ARGD can essentially be regarded as one of the kinds of neural behavior used by NBA. In terms of training samples, we construct pseudo-poisoned samples and feed them to the student network. Once the backdoor neural behavior is exposed, NBA can remove the backdoor more thoroughly. Based on these optimizations, NBA achieves better defensive performance than NAD and ARGD.
In summary, we make the following contributions:
• We propose novel forms of knowledge and extract neural behavior as an efficient representation of the knowledge to be transferred. By aligning the neural behavior of the teacher and student networks during defensive distillation, the student achieves better learning results than with other distillation-based defenses (Li et al. 2021; Xia et al. 2022).
• We optimize original training samples into pseudo samples that induce the student network to exhibit backdoor neural behavior. On this basis, the backdoor in the student network can be actively removed when combined with the neural behavior alignment mechanism.
• We conduct extensive experiments on a number of well-known backdoor attacks. The experimental results corroborate the effectiveness and generality of our approach.

Backdoor attack
We refer to a neural network that has been implanted with a backdoor as a backdoored network, and to a sample that has been injected with a trigger as a poisoned sample. The backdoored network exhibits backdoor behavior when it takes a poisoned sample as input, namely, it makes a specific wrong prediction.
Existing backdoor attacks can be divided into poison-label attacks and clean-label attacks according to whether the label of the poisoned sample is modified. Poison-label attacks require the attacker to modify both the samples and the labels, so that the mapping between the trigger and the target label can be directly established. BadNets (Gu et al. 2017) is the first and most representative poison-label attack. Subsequent poison-label attacks improve on BadNets from the perspective of trigger design (Liu et al. 2018, 2020), trigger implanting (Chen et al. 2017), and others. Clean-label attacks (Turner et al. 2019; Barni et al. 2019) are designed to resolve the semantic inconsistency between poisoned samples and their labels, and these methods often need to impose additional constraints on samples from the target label.
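The difference between the two attack families can be illustrated with a minimal sketch. It assumes images are plain 2-D lists of pixel values and the trigger is a small patch pasted into a corner; the helper names `stamp_trigger` and `poison_dataset` are hypothetical and do not come from any of the cited attacks. The clean-label branch is deliberately simplified: it only restricts poisoning to samples that already carry the target label and omits the additional constraints that real clean-label attacks impose.

```python
import random


def stamp_trigger(image, trigger, top=0, left=0):
    """Return a copy of `image` (2-D list) with `trigger` pasted at (top, left)."""
    stamped = [row[:] for row in image]
    for i, trigger_row in enumerate(trigger):
        for j, value in enumerate(trigger_row):
            stamped[top + i][left + j] = value
    return stamped


def poison_dataset(dataset, trigger, target_label, rate, clean_label=False, seed=0):
    """Poison a fraction `rate` of the eligible (image, label) pairs.

    Poison-label: any sample may be stamped, and its label becomes the target.
    Clean-label (simplified assumption): only samples that already carry the
    target label are stamped, so no label is ever changed.
    """
    rng = random.Random(seed)
    eligible = [i for i, (_, label) in enumerate(dataset)
                if not clean_label or label == target_label]
    chosen = set(rng.sample(eligible, int(rate * len(eligible))))
    poisoned = []
    for i, (image, label) in enumerate(dataset):
        if i in chosen:
            poisoned.append((stamp_trigger(image, trigger), target_label))
        else:
            poisoned.append((image, label))
    return poisoned
```

In the poison-label case the trigger-to-target mapping is established directly by relabeling, whereas in the clean-label case the labels of the returned dataset are identical to those of the input.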
In this paper, we select well-known clean-label and poison-label attacks for our experiments, so as to fully illustrate the generality and effectiveness of NBA.

Backdoor defense
Existing defense schemes can be divided into certified defenses and empirical defenses. Certified defenses (Weber et al. 2020; Jia et al. 2022) can theoretically ensure a certain degree of robustness, but their assumptions tend to be strong, so they are not as effective as empirical defenses in practical situations. According to the purpose and object of defense, empirical defenses can be classified into four categories: (1) poisoned sample detection (Zeng et al. 2021; Hayase et al. 2021), (2) invalidation of triggers injected into samples (Qiu et al. 2021; Doan et al. 2020), (3) network detection (Xu et al. 2021; Zheng et al. 2021), and (4) removal of backdoors implanted into networks (Liu et al. 2018; Wu and Wang 2021). Since the purpose of the defense is to prevent poisoned samples from activating the backdoor, the defense only needs to be implemented on either the input side or the model side. The first two types of methods protect the input side by detecting poisoned samples or by destroying triggers in the input. The latter two types defend on the model side by detecting backdoored networks or removing the backdoors in them.
We argue that a backdoor attack stems from the backdoor implanted in the model, so a defense scheme that removes the backdoor addresses the problem at its source. NBA aims at eliminating the backdoor from the backdoored network. Based on our proposed neural behavior alignment and pseudo-poisoned samples, NBA can further remove backdoors while improving the benign performance of the backdoored network.

Knowledge distillation
As a classic deep learning technique, knowledge distillation is often used in neural network compression and transfer learning. In knowledge distillation, a well-trained network usually serves as the teacher network, and a network that lacks training is called the student network. The teacher network guides the student network to learn, and studies have shown that under this learning paradigm the student network can achieve better results than learning by itself (Hinton et al. 2015). In most scenarios where knowledge distillation is used, the structure of the teacher network is more complex than that of the student network, but Furlanello et al. (2018) show that the student network can even outperform the teacher network when they have the same architecture. Hinton et al. (2015) first introduced knowledge distillation in deep learning, using softened predictions as the knowledge to be transferred. Since then, much work has improved the efficiency of knowledge distillation by designing new forms of knowledge to be transferred. Representative improvements include using intermediate features (Romero et al. 2015; Zagoruyko and Komodakis 2017), relationship features (Yim et al. 2017; Park et al. 2019), and structure features (Liu et al. 2020; Xixia et al. 2020).
Based on Furlanello et al. (2018) and Hinton et al. (2015), we argue that knowledge distillation can be applied to backdoor removal. Existing work confirms this: good results have been achieved in defensive distillation with attention maps (Li et al. 2021) and with a corresponding improvement (Xia et al. 2022). Accordingly, defensive distillation may provide a promising method of defending against backdoors. In addition, Ge et al. (2021) have considered the backdoor failure problem that may be caused by knowledge distillation and proposed a targeted optimization. However, since its threat model and attack scenarios are not consistent with those discussed in this article, we do not analyze it.

Threat model
We consider a common scenario in which the training process of the network is outsourced to a third party. It covers the cases where the user downloads a trained model directly from the Internet or customizes a trained network through MLaaS.
The attacker is free to implant backdoors into the network in any manner. Trigger patterns, target labels, and poisoning rates can all be set arbitrarily. Here, we assume that the network has been successfully implanted with a backdoor and returned to the user. The user is often provided with a portion of the clean dataset to confirm the usability of the network once it has been returned (or released). The network is expected to perform well on this dataset.
No details about the training process or the attack method are available to the defender, who is only provided with a trained network and a small portion of the clean dataset. Because it is unknown whether a given network contains a backdoor, the proposed defense method should be network-agnostic. If the network is backdoored, the defense should remove the backdoor without significantly degrading its normal performance. If the network is clean, the defense should not significantly affect its performance.

Overview
Figure 2 illustrates the pipeline of using NBA for backdoor defense. It consists of two steps: first, fine-tuning the backdoored network to obtain the teacher network, and then, through defensive distillation, aligning the neural behavior of the student network to remove the backdoor.
As shown in Fig. 2a, the defender fine-tunes the given network on a local clean dataset to obtain the teacher network. Figure 2b illustrates the subsequent defensive distillation step. Because knowledge distillation was originally proposed for neural network compression, it does not provide good results when applied directly to backdoor defense. To obtain satisfactory defense performance, we optimize the knowledge representation and the training samples used in knowledge distillation in accordance with the characteristics of backdoor defense.
Our improvements to defensive distillation are inspired by real-life teaching experience. As an analogy, we compare the behavior of the backdoored network in processing samples to that of students in solving problems. Knowledge distillation can then be viewed as the process by which teachers instruct students in the proper method of solving problems. Two lessons can be drawn from practical teaching experience.
Firstly, teachers should provide students with a complete understanding of problem solving, including intermediate steps, intermediate answers, and final solutions. The student will not be able to fully comprehend the correct method of solving problems if any of these are missing. To simulate this process, NBA extracts and aligns three kinds of neural behavior within the student and teacher networks. Each of the three has its own focus, and their combination facilitates comprehensive learning by the student network of the knowledge imparted by the teacher network. Secondly, teachers provide concentrated guidance on the problems students are prone to get wrong and correct the students' faulty problem-solving methods. Inspired by this, we construct pseudo samples to induce the student network to actively exhibit backdoor neural behavior, and align it with the benign neural behavior of the teacher network to remove the backdoor more effectively.
Based on this, we propose a learning distillation loss and an unlearning distillation loss, which encourage the student network to learn benign knowledge efficiently and to actively unlearn backdoor knowledge.

Fig. 2 Overview of NBA. NBA consists of two main procedures to remove the backdoor: (1) fine-tuning: fine-tuning on local clean data to obtain the teacher network; (2) defensive distillation: extracting and aligning high-level representations of neural behavior from the teacher network and the student network. The backdoor is eliminated from the student network by optimizing the two distillation loss functions adopted in the defensive distillation phase.

Neural behavior
The definition and extraction of the neural behavior of a network are the keys to NBA. We define two types of neural behavior, namely response neural behavior and learning neural behavior, corresponding respectively to the intermediate answers and the intermediate steps in the problem-solving process. Figure 3 shows the procedure for extracting these two kinds of neural behavior. In addition, we introduce the dark knowledge proposed by Hinton et al. (2015) as prediction neural behavior to represent the final answer to the problem.

Response neural behavior
Previous studies (Zagoruyko and Komodakis 2017; Li et al. 2021; Xia et al. 2022; Romero et al. 2015) have shown that feature maps can represent the response of neurons inside the network to input samples. We extract the feature maps of each intermediate layer of the network and regard them as the original representation of the response neural behavior of the model. To capture the focus of the response neural behavior, inspired by Gatys et al. (2016), we use Gram matrices to capture the key features of the feature maps. The Gram matrices obtained here are called response matrices, and they serve as the high-level representation of the response neural behavior.
Let $C_l$, $H_l$ and $W_l$ denote the number, height and width of the feature maps in the l-th layer, so that the feature maps of the teacher network and the student network in that layer are tensors in $\mathbb{R}^{C_l \times H_l \times W_l}$. The response matrices in the l-th layer of the teacher network and the student network are denoted as $G_T^l \in \mathbb{R}^{C_l \times C_l}$ and $G_S^l \in \mathbb{R}^{C_l \times C_l}$. Each entry of a response matrix $G^l$ is the inner product between two flattened feature maps of the corresponding network in layer l. The loss $L_{RNB}$ is defined through the mean squared error between the response matrices $G_T^l$ and $G_S^l$ for l = 1, 2, ..., n, with $L^l_{RNB}$ denoting the error term of the l-th layer. By optimizing Eq. (2), the student network is encouraged to align its response neural behavior with that of the teacher network.
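The response matrix and its alignment loss can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: each layer is given as a list of channels already flattened to length H·W, the Gram matrix is left unnormalized, and the per-layer errors are summed across layers (the source does not say whether they are summed or averaged). The function names are hypothetical.

```python
def gram_matrix(feature_maps):
    """Gram (response) matrix of one layer.

    `feature_maps` is a list of C channels, each a flat list of H*W activations.
    Entry (i, j) is the inner product of channel i and channel j.
    """
    c = len(feature_maps)
    return [[sum(a * b for a, b in zip(feature_maps[i], feature_maps[j]))
             for j in range(c)]
            for i in range(c)]


def mse(g_teacher, g_student):
    """Mean squared error between two C x C matrices."""
    c = len(g_teacher)
    total = sum((g_teacher[i][j] - g_student[i][j]) ** 2
                for i in range(c) for j in range(c))
    return total / (c * c)


def response_loss(teacher_layers, student_layers):
    """Per-layer MSE between response matrices, summed over layers (assumption)."""
    return sum(mse(gram_matrix(t), gram_matrix(s))
               for t, s in zip(teacher_layers, student_layers))
```

Because the Gram matrix has shape C × C regardless of the spatial size, it summarizes which channels fire together rather than where they fire, which is why it can act as a compact, higher-level target than the raw feature maps.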

Learning neural behavior
Learning neural behavior is used to simulate the intermediate problem-solving steps in actual teaching.
Learning neural behavior is defined as the transformation between the response neural behaviors of adjacent layers. We denote by $M^l$ the learning matrix between the l-th layer and the (l + 1)-th layer; $M_T^l$ and $M_S^l$ are computed in this way for the teacher network and the student network.
Given n response matrices, there are n − 1 learning matrices. By optimizing a cross-entropy loss over the learning matrices of the teacher and student networks, we encourage the student network to align its learning neural behavior with the teacher's.

Prediction neural behavior
We introduce the dark knowledge proposed by Hinton et al. (2015) as the prediction neural behavior. Usually, the output of the model is obtained by processing the logits through softmax. The neural behavior that can be represented in this output is limited. To this end, we follow the procedure of Hinton et al. (2015), introducing a temperature T to soften the result of the softmax as the prediction neural behavior. The prediction neural behavior for class i is calculated as $\exp(z_i / T) / \sum_{j=1}^{k} \exp(z_j / T)$, where $z_i$ and $z_j$ are logits and k is the number of classes. Here we set T = 5.

Fig. 3 Illustration of extracting response neural behavior and learning neural behavior
Therefore, the alignment of prediction neural behavior can be implemented by optimizing the KL-divergence between the prediction neural behavior of the teacher network, computed from its logits u, and that of the student network, computed from its logits z. We denote this loss by $L_{PNB}$.
By optimizing $L_{PNB}$, the student model is encouraged to align its prediction neural behavior with that of the teacher.
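The softened prediction and its alignment can be sketched as follows. This is a minimal sketch under assumptions: the divergence is taken in the direction KL(teacher ‖ student), as is common in Hinton-style distillation, and no extra scaling factor is applied; neither detail is confirmed by the source. The function names are hypothetical.

```python
import math


def soften(logits, temperature=5.0):
    """Softmax of logits divided by a temperature (T = 5 as in the text)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def prediction_loss(teacher_logits, student_logits, temperature=5.0):
    """KL divergence between softened teacher and student predictions.

    The direction KL(teacher || student) is an assumption of this sketch.
    """
    teacher_probs = soften(teacher_logits, temperature)
    student_probs = soften(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)
```

A higher temperature flattens the distribution, so the relative scores of the non-predicted classes (the "dark knowledge") contribute more to the loss than they would under a plain softmax.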

Learning distillation loss
Based on the three losses defined in the "Neural behavior" section, we define the NBA learning distillation loss $L_{LDL}$ to encourage the student network to fully learn the knowledge transferred by the teacher network during defensive distillation. $L_{LDL}$ is a weighted combination of the three losses evaluated over the local dataset D, where T and S are the teacher and student networks. Specifically, T(·) and S(·) denote a different form of knowledge in each loss. The coefficients of the three loss terms (i = 1, 2, 3) are hyperparameters controlling the weight of each term; they are set to 2.0, 2.0 and 0.1, respectively.
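One plausible way to write this weighted combination is given below. It is a hedged reconstruction, not the paper's equation: the symbol $\lambda_i$ for the weights, the loss name $L_{LNB}$ for the learning neural behavior term, and the assignment of the weights in the order response, learning, prediction are all assumptions.

```latex
\mathcal{L}_{LDL} = \mathbb{E}_{x \sim D} \Big[
    \lambda_1 \, \mathcal{L}_{RNB}\big(T(x), S(x)\big)
  + \lambda_2 \, \mathcal{L}_{LNB}\big(T(x), S(x)\big)
  + \lambda_3 \, \mathcal{L}_{PNB}\big(T(x), S(x)\big)
\Big], \qquad (\lambda_1, \lambda_2, \lambda_3) = (2.0,\ 2.0,\ 0.1)
```

Under this reading, the two feature-level terms dominate the objective and the softened-prediction term acts as a lighter constraint on the final output.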

Unlearning distillation loss
Unlearning distillation loss is proposed to correct the backdoor behavior of the student network and improve its generalization. The core idea is to construct pseudo samples and to input the original samples and the pseudo samples into the teacher network and the student network, respectively. By optimizing this loss function, the student network further eliminates the backdoor behavior, that is, it actively unlearns the backdoor knowledge.
The student network typically shows backdoor behavior only in the presence of poisoned samples, but defenders have only clean samples. We introduce adversarial attacks to address this challenge.
Typically, a backdoored network is obtained by optimizing, during training, a loss function with two terms. Here l(·) denotes a loss function such as the cross-entropy loss, $D_c$ and $D_p$ denote the clean and poisoned subsets of the training dataset, △ denotes the trigger, and $y_t$ denotes the target label. The function of the second term is to implant the backdoor. Compared with training a clean neural network, optimizing this loss function creates a shortcut in the network's decision-making process toward the target label. Adversarial attacks on the network are affected by this shortcut accordingly.
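A commonly used form of such a two-term backdoor training objective is sketched below. It is a reconstruction under assumptions rather than the paper's exact equation: the network is written as $f_\theta$ (a symbol not in the source), and the trigger is assumed to be applied additively, which does not hold for every attack.

```latex
\mathcal{L}_{bd} =
    \mathbb{E}_{(x, y) \sim D_c} \big[ l\big(f_\theta(x),\, y\big) \big]
  + \mathbb{E}_{(x, y) \sim D_p} \big[ l\big(f_\theta(x + \triangle),\, y_t\big) \big]
```

The first term preserves normal behavior on clean inputs, while the second ties the trigger △ to the target label $y_t$ regardless of the true label y.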
We conduct an untargeted attack on the student network to craft pseudo samples x′ = x + δ, where δ is an adversarial perturbation. The perturbation is obtained by solving an optimization problem with gradient-based adversarial methods, such as that of Madry et al. (2018). According to their predicted labels, pseudo samples can be divided into two categories. The samples that are predicted as the target label are pseudo-poisoned samples; during the optimization they converge to the local extremum caused by the backdoor. This means that δ and △ are strongly related. Figure 4 visualizes the feature maps generated by pseudo-poisoned samples and poisoned samples. They are strongly similar, which indicates that both can activate the network to exhibit similar behavior. The other samples obtained in this way are called pseudo-clean samples, and their predictions are inconsistent with the target label.
Unlearning distillation loss is defined over the original samples, which are fed to the teacher network, and the pseudo samples, which are fed to the student network. During the optimization of this loss, the two types of pseudo samples play different roles, but both contribute to the defense performance. In particular, the pseudo-poisoned samples induce the student network to exhibit backdoor behavior, allowing backdoor knowledge to be removed more efficiently, while the pseudo-clean samples essentially act as a regularizer.
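The crafting of a pseudo sample by an untargeted, gradient-based attack can be sketched as follows. This is a toy illustration in the style of projected gradient descent (Madry et al. 2018), under loud assumptions: a linear softmax classifier stands in for the student network so that the input gradient has a closed form, inputs lie in [0, 1], the perturbation is bounded in the L∞ norm, and the values of `epsilon`, `step` and `iters` are arbitrary. All function names are hypothetical.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_of(weights, bias, x):
    """Logits of a linear classifier: one weight row and one bias per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def predict(weights, bias, x):
    scores = logits_of(weights, bias, x)
    return scores.index(max(scores))


def cross_entropy(weights, bias, x, label):
    return -math.log(softmax(logits_of(weights, bias, x))[label])


def input_gradient(weights, bias, x, label):
    """Gradient of the cross-entropy w.r.t. x, i.e. W^T (p - onehot(label))."""
    probs = softmax(logits_of(weights, bias, x))
    residual = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [sum(residual[k] * weights[k][d] for k in range(len(weights)))
            for d in range(len(x))]


def pgd_untargeted(weights, bias, x, label, epsilon=0.3, step=0.1, iters=10):
    """Craft x' = x + delta with ||delta||_inf <= epsilon by ascending the loss."""
    x_adv = list(x)
    for _ in range(iters):
        grad = input_gradient(weights, bias, x_adv, label)
        # Signed gradient ascent step on the loss of the current label.
        x_adv = [xa + step * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Project back onto the epsilon-ball around x, then onto [0, 1].
        x_adv = [min(max(xa, xo - epsilon), xo + epsilon)
                 for xa, xo in zip(x_adv, x)]
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv
```

In a backdoored network, the text argues that many such perturbed samples end up predicted as the target label; those would be the pseudo-poisoned samples, and the remainder the pseudo-clean samples. The toy classifier here has no backdoor, so it only demonstrates the crafting step itself.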

Total loss
The overall objective of defensive distillation combines the losses above, where α controls the weight of the distillation loss in the total loss, and β controls the weight of the unlearning distillation loss within the distillation loss. We set α = 1.0 and β = 0.5 in this paper.

Experimental settings
Attack setups
We conduct experiments on 6 representative backdoor attacks, which have their own distinct characteristics in trigger design (BadNets (Gu et al. 2017), TrojanNN (Liu et al. 2018) and Refool (Liu et al. 2020)), trigger injection (Blend (Chen et al. 2017)), and label modification (CLA (Turner et al. 2019) and SIG (Barni et al. 2019)). The poisoned samples constructed by these methods are shown in Fig. 5. We follow the settings given in the original papers, such as trigger patterns and target labels. We evaluate all attacks and defenses on CIFAR10 and GTSRB with WideResNet (WRN-16-1) (Zagoruyko and Komodakis 2016). Other attack details are described in Table 1.

Defense setups
We compare our approach with state-of-the-art defenses, including Fine-tuning (FT) (Papernot et al. 2016), Fine-pruning (FP) (Liu et al. 2018), Mode Connectivity Repair (MCR) (Zhao et al. 2020), Neural Attention Distillation (NAD) (Li et al. 2021), and Attention Relation Graph Distillation (ARGD) (Xia et al. 2022). The defenses can be compared fairly since they are based on the same threat model. Consistent with previous work (Li et al. 2021; Xia et al. 2022), we assume that the defender has access to 5% of the clean training set. We use a batch size of 64 and an initial learning rate of 0.1, and train the network using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9.

Metrics
We use attack success rate (ASR) and benign accuracy (BA) to evaluate the effectiveness of the defenses. The lower the ASR and the higher the BA, the better the defense method.
• Attack Success Rate (ASR): the proportion of the poisoned testing set that is predicted as the target class.
• Benign Accuracy (BA): the proportion of the clean testing set that is predicted as the ground-truth class.
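The two metrics follow directly from their definitions and can be computed as in the sketch below. The function names are hypothetical, predictions are assumed to be lists of integer class labels, and both metrics are reported in percent as an assumption about the units used in the tables.

```python
def attack_success_rate(predictions_on_poisoned, target_label):
    """Percentage of poisoned test samples predicted as the target class."""
    hits = sum(pred == target_label for pred in predictions_on_poisoned)
    return 100.0 * hits / len(predictions_on_poisoned)


def benign_accuracy(predictions_on_clean, true_labels):
    """Percentage of clean test samples predicted as their ground-truth class."""
    hits = sum(pred == label
               for pred, label in zip(predictions_on_clean, true_labels))
    return 100.0 * hits / len(true_labels)
```

A successful defense drives the first number toward zero while keeping the second close to the accuracy of the original network on clean data.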

Effectiveness of NBA
We present the detailed performance comparison in Table 2. Overall, our proposed method achieves good defensive performance. We further illustrate this from two aspects.
We first analyze the defensive effects of NBA against different attack methods. NBA is consistently effective across attack methods, i.e., it achieves lower ASR and higher BA. This demonstrates NBA's impressive generality.
Second, we compare the defenses with one another. FT and the three methods based on knowledge distillation (NAD, ARGD, and NBA) achieve better results. Further analysis reveals two important findings. (1) Knowledge distillation schemes are better than FT. This is because FT relies solely on loss functions such as cross-entropy loss for self-learning, while distillation schemes introduce a distillation loss and benefit from the guidance of a teacher network. (2) Among the distillation schemes, NBA achieves the best performance (average ASR of 1.52 and BA of 81.14). The reason is that NAD's distillation loss is based only on attention maps extracted from feature maps, and ARGD's distillation loss additionally takes into account the order relationship between attention maps. Essentially, both are equivalent to the response neural behavior. NBA, however, adopts two different types of distillation loss simultaneously. The first is the learning distillation loss, which uses three types of neural behavior, including response neural behavior, as the form of knowledge and leads to better learning results. The second is the unlearning distillation loss, which actively reduces backdoor neural behavior. As a result, NBA has a significant advantage in reducing ASR and maintaining BA.
In addition, it is worth noting that although ARGD performs better than NAD on average, its BA (80.35) is lower than NAD's (80.47) when defending against attacks such as BadNets. This indicates that the improvement achieved by ARGD is limited. In contrast, NBA outperforms both ARGD and NAD in terms of defense performance, whether against specific attack methods or on average. In terms of the degree of improvement, NBA consistently improves the defense performance (both BA and ASR) by at least 1 percentage point, achieving the best defense performance.
Furthermore, we find that although FT, NAD, and ARGD do not adopt a loss function similar to NBA's unlearning distillation loss, they can still reduce ASR to a certain extent. It is important to stress, however, that for these schemes the reduction of ASR relies on the "catastrophic forgetting" effect (Goodfellow et al. 2014; Kirkpatrick et al. 2017) of the neural network, rather than on actively removing backdoors.

Effectiveness under different defender's capacity
According to the threat model presented in this paper, the defender's capability is primarily determined by the size of the local dataset. Here we investigate the effect of local dataset size on defense performance. Figures 6 and 7 illustrate that most defense schemes perform better as the size of the clean dataset increases. FP, however, is an exception. As can be seen from Fig. 6, its ASR at 5% is similar to that at 20%. This suggests that the size of the local dataset does not significantly affect the defense performance of FP, because more clean samples do not help FP identify damaged neurons more accurately.
With only a very small dataset (1%), NBA does not perform as well as ARGD. However, as the dataset becomes larger, its advantages gradually become apparent. Specifically, with a dataset size of 5%, NBA achieves the best performance among all defenses.
Further observations show that the performance gap between the defenses other than FP narrows as the dataset size increases. This indicates that dataset size can indeed affect defense performance significantly. NBA, however, retains a significant advantage in this case. With a dataset size of 20%, although NBA's BA is similar to that of the other schemes, its ASR is still much lower.
The dataset that a third party provides along with a trained network is usually not very large. We therefore argue that our assumption about the size of the local dataset (i.e., 5% of the training set) is reasonable.

The advantage of neural behavioral representations
Here, we investigate how different neural behavior representations affect defense performance. Feature maps are treated as low-level representations in this study, while Gram matrices are treated as high-level representations.
The results are shown in Figs. 8 and 9. In general, NBA based on Gram matrices outperforms NBA based on feature maps. Although feature maps directly represent the neural behavior inside the model, they contain too much redundant information. By contrast, Gram matrices capture neural behavioral knowledge more effectively through inner products of feature maps, so they achieve better knowledge transfer.

Ablation study of distillation loss
NBA combines three kinds of neural behavior and two forms of loss function to effectively remove the backdoor.
Here we perform two ablation studies to demonstrate that none of them can be omitted.
In Table 3, we show that applying each neural behavior achieves a certain defensive effect. However, when only one or two kinds of neural behavior are used, the defense scheme cannot achieve the best performance.
For the learning distillation loss and the unlearning distillation loss, Table 4 shows the performance of different settings. The results demonstrate that a reasonable overall defense performance can be obtained when only the learning distillation loss is used. The backdoor can be removed more efficiently with the unlearning distillation loss (as indicated by the lower ASR), but at the cost of a lower BA. By reasonably adjusting the coefficient β in Eq. (12), we can achieve a better trade-off between ASR and BA.

Possible settings for unlearning distillation loss
Crafting pseudo samples is the key to the unlearning distillation loss. This section covers the rationale and necessity of the pseudo sample crafting approach.
We replace the pseudo samples with poisoned samples to perform additional experiments. Table 5 shows the experimental results. The first row presents the results of defensive distillation using only the learning distillation loss; the corresponding ASR is reduced to 3.2, which indicates that the backdoor has been largely removed. The last two rows of Table 5 display NBA's results using poisoned samples and pseudo samples. Both methods further reduce the ASR (to 1.38 and 1.52, respectively), indicating that introducing the unlearning distillation loss can indeed effectively remove the backdoor. It should be noted that there is a diminishing marginal effect in removing the backdoor: there is not much difference between using poisoned samples and using pseudo samples in the unlearning distillation loss. Several defense methods are capable of accurately reverse-engineering an approximate poisoned sample (Qiao et al. 2019; Wang et al. 2019; Tao et al. 2022), but this is unnecessary for our approach, given that even real poisoned samples do not provide significantly better performance. In addition, these schemes require high computational overhead or attack details such as trigger size (Wang et al. 2019) when crafting potential poisoned samples.

Conclusion
This paper presents NBA, a novel defensive distillation mechanism for backdoor removal. We optimize the knowledge distillation process in terms of both knowledge form and training samples to make it better suited to the defense scenario. In terms of knowledge form, we extract and align three kinds of neural behavior of networks to achieve efficient knowledge transfer. In terms of training samples, we construct pseudo samples to further eliminate the backdoor from the backdoored network.
To the best of our knowledge, NBA is the first active defensive distillation mechanism, and it has competitive advantages in backdoor removal. NBA's highly effective defense performance and realistic threat model make it an attractive candidate for practical defensive scenarios.

Fig. 1 Example of a backdoor attack

Fig. 4 Visualization of feature maps extracted from the backdoored network

Fig. 5 Examples of clean samples (top) and poisoned samples (bottom) used in our experiments

Fig. 6 The ASRs of 6 defenses under different sizes of clean data

Table 1 Settings of 6 well-known backdoor attacks

Table 3 Ablation results of neural behavior loss

Table 4 Ablation results of distillation loss

Table 5 Performance comparison of different samples