
DLP: towards active defense against backdoor attacks with decoupled learning process


Deep learning models are well known to be susceptible to backdoor attacks, in which the attacker only needs to provide a tampered dataset into which triggers have been injected. Models trained on the dataset will passively implant the backdoor, and triggers on the input can mislead the models during testing. Our study shows that a model exhibits different learning behaviors on the clean and poisoned subsets during training. Based on this observation, we propose a general training pipeline to defend against backdoor attacks actively. Benign models can be trained from the unreliable dataset by decoupling the learning process into three stages, i.e., supervised learning, active unlearning, and active semi-supervised fine-tuning. The effectiveness of our approach has been shown in numerous experiments across various backdoor attacks and datasets.


In recent years, deep learning technology has been widely applied across a number of domains, dramatically improving the efficiency of tasks such as object recognition (Eitel et al. 2015; Wang et al. 2015), semantic segmentation (Lateef and Ruichek 2019; Garcia-Garcia et al. 2018), speech recognition (Deng et al. 2013; Zhang et al. 2018), and machine translation (Costa-jussà and Escolano 2016; Vaswani et al. 2018). At the same time, the security of deep learning has also received attention.

Backdoor attack (Gao et al. 2020) has recently been proposed as a new attack paradigm. The attacker (or a malicious third party provider) can launch backdoor attack by providing a dataset injected with triggers. When the user trains a model directly on the dataset, the model will passively be implanted with a backdoor. In the testing phase, the model will predict a sample as a specific class if the trigger is present. Otherwise, it behaves normally. Since deep learning models do not behave differently without the trigger, the backdoor attack is very stealthy, which poses a threat to the practical application of deep learning. Defending against backdoor attacks effectively is an urgent issue.

A large amount of data is required to train deep learning models. However, because data collection and processing are opaque, users have to fully trust datasets provided by others or collected from the Internet. If attackers inject triggers into the dataset beforehand, users can only take passive defensive measures for mitigation. Even though data-based defenses (Chen et al. 2019; Tran et al. 2018; Zeng et al. 2021) can detect poisoned samples, they cannot eliminate the threat once the backdoor has been implanted into the model. There are also model-based defenses that can detect (Fields et al. 2021; Huster and Ekwedike 2021; Sikka et al. 2020) or mitigate (Li et al. 2021; Yoshida and Fujino 2020; Liu et al. 2018) backdoor threats. However, detection-based methods can only discard the model after a backdoor is found, while mitigation-based methods require further resources to repair it. These passive defenses mitigate the threat of the backdoor attack, but they all share a common limitation: they cannot be applied until the model has been trained. By the time a backdoor attack is detected and mitigated, significant training resources have already been wasted.

This paper proposes an active defense mechanism with a decoupled learning process (DLP) to address this limitation. DLP decouples the standard learning process into three stages: supervised learning, active unlearning, and active semi-supervised fine-tuning. With the decoupled learning process, we can build a benign model from an untrustworthy dataset while balancing the effectiveness of backdoor removal and the usability of the model.

We argue that the model essentially learns both the backdoor task on the poisoned subset as well as the main task on the clean subset when it is trained on the tampered training set. We observe that in the early stages of standard learning, the model learns the backdoor task much better than it learns the main task. In response to this observation, we decoupled the standard learning process and set the first stage to supervised learning. Following this stage, clean and poisoned samples can be filtered using active learning (Ren et al. 2022). We then set up the active unlearning stage so that filtered samples are used as well as a gradient ascent algorithm in order to remove the backdoor. Furthermore, with the active semi-supervised fine-tuning phase, the usability of the model is further enhanced by combining the filtered clean samples with the semi-supervised approach. After three stages of learning, we can train a benign model on the tampered dataset. Since the decoupled learning process makes no assumption about the attack strategies for crafting poisoned samples, our proposed DLP is generic and applies to various attack methods.

Our main contributions are summarized as follows:

  • We reveal that the model shows significant differences in learning behavior between clean and poisoned samples when the training of a backdoored model on the tampered dataset is treated as the joint learning of the main task and the backdoor task. In light of this observation, we develop an active learning-based strategy for filtering the two types of samples accurately.

  • We propose a new defense against backdoor attacks called DLP, which actively trains benign models from tampered datasets. With DLP, the standard learning process can be decoupled into three stages, and a better usability-effectiveness trade-off can be achieved. The DLP is a simple yet powerful backdoor defense approach.

  • We evaluate DLP against five well-known backdoor attacks. Extensive experiments consistently show that DLP delivers state-of-the-art defensive performance.

The organization of the paper is as follows. Section "Related works" introduces the related work, including backdoor attacks, backdoor defenses, active learning, and semi-supervised learning. Section "Threat model" describes the threat model. Section "Characteristic of backdoor learning" illustrates our observations on the learning process of backdoored models. Section "Method" elaborates on the proposed DLP and its crucial components. Section "Experiments" evaluates the proposed defense. Section "Conclusion" concludes this paper.

Related works

Backdoor attacks

Deep learning models are subject to backdoor attack, which is a novel attack paradigm. It is possible for an attacker to embed a backdoor into a model by tampering with the training dataset. As a result of training on the tampered dataset, the backdoor is automatically inserted into the model. We refer to the training process in this scenario as backdoor learning. During the test phase, backdoor in the model will be activated by triggers, leading the model to make wrong predictions.

The BadNets proposed by Gu et al. (2017) demonstrated the feasibility of the backdoor attack for the first time, and subsequent work improved the attack based on its results. Liu et al. (2018) achieved lower poisoning rates by designing triggers that activate specific neurons more efficiently. Chen et al. (2017) proposed a blended injection method to improve the stealth of poisoned samples. Several other works aim to improve the stealth of triggers by choosing particular patterns as triggers, such as reflections (Liu et al. 2020), raindrops (Zhao et al. 2022), and adversarial perturbations (Zhang et al. 2021). Some works explore new ways of backdoor learning, such as training the trigger generator and backdoored model simultaneously to make the model automatically learn the trigger patterns (Cheng et al. 2021; Nguyen and Tran 2020; Salem et al. 2022). Backdoor attacks can be further divided into poison-label attacks and clean-label attacks based on whether the labels of the poisoned samples are modified. The latter do not require modifying the labels of the poisoned samples, making them harder to detect.

Backdoor defenses

Many approaches have been proposed to defend against backdoor attacks. Since data and models are the two key ingredients of deep learning, backdoor defense schemes can be divided into data-based defenses and model-based defenses. In data-based defenses, the defender can detect anomalous samples by neuron activations (Chen et al. 2019), spectral signatures (Tran et al. 2018), frequency features (Zeng et al. 2021), or training losses (Huang et al. 2022). There are also methods that reconstruct samples (Kwon 2021; Doan Bao et al. 2020) to undermine the validity of triggers and thereby prevent backdoors from being activated. Model-based defenses employ discriminative features to detect backdoors, including signatures of model weights (Fields et al. 2021), transferability of adversarial perturbations (Huster and Ekwedike 2021), and counterfactual attribution (Sikka et al. 2020). Additionally, defenders can repair backdoored models through knowledge distillation (Li et al. 2021; Yoshida and Fujino 2020), pruning (Liu et al. 2018), or fine-tuning (Mu et al. 2022).

Li et al. (2021) first investigated how to train a benign model on the tampered dataset. After filtering the poisoned samples using training losses, they first improve the model performance by supervised learning and then remove the backdoor using the filtered samples. Although this is a promising approach, it has two limitations: firstly, the filtering method is insufficiently accurate, and secondly, the model’s performance on the main task is further reduced when the backdoor is removed at the end of the process. Inspired by Li et al. (2021), we propose DLP, which can enhance defensive effects and compensate for the above weaknesses. We discuss the design details in Sect. "Method".

Active learning

The critical problem to be solved in the domain of active learning is how to maximize the model’s performance by labeling a minimum number of samples. A crucial component of active learning is designing query strategies to identify samples that are difficult to predict and then handing them over to experts for labeling. There are two main principles behind classical query strategies: uncertainty and diversity. The uncertainty principle aims to identify samples for which the current model is least capable of predicting (Balcan et al. 2007; Holub et al. 2008). The diversity principle aims at finding samples that differ so that the information provided by the queried samples is comprehensive (Yang et al. 2015; Brinker 2003).

One of the main challenges in DLP is filtering out poisoned samples and clean samples from the training set. We argue that there are similarities between this and active learning because both need to design query strategies to find out the needed samples. It is important to note that the samples in our scenario are labeled, but due to the backdoor attacks, they are viewed as unlabeled samples. As a result, we need to design query strategies to filter some of them out and trust their label information. We argue that the most predictable samples are poisoned samples, and the most challenging samples are clean samples, so it makes sense to introduce active learning to solve the challenge.

Semi-supervised learning

Large amounts of high-quality labeled samples are necessary to train deep learning models well. Labeled samples, however, are often difficult and expensive to obtain in many real-world scenarios. In contrast, unlabeled samples are usually readily available and large. Semi-supervised learning aims to improve a model’s performance by using many unlabeled data and a small number of labeled samples. There are several semi-supervised learning methods, including pseudo-labeling methods (Lee 2013; Blum and Mitchell 1998), consistency regularization methods (Sajjadi et al. 2016; Miyato et al. 2019), graph-based methods (Iscen et al. 2019; Chen et al. 2020), and hybrid methods (Berthelot et al. 2019; Sohn et al. 2020).

In this paper, we introduce semi-supervised learning to address another challenge in DLP, namely how to further improve model performance given only a small number of high-confidence labeled samples.

Threat model

We follow the standard threat model of backdoor attacks, in which the attacker controls the dataset and the user trains the model on an unreliable dataset (specifically, one provided by the attacker). After training, the model exhibits targeted misclassification when presented with samples containing the trigger.

Attacker’s capacity and goals

The attacker’s goal is to provide a tampered dataset such that any model trained on it is passively implanted with a backdoor. In this work, we assume the attacker has the maximum capability. With full access to the training dataset, the attacker can inject triggers into the dataset using any state-of-the-art backdoor attack strategy.

Defender’s capacity and goals

It is assumed that the defender can only control the model’s training process and does not know the attack pattern, target type, or other details of the backdoor attack. The assumption is consistent with a typical deep learning training scenario.

The defender aims to train benign models from the tampered dataset. For a suitable defense mechanism to accomplish this goal, the following properties are necessary:

  • Comprehensive: the defense can cover various backdoor attacks, regardless of trigger designs, injection methods, poisoning rate, and other factors that might be involved.

  • Usability-preserving: the defense has a negligible impact on the model’s performance on the main task. In particular, if the dataset has not been tampered with, the model trained with the defense method should perform similarly to the model trained using the standard method.


This paper focuses on a typical class of backdoor attacks, namely single-target attacks (also known as all-to-one attacks), which are more efficient. In a single-target attack, any input with a trigger is recognized by the model as the same label, i.e., the target label is unique. Given the original training dataset \(D_o=\{x_i,y_i\}_{i=1}^n\), poisoning rate \(\alpha\), and target class \(y_t\), the original training dataset \(D_o\) can be divided into target subset \(D_t=\{x_j,y_j\}_{j=1}^m\) and clean subset \(D_c=\{x_q,y_q\}_{q=m+1}^n\), where \(m=n \times \alpha\). The attacker injects triggers into all samples from \(D_t\) according to the specified attack strategy and obtains the poisoned subset \(D_p=\{x_j^{'},y_t\}_{j=1}^m\), where \(x_j^{'}=x_j \otimes \Delta\), \(\otimes\) refers to the injection method, and \(\Delta\) refers to the trigger. We call this process dataset poisoning. The attacker then releases the tampered dataset \(D_m(D_m=D_p \cup D_c)\).
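The dataset poisoning step can be illustrated with a minimal sketch. It is not any specific attack from the literature: `stamp_trigger` is a hypothetical injection method that overwrites a corner patch, images are plain nested lists, and the poisoned indices are drawn at random rather than taken as the first \(m\) samples. Labels are overwritten with the target class, so the sketch corresponds to a poison-label setting.

```python
import random

def stamp_trigger(image, patch_value=1.0, patch_size=2):
    """Hypothetical injection method: overwrite the bottom-right corner with a fixed patch."""
    poisoned = [row[:] for row in image]  # copy so the original image is untouched
    for r in range(-patch_size, 0):
        for c in range(-patch_size, 0):
            poisoned[r][c] = patch_value
    return poisoned

def poison_dataset(dataset, alpha, target_label, seed=0):
    """Build D_m = D_p ∪ D_c from a list of (image, label) pairs.

    Assumption: the m = n * alpha poisoned samples are chosen uniformly at random.
    Returns the tampered dataset and the set of poisoned indices.
    """
    n = len(dataset)
    m = round(n * alpha)
    rng = random.Random(seed)
    poisoned_indices = set(rng.sample(range(n), m))
    tampered = []
    for i, (x, y) in enumerate(dataset):
        if i in poisoned_indices:
            tampered.append((stamp_trigger(x), target_label))  # sample moves to D_p
        else:
            tampered.append((x, y))  # sample stays in D_c
    return tampered, poisoned_indices
```

The defender only ever sees the returned tampered list; the index set is kept here purely so that filtering accuracy could be checked in an experiment.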

Once the user has trained the model with the dataset \(D_m\), the model will be implanted with a backdoor. In the test phase, the model will behave normally on clean samples while predicting poisoned samples as target class:

$$\begin{aligned} \left\{ \begin{array}{l} y_i=f_w(x_i), \\ y_t=f_w(x_i^{'}), \end{array} \right. \end{aligned}$$

where \(f_w\) is the backdoored model, \(w\) denotes the model parameters, \(y_i\) is the ground-truth label of \(x_i\), and \(x_i^{'}\) is the sample injected with the trigger.

That is the whole process of the backdoor attack. The defender aims to train a benign model \(f_{w^{*}}\) from \(D_m\). A benign model gives consistent predictions regardless of whether the trigger is present in a sample:

$$\begin{aligned} y_i=f_{w^{*}}(x_i)=f_{w^{*}}(x_i^{'}) \end{aligned}$$

where \(w^{*}\) refers to the parameters of the benign model. Our strong threat model ensures the practical usage of DLP in real-world settings.

Characteristic of backdoor learning

Fig. 1 Training accuracy of clean subset and poisoned subset

Fig. 2 The main pipeline of DLP

The standard learning procedure of a model on the tampered dataset \(D_m\) is as follows:

$$\begin{aligned} \mathop {min}\limits _{w}\frac{1}{n}\sum \limits _{(x,y)\in D_m}L_{1}(f_{w}(x),y) \end{aligned}$$

where \(L_1\) is the supervised learning loss function (e.g., the cross-entropy loss). Eq. 3 can be optimized by backpropagation with stochastic gradient descent. The training process on \(D_m\) can be further decomposed into training on the clean subset \(D_c\) and training on the poisoned subset \(D_p\):

$$\begin{aligned} \mathop {min}\limits _{w}\left(\frac{1}{m}\sum \limits _{(x,y)\in D_p}L_{1}(f_{w}(x),y)+ \frac{1}{n-m}\sum \limits _{(x,y)\in D_c}L_{1}(f_{w}(x),y)\right) \end{aligned}$$
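As a side note on how the two objectives relate, the sum in Eq. 3 can be split exactly over the two subsets, using only \(D_m=D_p \cup D_c\), \(|D_p|=m\) and \(|D_c|=n-m\):

$$\begin{aligned} \frac{1}{n}\sum \limits _{(x,y)\in D_m}L_{1}(f_{w}(x),y)=\frac{m}{n}\cdot \frac{1}{m}\sum \limits _{(x,y)\in D_p}L_{1}(f_{w}(x),y)+\frac{n-m}{n}\cdot \frac{1}{n-m}\sum \limits _{(x,y)\in D_c}L_{1}(f_{w}(x),y) \end{aligned}$$

With \(m/n=\alpha\), the per-subset averages are weighted by \(\alpha\) and \(1-\alpha\). Eq. 4 can therefore be read as the same two per-subset averages with these weights dropped, so that the backdoor task and the main task appear as two separate learning objectives.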

We train the WideResNet-16-1 (Zagoruyko and Komodakis 2016) on the CIFAR10 (Krizhevsky and Hinton 2009) with representative backdoor attacks, BadNets (Gu et al. 2017) and SIG (Barni et al. 2019) respectively. These two attacks are typical strategies of poison-label attack and clean-label attack. We set the poisoning rate \(\alpha\) to 10% and the batch size to 128.

Figure 1 shows the changes in training accuracy for poisoned and clean samples during backdoor learning. In the early stage of training, poisoned samples demonstrate greater accuracy than clean samples. This phenomenon suggests that the model learns the backdoor task faster than the main task. To master the backdoor task, the model only needs to learn the mapping of triggers to target classes. In order to enhance the effectiveness of backdoor attacks, attackers tend to design triggers into easily learnable patterns, such as a fixed simple image (Gu et al. 2017) or an optimized set of pixels (Liu et al. 2018). As a result, the backdoor task can be learned faster when both tasks are learned simultaneously. Our observations are supported by Arpit et al. (2017).

A model’s predictions reflect the differences in learning behavior described above. The model predicts poisoned samples more confidently, while its confidence on clean samples is lower. The active learning strategy we develop in DLP exploits this difference to filter out both types of samples.


Method

Our proposed DLP is described in this section. We start with an overview of DLP and its pipeline, followed by its critical components.


We make the following improvements to compensate for the limitations of Li et al. (2021). Firstly, we introduce active learning and develop a method based on predictive entropy to filter out desired samples with higher confidence. Secondly, we filter out clean samples and poisoned samples separately for subsequent use. Thirdly, we adjust the order of backdoor removal and model fine-tuning. We first remove the backdoor using a filtered poisoned subset and then fine-tune the model using a filtered clean subset.

It is important to note that the DLP uses semi-supervised fine-tuning instead of supervised fine-tuning. Due to the unknown details of the attack, such as the poisoning rate, we cannot completely filter out poisoned samples. In the case of supervised fine-tuning, a backdoor will once again be implanted in the model. By contrast, we can obtain benign samples with high accuracy with DLP, allowing us to improve model performance by semi-supervised fine-tuning without introducing a backdoor. With the above strategy design, DLP can achieve the best tradeoff between attack success rate and clean accuracy.

Specifically, DLP involves decoupling the model’s learning process into three stages. The first stage is supervised learning (SL), achieved by performing initial standard training on the whole tampered dataset. At the end of this stage, the model will have overlearned the backdoor task but not fully learned the main task. With this differential behavior, DLP can filter out poisoned samples and clean samples with a high level of confidence. The second stage is active unlearning (AU), which achieves backdoor removal by maximizing, on the filtered poisoned subset, the same loss function that was minimized in the previous stage. Active semi-supervised fine-tuning (ASSFT) is the final stage, in which the filtered clean subset is viewed as a labeled dataset to improve model performance on the main task.

Figure 2 illustrates this pipeline. Section "Entropy-based filtering method" details the method of filtering samples, and Sect. "Active unlearning" describes our active unlearning method for backdoor elimination. Section "Active semi-supervised fine-tuning" discusses the active semi-supervised fine-tuning for improving the model’s performance on the main task.
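The ordering of the three stages can be summarized in a short orchestration sketch. This is only a skeleton under stated assumptions: the `stages` dictionary and its keys (`supervised`, `filter`, `unlearn`, `semi_supervised`) are hypothetical callables supplied by the caller, not functions from any library, and the default values mirror the hyper-parameters \(E_1\), \(E_2\) and \(\gamma\) reported in the experimental settings.

```python
def run_dlp(model, tampered_data, stages, e1=10, e2=20, gamma=0.01):
    """Run SL -> filtering -> AU -> ASSFT on a list of (x, y) pairs.

    `stages` maps hypothetical stage names to user-supplied callables.
    """
    # Stage 1 (SL): standard supervised training on the whole tampered dataset.
    model = stages["supervised"](model, tampered_data, epochs=e1)

    # Filtering: the callable returns index lists for D_p' and D_c'.
    poisoned_idx, clean_idx = stages["filter"](model, tampered_data, gamma)
    poisoned_subset = [tampered_data[i] for i in poisoned_idx]

    # Stage 2 (AU): maximize the supervised loss on the filtered poisoned subset.
    model = stages["unlearn"](model, poisoned_subset, epochs=e2)

    # Stage 3 (ASSFT): keep labels only for D_c'; every other sample is unlabeled.
    clean_set = set(clean_idx)
    labeled = [tampered_data[i] for i in clean_idx]
    unlabeled = [x for i, (x, _) in enumerate(tampered_data) if i not in clean_set]
    model = stages["semi_supervised"](model, labeled, unlabeled)
    return model
```

Passing the stages as callables keeps the skeleton independent of any particular model or training framework.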

Entropy-based filtering method

As discussed in Sect. "Characteristic of backdoor learning", the model obtained from initial supervised learning has difficulty predicting clean samples while easily predicting poisoned samples. We argue that in this scenario, the problem of accurately filtering the two types of samples is similar to what active learning aims to solve: finding the samples that are most difficult (and easiest) for the model to predict.

We use Shannon entropy to represent prediction difficulty. The intuition behind this approach is that the model has not yet learned the main task well, so its predictions for clean samples are uncertain. The predicted probabilities of the classes are therefore almost the same, resulting in a higher entropy value. In contrast, poisoned samples have a lower entropy. The entropy of sample x can be expressed as:

$$\begin{aligned} H(x)=-\sum \limits _{i=1}^{C}y^{i} \times \log _{2}y^{i} \end{aligned}$$

where \(y^{i}\) is the probability of sample x belonging to class i, and C is the total number of classes.
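As a small illustration of the entropy score, the following sketch computes it from a list of predicted class probabilities. The function name is chosen here for illustration, and terms with zero probability are skipped, following the usual convention that \(0 \times \log 0 = 0\).

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a predicted class distribution (a list of probabilities)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A near-uniform prediction (uncertain, as expected for clean samples early in training)
uncertain = shannon_entropy([0.1] * 10)
# A peaked prediction (confident, as expected for poisoned samples)
confident = shannon_entropy([0.991] + [0.001] * 9)
```

For ten classes the score ranges from 0 for a one-hot prediction up to \(\log _2 10\) for a uniform one, so `uncertain` is much larger than `confident`.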

We apply Eq. 5 to all samples to calculate their entropy and then sort the samples in ascending order of entropy. Based on the sorting results, we filter samples according to a given filtering rate \(\gamma\). The first \(n \times \gamma\) samples (lowest entropy) form the filtered poisoned subset \(D_p^{'}\), and the last \(n \times \gamma\) samples (highest entropy), selected from each class, form the filtered clean subset \(D_c^{'}\). Selecting clean samples from every class is compatible with both the diversity principle of active learning and the requirements of semi-supervised learning.
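The filtering rule can be sketched as follows. The text leaves the per-class quota open, so this sketch makes an explicit assumption: the budget of \(n \times \gamma\) clean samples is split evenly across the classes present in the (possibly tampered) labels. The function and variable names are illustrative only.

```python
from collections import defaultdict

def filter_by_entropy(entropies, labels, gamma):
    """Return (poisoned_idx, clean_idx) from per-sample entropies and given labels.

    Assumption: the clean budget k = n * gamma is divided evenly across classes.
    """
    n = len(entropies)
    k = max(1, round(n * gamma))
    order = sorted(range(n), key=lambda i: entropies[i])  # ascending entropy

    # Lowest-entropy samples overall: candidates for the poisoned subset D_p'.
    poisoned_idx = order[:k]

    # Group indices by label, preserving ascending-entropy order within each class.
    by_class = defaultdict(list)
    for i in order:
        by_class[labels[i]].append(i)

    # Highest-entropy samples of each class: candidates for the clean subset D_c'.
    per_class = max(1, k // len(by_class))
    clean_idx = []
    for cls in sorted(by_class):
        clean_idx.extend(by_class[cls][-per_class:])
    return poisoned_idx, clean_idx
```

With very small classes the two index lists could overlap; a practical implementation would probably want to guard against that case.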

Active unlearning

The model can learn the critical features required to perform the corresponding task by minimizing a predefined loss function through supervised learning. In the scenario of backdoor attack, the model learns the features that are required for the backdoor task (called backdoor features) and those that are needed for the main task (called clean features) by minimizing the loss function under the poisoned subset and clean subset, respectively.

Removing the backdoor from the model is equivalent to having the model unlearn the backdoor features. Learning and unlearning are mutually antagonistic processes, so we can unlearn the backdoor features by maximizing the loss function on \(D_p^{'}\). Here is the optimization objective for this stage:

$$\begin{aligned} \mathop {max}\limits _{w^{'}}\frac{1}{m}\sum \limits _{(x,y)\in D_p^{'}}L_{1}(f_{w^{'}}(x),y), \end{aligned}$$

where \(w^{'}\) indicates the weights of the model obtained after the initial supervised learning.
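To make the gradient ascent idea concrete, the sketch below applies it to a deliberately tiny stand-in: a binary logistic regression with a cross-entropy loss, written in plain Python. It is not the WideResNet setup used in the experiments, and the function names, learning rate and step count are illustrative assumptions. The only point it demonstrates is the sign flip: parameters move along the gradient instead of against it, so the loss on the filtered poisoned subset grows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, data):
    """Mean binary cross-entropy of the logistic model p = sigmoid(w . x)."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def bce_grad(w, data):
    """Gradient of the mean binary cross-entropy with respect to w."""
    grad = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return [g / len(data) for g in grad]

def unlearn(w, filtered_poisoned, lr=0.1, steps=20):
    """Gradient ascent: step along +gradient to maximize the loss on D_p'."""
    w = list(w)  # work on a copy of the weights from supervised learning
    for _ in range(steps):
        g = bce_grad(w, filtered_poisoned)
        w = [wi + lr * gi for wi, gi in zip(w, g)]
    return w
```

In a deep learning framework the same effect is commonly obtained by minimizing the negated loss with the usual optimizer.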

Fig. 3 Clean sample and corresponding poisoned samples generated by different attacks

Active semi-supervised fine-tuning

The model does not fully learn the main task after the initial supervised training. The active unlearning process slightly forgets the clean features and further degrades the model’s performance on the main task. Due to these two reasons, fine-tuning is used to improve the model’s performance.

Before semi-supervised fine-tuning, we remove the labels from all samples in dataset \(D_r^{'}(D_r^{'}=D_o-D_c^{'})\) to obtain dataset \(D_{ur}^{'}\). We then perform semi-supervised learning on dataset \(D_o^{'}(D_o^{'}=D_{ur}^{'} \cup D_c^{'})\). Formally, semi-supervised learning solves the following optimization problem:

$$\begin{aligned}&\mathop {min}\limits _{w^{*}}\frac{1}{n \times \gamma }\sum \limits _{(x,y)\in D_c^{'}}L_{1}(f_{w^{*}}(x),y)\\&\quad +\alpha \frac{1}{n-n \times \gamma }\sum \limits _{x\in D_{ur}^{'}}L_{2}(x,w^{*})\\&\quad +\beta \frac{1}{n}\sum \limits _{x\in D_{o}^{'}} \mathcal {R}(x,w^{*}) \end{aligned}$$

where \(L_2\) is the semi-supervised loss function and \(\mathcal {R}\) is the regularization term. The weights \(\alpha\) and \(\beta\) control the trade-off between the terms; here \(\alpha\) is a loss weight and is distinct from the poisoning rate. In particular, we use FixMatch (Sohn et al. 2020) to perform semi-supervised learning in DLP.

It is important to note that semi-supervised learning does not re-implant the backdoor in the model. Two reasons account for this: first, strong data augmentation is applied to the unlabeled data, ultimately invalidating the trigger; second, poisoned samples lack labels, so the model cannot learn the association between triggers and target labels.
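The second reason can be seen in the shape of a FixMatch-style unlabeled loss, sketched below in plain Python. The sketch takes already-computed class probabilities for a weakly and a strongly augmented view of each unlabeled sample; the function name and the threshold value are illustrative assumptions, and the augmentations themselves are outside its scope. The original (possibly tampered) label never enters the computation: the target is a pseudo-label taken from the model's own confident prediction.

```python
import math

def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Masked cross-entropy between weak-view pseudo-labels and strong-view predictions.

    weak_probs / strong_probs: lists of per-sample class-probability lists.
    Samples whose weak-view confidence is below `threshold` contribute zero.
    """
    total = 0.0
    for pw, ps in zip(weak_probs, strong_probs):
        confidence = max(pw)
        if confidence >= threshold:
            pseudo_label = pw.index(confidence)  # hard pseudo-label from the weak view
            total -= math.log(ps[pseudo_label] + 1e-12)
    return total / len(weak_probs)  # averaged over the whole unlabeled batch
```

A sample only contributes when the model is already confident about it, and the loss is evaluated on the strongly augmented view, which is where a trigger pattern is most likely to be distorted.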


Experiments

Experimental settings

Backdoor attacks

We consider five state-of-the-art backdoor attacks, including poison-label backdoor attacks, specifically BadNets (Gu et al. 2017), TrojanNN (Liu et al. 2018) and Blended (Chen et al. 2017), and clean-label backdoor attacks, in particular LCA (Turner et al. 2019) and SIG (Barni et al. 2019). The above attack methods are very representative, including using heuristic triggers, optimized generated triggers, improved trigger injection method, and invisible triggers by introducing adversarial perturbation and sinusoidal signal. An example of poisoned samples generated by different attacks is shown in Fig. 3.

Backdoor defenses

Five state-of-the-art backdoor defenses are considered as baselines, including FP (Liu et al. 2018), MCR (Zhao et al. 2020), NAD (Li et al. 2021), ABL (Li et al. 2021) and ANP (Wu and Wang 2021). They cover the mainstream backdoor removal directions: neuron-pruning based defense, mode connectivity based defense, and knowledge distillation based defense.


We evaluate the performance of all defenses against attacks in two common benchmark datasets, i.e., CIFAR10 (Krizhevsky and Hinton 2009), and ImageNet subset (Deng et al. 2009). In all the experiments, WideResNet-16-1 (Zagoruyko and Komodakis 2016) serves as the base model.

Other settings

We adopt the default configurations described in the original papers to implement the attacks and defenses mentioned above. The defense methods have access to a random subset of 5% of the clean testing set if necessary. Since the LCA attack on the ImageNet subset cannot be successfully reproduced following the original paper, we omit the corresponding evaluation results. For our proposed DLP, we take FixMatch (Sohn et al. 2020) as the semi-supervised method. In addition, we adopt an SGD optimizer with a momentum of 0.9 and set the batch size to 128 and the learning rate to 0.01 by default. Specifically, we set the crucial hyper-parameters as follows in all experiments: supervised learning epochs \(E_1=10\), active unlearning epochs \(E_2=20\), and filtering rate \(\gamma =1\%\).

Evaluation metrics

As is customary in the backdoor defense literature, we compute two metrics to evaluate the performance of a defense: 1) attack success rate (ASR): the accuracy on the poisoned dataset, i.e., the fraction of triggered samples classified as the target class, and 2) clean accuracy (CA): the accuracy on the clean dataset.

The above two metrics can measure the usability-effectiveness trade-off for backdoor defense. It is crucial for the defense mechanism to have a low ASR and a high CA, indicating that it can effectively resist backdoor attacks without adversely impacting the model’s performance on the main task.
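The two metrics can be written down as a short sketch. Here `predict` stands for any callable that maps an input to a predicted label, which is an assumption made for illustration, and `triggered_samples` is assumed to contain only trigger-carrying inputs whose true class differs from the target, a common convention that the text does not spell out.

```python
def clean_accuracy(predict, clean_samples):
    """CA (%): share of clean (x, y) pairs whose prediction matches the true label."""
    correct = sum(1 for x, y in clean_samples if predict(x) == y)
    return 100.0 * correct / len(clean_samples)

def attack_success_rate(predict, triggered_samples, target_label):
    """ASR (%): share of triggered inputs that are classified as the target label."""
    hits = sum(1 for x in triggered_samples if predict(x) == target_label)
    return 100.0 * hits / len(triggered_samples)
```

A defense is doing well when `attack_success_rate` is close to 0 while `clean_accuracy` stays close to that of a model trained on untampered data.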

Experimental results

Defense performance against backdoor attacks

Table 1 The defensive performance(%) of DLP against backdoor attacks on CIFAR10
Table 2 The defensive performance(%) of DLP against backdoor attacks on ImageNet subset

Tables 1 and 2 summarize the effectiveness of the three stages of DLP against backdoor attacks. After the initial supervised learning, the model achieves an average ASR of 99.92%, while the average CA is 73.72%. When active unlearning is applied, the ASR decreases significantly, whereas the CA is only slightly affected. The performance after active unlearning shows that the method is very effective at eliminating backdoors. Finally, semi-supervised fine-tuning improves the CA to over 90%, while the ASR increases only negligibly. This suggests that DLP provides a good trade-off between removing the backdoor and preserving the model’s performance on the main task.

As we observe, the models subjected to SIG and Blended attacks exhibit different results from the other three attacks after stage AU and stage ASSFT. Under SIG and Blended attacks, the CA of the models obtained from stage AU decreases more than under the other three attacks, and the ASR of the models obtained from stage ASSFT is 0. We suspect this is because the poisoned samples produced by these two attacks resemble mixed images, making it hard to differentiate between backdoor features and clean features. Consequently, in stage AU, the model will unlearn both kinds of features, while in stage ASSFT, poisoned samples are more likely to be corrupted by data augmentation.

Defense performance comparison between DLP and baselines

Table 3 Performance(%) comparison between DLP and baselines on CIFAR10
Table 4 Performance(%) comparison between DLP and baselines on ImageNet subset

Tables 3 and 4 show the comparative quantitative results on the CIFAR10 and ImageNet subset, respectively. The best-performing numbers are highlighted in bold. Tables show that DLP can achieve better performance than other state-of-the-art defenses.

Before defense methods are applied, the models have both a high ASR and high CA, illustrating backdoor attacks’ effectiveness. To mitigate backdoor attacks, we employ different defenses and DLP. Generally, they all work to some extent. As FP prunes neurons that are also important for the main task, higher CA must be maintained at the expense of higher ASR, so FP cannot effectively defend. In contrast, MCR, NAD, and ABL get better defense performance. However, DLP is better than them. Take the statistics on CIFAR10 as an example. BadNets’s ASR can only be decreased by 95.44%, 98.01%, 95.87% with baseline methods NAD, MCR, and ABL respectively, a result worse than DLP’s 99.76%.

Furthermore, if we consider the trade-off of the defenses, DLP offers even more advantages. With DLP, CA can be maintained near the benign model level, while ASR can be reduced to almost zero.

We also compared the model’s performance before and after the application of DLP under the non-attack scenario, i.e., when the training dataset has not been tampered with, to further validate the impact of DLP on the model. For reference, a model is trained on a non-tampered training set by a standard training method. The results show no significant difference in CA between the model obtained by DLP and the model obtained by standard training. As an active defense method, DLP can train a benign model with excellent performance regardless of whether the user is actually under a backdoor attack.

Further understanding of DLP

Trade-off between ASR and CA

Fig. 4 Defense performance under different defense settings

DLP offers a good trade-off between ASR and CA. The model’s ASR is lowered to nearly zero through active unlearning and CA is increased to the maximum through active semi-supervised fine-tuning.

Figure 4 shows the final performance of other possible similar defense settings. The ASR is hardly reduced without active unlearning (in SL-ASSFT), which means that backdoors cannot be eliminated. The CA cannot be improved without ASSFT (in SL-AU), meaning that usability is sacrificed for the sake of effectiveness. AU and ASSFT can be arranged in reverse order and still provide an effective defense (in SL-ASSFT-AU), but the best defensive performance cannot be achieved, because AU causes the model to slightly unlearn the clean features, resulting in a further decrease in CA after its peak.

By reviewing the above analysis, we can understand the rationality and effectiveness of the design of DLP.

Table 5 Defense performance under different poisoning rate and filtering rate settings on CIFAR10
Table 6 Defense performance under different model architecture settings on CIFAR10

Effectiveness of filtering method

Fig. 5 Filtering performance with different methods

In this section, we compare the filtering accuracy of different methods, namely Activation Clustering (AC) (Chen et al. 2019), Spectral Signature (SS) (Tran et al. 2018), the training loss-based method (TLM), and ours. AC and SS are existing defenses for detecting poisoned samples, and TLM is a classical active learning method. Figure 5 shows that our method has the highest accuracy across all attacks. The accuracy of AC and SS drops considerably on complex triggers, since these attacks produce confusing feature representations. The filtering methods based on active learning are better than AC and SS, but TLM performs worse than ours. A reasonable explanation is that our method takes into account the model's predictions for a sample over all classes, which is more comprehensive than TLM.
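
To make a loss-based baseline concrete, here is a minimal sketch under two stated assumptions: that the samples flagged as poisoned are those with the lowest per-sample training loss (the observation used by ABL-style isolation), and that filtering accuracy means the fraction of flagged samples that are truly poisoned. The function names and toy numbers are illustrative. This is not the authors' filtering method, whose score is not reproduced here.

```python
import math


def cross_entropy(probs, label):
    """Per-sample cross-entropy loss from a softmax probability vector."""
    return -math.log(max(probs[label], 1e-12))


def filter_lowest_loss(prob_rows, labels, gamma):
    """Return the indices of the gamma fraction of samples with the lowest loss."""
    losses = [cross_entropy(p, y) for p, y in zip(prob_rows, labels)]
    k = max(1, int(round(gamma * len(labels))))
    order = sorted(range(len(labels)), key=lambda i: losses[i])
    return order[:k]


def filtering_accuracy(selected, is_poisoned):
    """Fraction of the selected samples that are actually poisoned."""
    return sum(is_poisoned[i] for i in selected) / len(selected)


# Toy softmax outputs for five training samples over three classes.
prob_rows = [
    [0.98, 0.01, 0.01],  # poisoned, labelled 0, fitted very confidently
    [0.40, 0.50, 0.10],  # clean, labelled 1
    [0.30, 0.30, 0.40],  # clean, labelled 2
    [0.97, 0.02, 0.01],  # poisoned, labelled 0
    [0.60, 0.30, 0.10],  # clean, labelled 0
]
labels = [0, 1, 2, 0, 0]
is_poisoned = [True, False, False, True, False]

selected = filter_lowest_loss(prob_rows, labels, gamma=0.4)
accuracy = filtering_accuracy(selected, is_poisoned)
print(selected, accuracy)
```

A score of this kind uses only the probability assigned to the given label, whereas a method that looks at the predictions over all classes has more information per sample to work with.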

Effectiveness of filtering rate \(\gamma =1\%\)

Considering the current default setting of poisoning rate \(\alpha =10\%\) for mainstream backdoor attacks, we set the filtering rate \(\gamma\) to 1% in the DLP.

We first examine how the attacker's capability, reflected in the poisoning rate, affects the defense performance. Table 5 shows that DLP maintains a CA above 90% regardless of the poisoning rate. As for backdoor removal, DLP still reduces the ASR to 4% even when the poisoning rate reaches 50%. Although DLP cannot completely remove the backdoor in this case, it remains effective in real-world scenarios: backdoor behavior is easier to expose as the poisoning rate increases, so attackers are unlikely to use such a high poisoning rate in practice.

We then examine how the filtering rate affects the defense performance. As indicated in Table 5, the defense performance is not promising when the filtering rate is set to 0.1%. For the CIFAR10 dataset, for instance, only 50 samples are then used in the training process, and the result is poor because of this small number of samples. On the other hand, when the filtering rate is set to 10%, the defense performance obtained by DLP is similar to that obtained with a filtering rate of 1%. This indicates that, beyond this point, the sample size is no longer the primary determinant of defense performance.
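
The sample counts behind these settings follow directly from the size of the CIFAR10 training split (50,000 images). The short sketch below reproduces the arithmetic. The helper name `num_filtered` is illustrative.

```python
CIFAR10_TRAIN_SIZE = 50_000  # number of images in the standard CIFAR10 training split


def num_filtered(dataset_size: int, gamma: float) -> int:
    """Number of samples selected at filtering rate gamma."""
    return round(dataset_size * gamma)


# Filtering rates discussed in the text: 0.1%, 1% (default), and 10%.
counts = {gamma: num_filtered(CIFAR10_TRAIN_SIZE, gamma) for gamma in (0.001, 0.01, 0.1)}
for gamma, count in counts.items():
    print(f"filtering rate {gamma}: {count} samples")
```

At 0.1% only 50 samples are available, at the default 1% there are 500, and at 10% there are 5,000, which matches the observation that the smallest setting has too few samples while a tenfold increase beyond the default brings little further benefit.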

Given the above analysis, we believe that setting \(\gamma\) to 1% is reasonable.

Generality of the method

In this section, we experiment with other popular model architectures (VGG16 (Simonyan and Zisserman 2015), ResNet18 (He et al. 2016), InceptionV3 (Szegedy et al. 2016), MobileNetv2 (Sandler et al. 2018), DenseNet121 (Huang et al. 2017)) in place of the default WideResNet-16-1 (Zagoruyko and Komodakis 2016), while keeping all other settings the same. As shown in Table 6, DLP achieves good defense performance under all of these architectures, which demonstrates the generality of our method.


In this paper, we identify the differential behavior of the model during backdoor learning, which leads to significant differences between the model's predictions for poisoned samples and for clean samples. Based on these findings, we propose a new active defense mechanism called DLP. DLP is based on a decoupled learning process and makes no assumptions about attack details such as the poisoning rate and the trigger pattern. With DLP, one can train a benign model on the tampered dataset, preventing a backdoor from being passively implanted in the model during training. Consequently, we can eliminate the harm that a backdoored model might cause before it occurs. Our experiments show that DLP is capable of defending against mainstream backdoor attacks and outperforms state-of-the-art defenses.

DLP is a promising approach, but it remains a work in progress. Its main limitation is that it can hardly mitigate multi-target attacks, because the differences in learning behavior do not appear during training under such attacks. In future work, we will explore the commonality among backdoor attacks to develop a more comprehensive defense.

Availability of data and materials

The datasets used for the experiments are freely available to researchers. The links to the data have been cited as references.


  • Arpit D, Jastrzebski S, Ballas N et al (2017) A closer look at memorization in deep networks. In: Precup D, Whye Teh Y (eds) Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp 233–242. PMLR.

  • Balcan MF, Broder AZ, Zhang T (2007) Margin based active learning. In: Nader HB, Claudio G (eds) Learning theory, 20th annual conference on learning theory, COLT 2007, San Diego, CA, USA, June 13–15, 2007, Proceedings, volume 4539 of Lecture Notes in Computer Science, pp 35–50. Springer.

  • Barni M, Kallas K, Tondi B (2019) A new backdoor attack in CNNs by training set corruption without label poisoning. In: 2019 IEEE international conference on image processing, ICIP 2019, Taipei, Taiwan, September 22–25, 2019, pp 101–105. IEEE.

  • Berthelot D, Carlini N, Goodfellow IJ et al (2019) Mixmatch: a holistic approach to semi-supervised learning. In: Wallach HM, Larochelle H, Beygelzimer A et al (eds) Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, pp 5050–5060.

  • Blum A, Mitchell TM (1998) Combining labeled and unlabeled data with co-training. In: Bartlett PL, Mansour Y (eds) Proceedings of the eleventh annual conference on computational learning theory, COLT 1998, Madison, Wisconsin, USA, July 24–26, 1998, pp 92–100. ACM.

  • Brinker K (2003) Incorporating diversity in active learning with support vector machines. In: Tom F, Nina M (eds) Machine learning, proceedings of the twentieth international conference (ICML 2003), August 21–24, 2003, Washington, DC, USA, pp 59–66. AAAI Press.

  • Cheng S, Liu Y, Ma S, Zhang X (2021) Deep feature space trojan attack of neural networks by controlled detoxification. In: Thirty-Fifth AAAI conference on artificial intelligence, AAAI 2021, thirty-third conference on innovative applications of artificial intelligence, IAAI 2021, The eleventh symposium on educational advances in artificial intelligence, EAAI 2021, Virtual Event, February 2–9, 2021, pp 1148–1156. AAAI Press.

  • Chen X, Liu C, Li B, Lu K, Song D (2017) Targeted backdoor attacks on deep learning systems using data poisoning. CoRR

  • Chen B, Carvalho W, Baracaldo N et al (2019) Detecting backdoor attacks on deep neural networks by activation clustering. In: Huáscar E, SeánÓ, Xiaowei H, José H, Mauricio C-E (eds) Workshop on artificial intelligence safety 2019 co-located with the Thirty-Third AAAI conference on artificial intelligence 2019 (AAAI-19), Honolulu, Hawaii, January 27, 2019, volume 2301 of CEUR Workshop Proceedings.

  • Chen P, Ma T, Qin X, Xu W, Zhou S (2020) Data-efficient semi-supervised learning by reliable edge mining. In: 2020 IEEE/CVF conference on computer vision and pattern recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020, pp 9189–9198. Computer Vision Foundation / IEEE.

  • Costa-jussà MR, Escolano C (2016) Morphology generation for statistical machine translation using deep learning techniques. CoRR, arXiv:abs/1610.02209

  • Deng J, Dong W, Socher R et al (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE computer society conference on computer vision and pattern recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, pp 248–255. IEEE Computer Society.

  • Deng L, Hinton GE, Kingsbury B (2013) New types of deep neural network learning for speech recognition and related applications: an overview. In: IEEE international conference on acoustics, speech and signal processing, ICASSP 2013, Vancouver, BC, Canada, May 26–31, 2013, pp 8599–8603. IEEE.

  • Doan BG, Abbasnejad E, Ranasinghe DC (2020) Februus: input purification defense against trojan attacks on deep neural network systems. In: ACSAC ’20: annual computer security applications conference, virtual event / Austin, TX, USA, 7–11 December, 2020, pp 897–912. ACM.

  • Eitel A, Springenberg JT, Spinello L, Riedmiller MA, Burgard W (2015) Multimodal deep learning for robust RGB-D object recognition. In: 2015 IEEE/RSJ international conference on intelligent robots and systems, IROS 2015, Hamburg, Germany, September 28–October 2, 2015, pp 681–687. IEEE.

  • Fields G, Samragh M, Javaheripi M, Koushanfar F, Javidi T (2021) Trojan signatures in DNN weights. In: IEEE/CVF international conference on computer vision workshops, ICCVW 2021, Montreal, BC, Canada, October 11–17, 2021, pp 12–20. IEEE.

  • Gao Y, Doan BG, Zhang Z et al (2020) Backdoor attacks and countermeasures on deep learning: a comprehensive review. CoRR, vol. abs/2007.10760.

  • Garcia-Garcia A, Orts-Escolano S, Oprea S et al (2018) A survey on deep learning techniques for image and video semantic segmentation. Appl Soft Comput 70:41–65.

  • Gu T, Dolan-Gavitt B, Garg S (2017) Badnets: identifying vulnerabilities in the machine learning model supply chain. CoRR

  • He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pp 770–778. IEEE Computer Society.

  • Holub A, Perona P, Burl MC (2008) Entropy-based active learning for object recognition. In: IEEE conference on computer vision and pattern recognition, CVPR Workshops 2008, Anchorage, AK, USA, 23–28 June, 2008, pp 1–8. IEEE Computer Society.

  • Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: 2017 IEEE conference on computer vision and pattern recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp 2261–2269. IEEE Computer Society.

  • Huang K, Li Y, Wu B, Qin Z, Ren K (2022) Backdoor defense via decoupling the training process. CoRR

  • Huster T, Ekwedike E (2021) TOP: backdoor detection in neural networks via transferability of perturbation. CoRR

  • Iscen A, Tolias G, Avrithis Y, Chum O (2019) Label propagation for deep semi-supervised learning. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp 5070–5079. Computer Vision Foundation / IEEE.

  • Krizhevsky A, Hinton G et al (2009) Learning multiple layers of features from tiny images

  • Kwon H (2021) Defending deep neural networks against backdoor attack by using de-trigger autoencoder. IEEE Access

  • Lateef F, Ruichek Y (2019) Survey on semantic segmentation using deep learning techniques. Neurocomputing 338:321–348.

  • Lee DH et al (2013) Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on challenges in representation learning, ICML, volume 3, p 896

  • Liu K, Dolan-Gavitt B, Garg S (2018) Fine-pruning: Defending against backdooring attacks on deep neural networks. In: Michael B, Thorsten H, Manolis S, Sotiris I (eds) Research in attacks, intrusions, and defenses - 21st international symposium, RAID 2018, Heraklion, Crete, Greece, September 10–12, 2018, Proceedings, volume 11050 of Lecture Notes in Computer Science, pp 273–294. Springer.

  • Liu Y, Ma S, Aafer Y et al (2018) Trojaning attack on neural networks. In: 25th annual network and distributed system security symposium, NDSS 2018, San Diego, California, USA, February 18–21, 2018. The Internet Society.

  • Liu Y, Ma X, Bailey J, Lu F (2020) Reflection backdoor: a natural backdoor attack on deep neural networks. In: Andrea V, Horst B, Thomas B, Jan-Michael F (eds) Computer vision - ECCV 2020 - 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X, volume 12355 of Lecture Notes in Computer Science, pp 182–199. Springer.

  • Li Y, Lyu X, Koren N et al (2021) Neural attention distillation: erasing backdoor triggers from deep neural networks. In: 9th international conference on learning representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021.

  • Li Y, Lyu X, Koren N et al (2021) Anti-backdoor learning: training clean models on poisoned data. In: Marc’Aurelio R, Alina B, Yann ND, Percy L, Jennifer WV (eds) Advances in neural information processing systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6–14, 2021, virtual, pp 14900–14912.

  • Miyato T, Maeda S, Koyama M, Ishii S (2019) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans Pattern Anal Mach Intell 41(8):1979–1993.

  • Mu B, Wang L, Niu Z (2022) Adversarial fine-tuning for backdoor defense: connect adversarial examples to triggered samples. CoRR

  • Nguyen TA, Tran A (2020) Input-aware dynamic backdoor attack. In: Hugo L, Marc’Aurelio R, Raia H, Maria-Florina B, Hsuan-Tien L (eds) Advances in neural information processing systems 33: annual conference on neural information processing systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.

  • Ren P, Xiao Y, Chang X et al (2022) A survey of deep active learning. ACM Comput Surv 54(9):1801–18040

  • Sajjadi M, Javanmardi M, Tasdizen T (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R (eds) Advances in neural information processing systems 29: annual conference on neural information processing systems 2016, December 5–10, 2016, Barcelona, Spain, pp 1163–1171.

  • Salem A, Wen R, Backes M, Ma S, Zhang Y (2022) Dynamic backdoor attacks against machine learning models. In: 7th IEEE European symposium on security and privacy, EuroS&P 2022, Genoa, Italy, June 6–10, 2022, pp 703–718. IEEE.

  • Sandler M, Howard AG, Zhu M, Zhmoginov A, Chen LC (2018) Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. CoRR

  • Sikka K, Sur I, Jha S, Roy A, Divakaran A (2020) Detecting trojaned DNNs using counterfactual attributions. CoRR

  • Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Yoshua B, Yann L (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

  • Sohn K, Berthelot D, Carlini N et al (2020) Fixmatch: simplifying semi-supervised learning with consistency and confidence. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin HT (eds) Advances in neural information processing systems 33: annual conference on neural information processing systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.

  • Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pp 2818–2826. IEEE Computer Society.

  • Tran B, Li J, Madry A (2018) Spectral signatures in backdoor attacks. In: Samy B, Hanna MW, Hugo L et al (eds) Advances in neural information processing systems 31: annual conference on neural information processing systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada, pp 8011–8021.

  • Turner A, Tsipras D, Madry A (2019) Label-consistent backdoor attacks. CoRR

  • Vaswani A, Bengio S, Brevdo E et al (2018) Tensor2tensor for neural machine translation. In: Colin C, Graham N (eds) Proceedings of the 13th conference of the association for machine translation in the Americas, AMTA 2018, Boston, MA, USA, March 17–21, 2018 - Volume 1: Research Papers, pp 193–199. Association for Machine Translation in the Americas.

  • Wang A, Lu J, Cai J, Cham T-J, Wang G (2015) Large-margin multi-modal deep learning for RGB-D object recognition. IEEE Trans Multim 17(11):1887–1898.

  • Wu D, Wang Y (2021) Adversarial neuron pruning purifies backdoored deep models. In: Ranzato M, Beygelzimer A, Dauphin YN, Liang P, Vaughan JW (eds) Advances in neural information processing systems 34: annual conference on neural information processing systems 2021, NeurIPS 2021, December 6–14, 2021, virtual, pp 16913–16925.

  • Yang Y, Ma Z, Nie F, Chang X, Hauptmann AG (2015) Multi-class active learning by uncertainty sampling with diversity maximization. Int J Comput Vis 113(2):113–127

  • Yoshida K, Fujino T (2020) Disabling backdoor and identifying poison data by using knowledge distillation in backdoor attacks on deep neural networks. In: Jay L, Xinming O (eds) AISec@CCS 2020: Proceedings of the 13th ACM workshop on artificial intelligence and security, virtual event, USA, 13 November 2020, pp 117–127. ACM.

  • Zagoruyko S, Komodakis N (2016) Wide residual networks. In: Wilson RC, Hancock ER, Smith WAP (eds) Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19–22, 2016. BMVA Press.

  • Zeng Y, Park W, Morley MZ, Jia R (2021) Rethinking the backdoor attacks’ triggers: a frequency perspective. In: 2021 IEEE/CVF international conference on computer vision, ICCV 2021, Montreal, QC, Canada, October 10–17, 2021, pp 16453–16461. IEEE.

  • Zhang Z, Geiger JT, Pohjalainen J (2018) Deep learning for environmentally robust speech recognition: an overview of recent developments. ACM Trans Intell Syst Technol 9(5):491–4928.

  • Zhang Q, Ding Y, Tian Y et al (2021) Advdoor: adversarial backdoor attack of deep learning system. In: Cristian C, Xiangyu Z (eds) ISSTA ’21: 30th ACM SIGSOFT international symposium on software testing and analysis, virtual event, Denmark, July 11–17, 2021, pp 127–138. ACM.

  • Zhao P, Chen PY, Das P, Ramamurthy KN, Lin X (2020) Bridging mode connectivity in loss landscapes and adversarial robustness. In: 8th international conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020.

  • Zhao F, Zhou L, Zhong Q, Lan R, Zhang LY (2022) Natural backdoor attacks on deep neural networks via raindrops. Security Commun Netw


We would like to thank Yige Li for sharing his code with us. We also thank the anonymous reviewers for their insights.


This work was supported by the National Natural Science Foundation of China under Grant No. 62272007 and Grant No. U1936119, and by the Major Technology Program of Hainan, China (ZDKJ2019003).

Author information

The first author completed the main work of the paper and drafted the manuscript. The second author reviewed the manuscript and revised it critically. He also proofread the manuscript and corrected grammatical mistakes.

Corresponding author

Correspondence to Bin Wu.

Ethics declarations

Competing interest

Both authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Ying, Z., Wu, B. DLP: towards active defense against backdoor attacks with decoupled learning process. Cybersecurity 6, 9 (2023).
