Towards the universal defense for query-based audio adversarial attacks on speech recognition system

Recently, studies have shown that deep learning-based automatic speech recognition (ASR) systems are vulnerable to adversarial examples (AEs), which add a small amount of noise to original audio examples. These AE attacks pose new challenges to deep learning security and have raised significant concerns about deploying ASR systems and devices. Existing defense methods are either limited in application or defend only on results, not on the generation process. In this work, we propose a novel method to infer the adversary's intent and discover audio adversarial examples based on the AE generation process. The insight behind this method is the following observation: many existing audio AE attacks utilize query-based methods, which means the adversary must send continuous and similar queries to the target ASR model during the audio AE generation process. Inspired by this observation, we propose a memory mechanism that adopts audio fingerprint technology to analyze the similarity of the current query with a certain length of query memory. Thus, we can identify when a sequence of queries appears suspect, i.e., likely intended to generate audio AEs. Through extensive evaluation on four state-of-the-art audio AE attacks, we demonstrate that, on average, our defense identifies the adversary's intent with over 90% accuracy. With careful regard for robustness evaluation, we also analyze our proposed defense and its strength to withstand two adaptive attacks. Finally, our scheme is available out-of-the-box and directly compatible with any ensemble of ASR defense models to uncover audio AE attacks effectively without model retraining.


Introduction
Benefiting from the application of deep learning, the field of speech recognition has developed rapidly. However, deep learning-based automatic speech recognition (ASR) systems have been shown to be vulnerable to audio adversarial examples (AEs), which add tiny perturbations to benign audio clips to fool the deep neural network model. Thus, how to secure ASR systems against AE attacks remains a critical question.
Multiple mechanisms have been proposed to defend ASR against audio AEs. Some methods mainly rely on signal processing techniques such as smoothing, downsampling, and reconstruction [1][2][3][4]. These methods can destroy the adversarial components of an AE to a certain extent and prevent it from reaching its preset target, thereby reducing its impact on ASR, and they also generalize to unknown attacks; however, they damage benign samples as well. Other works train an additional DNN as a front end to the ASR model [5][6][7]. However, these defenses depend heavily on the algorithms used to generate the AEs: generalization capability is the key limitation, and such models struggle to discriminate adversarial samples that did not participate in training. In addition, existing defenses against audio adversarial examples focus on the generated AEs themselves, not on the process that generates them.
We reinvestigate and rethink the process of generating adversarial examples, trying to locate features specific to this process. We also scrutinize the current state-of-the-art attacks, including white-box attacks [8][9][10], black-box attacks [11][12][13], and transfer attacks [14][15][16]. We note that the perturbation of the AEs in some attacks is quite light, and the distance between them and the benign examples is small, with no particularly significant difference, so it is difficult to identify whether a single input is an AE. We often ignore the process of AE generation and only pay attention to the results; how can this discarded information be utilized? Except for some attacks that directly generate AEs, the majority need to keep visiting the target model to adjust the AE, essentially stealing key information (e.g., gradients) from the model. In this case, the adversary needs to send massive and similar queries to the target model within a period, which likely exposes the adversarial behavior. Therefore, based on this feature, we do not try to discover individual adversarial inputs; rather, we focus on the relationship between inputs to recognize the attack.
In this work, we propose a universal and lightweight defense framework that infers adversarial behavior through a memory mechanism. The basic idea is that the queries an adversary sends to the ASR model while generating an AE are continuous and correlated with one another, whereas a regular query is independent of others. We treat a certain number of historical inputs as a piece of memory, analyze the correlation between a new input and the memory, and mark the input as adversarial if the correlation crosses a certain threshold. We use the similarity of audio fingerprints to estimate this correlation; the insensitivity of the audio fingerprint to noise is an attractive trait. Meanwhile, owing to its simplicity, it is hard for the adversary to become aware that a defensive model is in use. Furthermore, motivated by the similarity matrix used in recommender systems, we can efficiently and quickly verify whether an input query is adversarial or benign. We employ a non-neural-network defense architecture: since the defense model cannot be optimized in the way a neural network can, an attacker cannot attack the defense model from that perspective. This strategy efficiently identifies the existing state-of-the-art adversarial example attacks, with robust average detection success rates (DSR) above 90%. Moreover, our proposed framework can be easily combined with any other existing defense method.
Finally, we study adaptive attacks. We designed experiments with random noise attacks, which disturb audio fingerprint feature extraction. For noise-adaptive attacks, we observed that a modest level of random noise instead improves the performance of our defense system, and we use this to build a more robust defense. In addition, we tested the effect of different "fake query" ratios p_fake on the results. We conducted experiments on both types of adaptive attacks and showed that our defense framework remains robust under them.
The main contributions of this work are three-fold: • We propose a new defense mechanism against adversarial audio attacks that analyzes the correlation between an input and the query memory. To our knowledge, this is the first defense framework for ASR based on the AE generation process. The robust average detection success rates exceed 90% for existing attacks, and we are the first to evaluate music-based AEs.
• We demonstrate the robustness of our defense framework toward adaptive attacks. We found that adaptive attacks based on fingerprint-extraction damage and on "fake queries" are unable to evade our defense, and our defense strategy remains effective. We build a more robust defense system by incorporating a moderate level of random noise.
• We designed a music-carrier dataset that can be used to produce audio adversarial examples, which also establishes a foundation for future research on music-carrier attacks and defenses. We release the source code for our defense and the datasets at: https://github.com/xxxx.

Background and Related Work
Adversarial Examples (AEs). Adversarial attacks originated with images and developed quickly, with much relevant research. Many works achieve successful attacks on image classifiers using computed gradients, and these attacks are relatively convenient to implement [17][18][19][20]. Some work explores transfer attacks from white-box to black-box models, but requires extensive access to the target model [15,16,21]. This provides a good reference for adversarial studies on audio. One may ask why adversarial examples exist at all. Several works [22][23][24][25] argue that adversarial examples are not a network drawback but a feature: the network attempts to learn "all" the useful features during training, whereas humans naturally ignore some of them. When an adversary attacks the model by manipulating such features, the accuracy of the model decreases rapidly while human accuracy is unaffected. Thus our concern is not to remove the AEs, which is infeasible; instead, we should mitigate the risk the AEs pose to the model.
Audio Adversarial Attacks on ASR. A similar situation exists in ASR. Typically, a state-of-the-art ASR model is susceptible to deception by malicious AEs, which have evolved from single-word attacks to attacks on entire sentences. Several state-of-the-art models have been successfully attacked: [8] used the CTC loss to compute gradients and attack DeepSpeech; CommanderSong [9] used pdf-ids to design a loss function and attack Kaldi (http://kaldi-asr.org); [26] attacked Lingvo (https://github.com/tensorflow/lingvo) with psychoacoustic masking. For black-box attacks, the gradient is incomputable. Nevertheless, [27] successfully attacked the DeepSpeech black-box model with a genetic algorithm; [28] successfully attacked four commercial speech API services (Google Cloud Speech-to-Text, Microsoft Bing Speech Service, IBM Speech to Text, and Amazon Transcribe); [3] successfully attacked the speech recognition API interfaces of iFLYTEK and Alibaba with a co-evolutionary algorithm. Besides, there are already attacks that can be launched in the physical world. To enhance the robustness of physical attacks, the authors of [27][28][29] added Gaussian white noise to AEs, and their evaluations show that this strategy enhances the physical robustness of the AEs; although they do not require a specific noise model, they may rely on the playback device and the experimental environment. These attacks inevitably require a massive number of queries to the models, and query-based attacks are growing more capable over time. In this article, we focus on recognizing such attacks before they succeed and defending against query-based adversarial attacks.
Defense against Audio Adversarial Attacks. The majority of proposed defenses against audio adversarial attacks remove or ruin the adversarial component with signal processing tools. [1] proposed random smoothing to mask the adversarial component; [2] proposed a WaveGAN vocoder to reconstruct the waveform and eliminate the perturbed region; [30] used label smoothing; [31,32] squeezed the audio; [3] used down-sampling; and [4] added distorted signals. These defenses are all concerned with removing or ruining the perturbation component. Such approaches have both advantages and disadvantages: they break the adversarial behavior of AEs while also causing substantial damage to benign queries, and there is no hard evidence for a difference between AEs and benign examples. Others have suggested applying sub-models to preclude some attacks [5,33], and the literature [6,7,34] applies extra neural networks to check for adversarial examples and protect the ASR model. But these can only restrain some existing attacks and are powerless against unknown ones, and their applications are limited because the sub-models are bulky. Some methods based on stateful detection of images [35,36] also provide guidance for audio adversarial attacks. Although these defensive works are effective against certain types of attacks, their evaluation of adaptive attacks is incomplete or oversimplified, and no integral architecture is available for combination with other methods. We work mainly on building a lightweight framework that can be easily combined with other defense methods.
Problem Setup. Hereafter, we concentrate on adversarial tasks. In this setup, the DNN is represented as f, and f : X → C maps a given input x (x ∈ X) to one of a set of classes C, where f(x) = c ∈ C. The DNN model is vulnerable to adversarial input attacks, which force it to misjudge. Attacks on DNNs can be classified as targeted and untargeted; here, we focus on the targeted setting. Specifically, an adversarial example x* is normally generated by slightly modifying x, i.e., x* = x + δ. Solving for δ can be cast as a minimization problem, i.e., arg min_δ L(f(x + δ), c*). The adversary's goal is to force f to misclassify x* as the target c*, i.e., f(x*) = c* with c* ≠ c. To ensure that x* is acoustically similar to x, the perturbation must be restricted to a limited range g(x* − x) ≤ ε, where g is a measure of auditory difference. The attack process is shown in Fig. 1: the correct transcription of x is "My friend, how are you", and the adversary's purpose is to add a careful perturbation δ to x so that the resulting x* is transcribed as the target "Call my wife". A minimal sketch of this iterative query loop is given below.
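The following Python sketch illustrates the iterative optimization underlying such targeted attacks. It is a toy illustration rather than any specific attack from the literature: `loss_grad` is a hypothetical quadratic surrogate standing in for the gradient of L(f(x + δ), c*), which a real attack would obtain by backpropagating through (or repeatedly querying) the ASR model.

```python
import numpy as np

# Hypothetical surrogate for the gradient of L(f(x + delta), c*) w.r.t. delta.
# In a real attack this gradient comes from the ASR model itself.
def loss_grad(x_adv, target):
    return x_adv - target  # toy quadratic loss pulling x_adv toward the target

def targeted_attack(x, target, eps=0.1, lr=0.01, steps=500):
    """Solve arg min_delta L(f(x + delta), c*) s.t. |delta| <= eps (sketch)."""
    delta = np.zeros_like(x)
    for _ in range(steps):                 # each step corresponds to one model query
        g = loss_grad(x + delta, target)
        delta -= lr * g                    # gradient descent on the loss
        delta = np.clip(delta, -eps, eps)  # keep the perturbation small: g(x* - x) <= eps
    return x + delta                       # candidate adversarial example x*
```

The point of the sketch is the loop itself: every iteration is one more query to the target model, which is exactly the trace our defense exploits.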

Defense against Query-Based Audio Adversarial Attacks
A successful audio AE requires a specified carrier (the carrier can be music or dialogue) to undergo several iterations and queries; the process of AE generation is continuous. Each time, the adversary produces a small disturbance δ to repeatedly adjust x*; when the decision boundary is crossed, a successful AE is obtained. The whole process is depicted in Fig. 2. Our defense is motivated by this process nature of query-based attacks: we can examine the query-to-memory relationship to determine whether queries are intended to generate an AE, which is a process-based defense approach. To calculate the correlation C of a new query with the memory, we use the similarity F of the audio fingerprints as an estimate, i.e., C(q_memory, q_new) ≈ F(q_memory, q_new). Each query audio has unique fingerprint information. The audio fingerprint is robust to noise and adapts to noisy environments; moreover, it can prevent audio splicing attacks [37]. From the obtained fingerprints, we can compute the similarity between the input query and the memory, which provides the foundation for our determination.

Defense Architecture
Our defense architecture is a process-based defense whose goal is to find potential attacks in continuous queries. Once we determine that the audio fingerprint similarity between the input query and the memory is beyond the set threshold, we report it as part of an attack sequence and take action accordingly, such as blacklisting or warning the querying user. Fig. 3 illustrates our scheme.
• Firstly, place query audio into the cache to form a query memory X of depth k. If the number of audio clips in the cache is below k, consider all queries as the memory sequence. In locating an attack we want to consume minimal resources and time, so k should not be too large; however, a k that is too small makes it harder to discover adversarial behavior. Here k is the shortest depth at which we can be sure that the input queries are intended to produce AEs.
• Secondly, calculate the fingerprints of all inputs in memory X, overwriting and updating the previous memory.
• Thirdly, for every new input audio, calculate the weighted cosine similarity between the new input and each fingerprint in memory. Since an audio fingerprint is a particular distribution over time and frequency, cosine similarity can capture the correlation between such coordinate-dependent distributions. Besides, since the legality of each input must be checked, we allocate a weight α_i to each input with the Inverse Variance Coefficient Method [38]. Then the similarity of the queries is calculated via

s_i = (x · y_i) / (||x|| ||y_i||), s = Σ_{i=1}^{k} α_i s_i, (1)

where x is the fingerprint of the new input, y_i is a fingerprint in memory, and k is the depth of the memory X. The final similarity value s is the weighted average of the s_i. The selection of the α_i values is explained in the next section.
• Fourthly, obtain the threshold δ, the minimal constraint for regarding an input as malicious. When s > δ, the current input is a potential attempt at generating an AE, and appropriate measures must be taken immediately. In practice, the setting of δ must achieve a high detection success rate as well as a low false positive ratio; usually, the false positive ratio is limited to no more than 10% of the training data, depending on the size of the training data set [35,39]. The details of k and δ are explained later in this section, and a minimal sketch of this detection pipeline is given below.
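The following Python sketch puts the four steps together. The names (`QueryDetector`, `fingerprint_fn`) are our own illustrative assumptions: fingerprints are assumed to be fixed-length vectors, and equal weights stand in for the α_i of Eq. 1, which are computed separately in the Parameter Selection section.

```python
import numpy as np
from collections import deque

class QueryDetector:
    """Process-based detection sketch: flag a new query whose average
    fingerprint similarity to the recent query memory exceeds a threshold."""

    def __init__(self, fingerprint_fn, k=75, threshold=0.31):
        self.fingerprint_fn = fingerprint_fn  # h: audio -> fixed-length vector
        self.memory = deque(maxlen=k)         # query memory X of depth k
        self.threshold = threshold            # delta; dataset-specific in the paper

    def check(self, audio):
        fp = self.fingerprint_fn(audio)
        if self.memory:
            # Cosine similarity of the new fingerprint to each one in memory;
            # equal weights replace the alpha_i of Eq. 1 in this sketch.
            sims = [np.dot(fp, m) / (np.linalg.norm(fp) * np.linalg.norm(m) + 1e-9)
                    for m in self.memory]
            suspicious = float(np.mean(sims)) > self.threshold
        else:
            suspicious = False                # nothing to compare against yet
        self.memory.append(fp)                # update the memory with this query
        return suspicious                     # True -> suspected AE generation
```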

Memory Sequence
A memory sequence X consists of several queries placed in the cache. In attack detection we expect to consume minimal resources and time, so X should not be too large; however, it is harder to detect adversarial behavior if X is too small. An X of depth k is the shortest sequence from which we can be sure that the queries are intended to produce AEs, i.e.,
f(i) = 0 for i = 1, 2, ..., k − 1 and f(i) = 1 for i = k, k + 1, ..., n, i.e., k = min{ i | f(i) = 1 }, (2)

where f is the detection function and f(i) indicates whether f can detect a sequence of length i. Eq. 2 implies that depths 1, 2, ..., k − 1 are not sufficient for X to be considered as showing the intention of generating AEs, while depths k, k + 1, ..., n are considered to show that purpose, with the minimum depth being k. We explain how to choose the value of k in the parameter selection section.

Query Audio Fingerprints Similarity
Auditory similarity is an important feature for estimating the gap between humans and machines. There is a close auditory similarity between malicious and benign examples, since malicious examples are produced by appending carefully structured small perturbations to benign carriers. Although the neural network regards them as two completely different classes, humans intuitively perceive them as the same. This trait of staying intuitively consistent with humans is exactly what we need: the audio fingerprint has it and is not as sensitive to perturbations as a DNN. Fingerprints maintain high similarity whenever humans believe the samples are the same.
According to the similarity between the preserved fingerprints and a new one, it is possible to predict whether the new input is strongly correlated with the memory and whether they share the same behavioral attributes. This is similar to a recommender system [40,41], which differentiates users based on their past behavior and recommends new content or products [42,43].
We note that a digital audio fingerprint [44,45] uniquely flags audio: small noise in the audio does not disturb the core information of the fingerprint, it can defend against attacks such as audio patching, and using fingerprint similarity as the audio similarity is reliable and cheap to implement. Fingerprint similarity relies on the following requirements: assume s is the similarity function and x, y, z are three candidates in a D-dimensional space satisfying Eqs. 3-6.
s(x, y) ≥ 0 (non-negativity), (3)
s(x, y) = s(y, x) (symmetry), (4)
s(x, y) + s(x, z) ≥ s(y, z) (triangle inequality). (5)

The last requirement (Eq. 6) is perceptual: a robust acoustic fingerprinting algorithm must consider how the audio is perceived. When two audio files sound the same, their acoustic fingerprints should be the same or very close, even if there are some differences in their file data.
Following the literature [37,44], fingerprint similarity can be divided into two steps: fingerprint extraction and similarity calculation.
Audio corresponds to a unique fingerprint, so the relationship between a digital audio fingerprint F and an audio object X is a surjection h : X → F, with each audio object mapping to only one fingerprint.
• Fingerprint extraction (h : X → F). The fingerprint extraction process is illustrated in the fingerprint extraction module of Fig. 4. The main procedures are: 1) Preprocessing: frame splitting and filtering of the input data.
2) STFT: short-time Fourier transform. For each frame, apply the STFT

S(ω, τ) = ∫ x(t) h(t − τ) e^{−jωt} dt, (7)

where x(t) is the input signal at time t, h(t − τ) is the window function, and S(ω, τ) is the spectral result when the center of the window function is τ.
3) Find peaks: after the STFT, select the frequency peaks f and corresponding times t, ensuring that the distribution of frequency peaks is uniform.
4) Pairs: pair the obtained frequency peaks f with times t; the resulting {f, t} pairs form the fingerprint f_i, a high-dimensional vector of a certain length.
• Find peaks. In Fig. 4, after calculating the STFT we uniformly select peaks in the frequency domain. Eq. 8 describes this process as filtering the two-dimensional STFT matrix F(n, m) with a kernel function H(u, v). Eq. 9 is the maximum filter, which keeps only local maxima, and Eq. 10 is the high-pass filter

H(u, v) = 0 if D(u, v) ≤ D_0, H(u, v) = 1 otherwise, (10)

which resets a frequency to 0 when it falls below the cutoff D_0. Both filters are useful for canceling low-frequency components and uniformly capturing the local maxima at high frequencies; we choose the former (the maximum filter) as our tool to find peaks.
• Similarity calculation (g : F → S). After fingerprint extraction we obtain the fingerprint f, written as x = f_i; similarly, another fingerprint can be written as y = f_j, with the same length as x. Then the similarity s between them is calculated, as illustrated in the similarity module of Fig. 4. The fingerprint contains coordinate-dependent details, and the similarity of x and y is finally obtained via Eq. 1. A sketch combining the extraction steps above is given below.
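The following Python sketch strings the extraction steps together in a landmark style, assuming SciPy is available. The window length, filter size, amplitude floor, and `fan_out` pairing parameters are illustrative assumptions rather than the paper's values, and the high-pass cutoff of Eq. 10 is approximated by a simple amplitude floor.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import maximum_filter

def fingerprint(audio, fs=16000, amp_floor=10.0, fan_out=5):
    """Landmark-style fingerprint sketch: STFT -> local spectral peaks ->
    paired (freq, freq, time-delta) landmarks."""
    _, _, S = stft(audio, fs=fs, nperseg=512)           # Eq. 7: short-time Fourier transform
    mag = np.abs(S)
    local_max = maximum_filter(mag, size=15) == mag      # Eq. 9: keep local maxima only
    peaks = np.argwhere(local_max & (mag > amp_floor))   # crude stand-in for the Eq. 10 cutoff
    peaks = peaks[np.argsort(peaks[:, 1])]               # order (freq, time) peaks by time
    pairs = []
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1:i + 1 + fan_out]:      # pair each peak with nearby later ones
            pairs.append((int(f1), int(f2), int(t2 - t1)))
    return pairs
```

In practice the landmark list would be hashed or flattened into the fixed-length vector that the similarity module of Fig. 4 compares via Eq. 1.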

Parameter Selection
• The choice of k and δ. The larger the value of k, the more effectively our solution observes input queries; the smaller the value of k, the lower the computational cost. Here k is the minimum depth of memory before we are sure that the inputs are intended to produce an AE, and δ is the minimum similarity before we decide the current input is malicious, so the value of δ is influenced by the depth k. Specifically, establishing the threshold requires evaluating fingerprint similarities on the datasets, such that if an entire dataset were randomly streamed as queries, 0.1% of the carrier dataset would be marked as attacks. (In theory, the false positive percentage should be limited to 10% of the dataset size, but since our dataset is small, we use a value 100 times smaller than this default.) The threshold δ is thus a function of k, and Fig. 5 discloses their relation. The smaller the threshold δ, the stronger the constraint on the input; hence small thresholds are advisable, but a value that is too small risks treating benign inputs as malicious. As observed in Fig. 5, as k increases the similarity drops sharply in the beginning (equivalently, the distance rises rapidly: the higher the similarity, the lower the degree of dissociation between input queries, i.e., the closer the distance). After about k = 75 the curves become smooth and change only modestly with k, so we set k to 75; the corresponding thresholds δ for the two datasets are 0.313711 and 0.207398.

• The choice of α. First, consider the case in Eq. 11:

s_1 = f(X_A, q_new), s_2 = f(X_B, q_new), (11)

where there are two memory sequences, X_A = {q_0, q_1, ..., q_m, q_n} and X_B = {q_0, q_1, ..., p, q_n}, s_1 and s_2 are the similarities of the two sequences with the new input, and f is the fingerprint similarity function. The key distinguishing element between X_A and X_B is that the query q_m is replaced by p. Assume p is a query deliberately placed among the queries by the adversary, whose purpose is to fabricate a fake input (almost irrelevant to the others) to confuse the similarity analysis and hide the attack intent. Essentially, both X_A and X_B are malicious memory sequences with only a trivial disparity, but if s_1 is beyond the threshold and s_2 is below it, X_A is flagged as a potential attack while X_B is not, owing to the injected fake query. We call such a p-input a "fake query", and the ratio of fake queries to all queries is p_fake (p_fake = (p/k) × 100%).

In our experiments we found that the s value changes sharply when there are fake queries in the query memory, and we employ the Inverse Variance Coefficient Method [38] to describe such fluctuations and disparities. With this method the weights α are assigned as

α_i = (mean_i / std_i) / Σ_j (mean_j / std_j), (12)

where mean_i and std_i are the mean and standard deviation over s(l), the query vector of length l/2 before and after the i-th query. For l, we set the maximum value to 7 (no more than 10% of the memory length, i.e., l_max = floor(0.1 × k) = floor(0.1 × 75) = 7) and l begins at 2 (the mean and variance require at least two values). The value then increases linearly; when it exceeds the maximum, l shrinks to half its value and then increases linearly again. This process repeats until all elements are traversed. A sketch of this weight schedule is given below.
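A Python sketch of the schedule follows, assuming the per-query similarities s_i have already been computed. The exact normalization and the window behavior at the sequence borders are our own assumptions.

```python
import numpy as np

def alpha_weights(sims, l_max=7):
    """Inverse-variance-coefficient weights (sketch): weight_i = mean_i / std_i
    over a local window of l/2 queries before and after query i, with l cycling
    2, 3, ..., l_max, then shrinking to half and growing again."""
    k = len(sims)
    raw, l = np.zeros(k), 2
    for i in range(k):
        half = max(l // 2, 1)
        window = sims[max(0, i - half): i + half + 1]  # s(l): local query vector
        std = np.std(window) + 1e-9                    # avoid division by zero
        raw[i] = np.mean(window) / std                 # inverse coefficient of variation
        l = l + 1 if l < l_max else max(l // 2, 2)     # grow linearly, then halve
    return raw / raw.sum()                             # normalize so weights sum to 1
```

Queries whose local similarity fluctuates strongly (e.g., injected fake queries) receive small weights, damping their effect on the weighted average s of Eq. 1.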

Evaluation
In this section, we show the evaluation results of our scheme against non-adaptive and adaptive attacks. We collected open-source attack code as far as possible; we did not evaluate attacks without open-source code, but we surveyed their details. In the end we evaluated four classes of attacks that are well known in audio adversarial attacks; they are sufficiently representative, and the bulk of other work revolves around them. We evaluate the CommanderSong (CS) [9] attack and the Devil's Whisper (DW) [28] attack using the Music-sets, and apply the Mini-Librispeech dataset to assess the IRTA attack [26] and the DS attack [8]. All of these attacks reported a success rate of attack (SRoA) of almost 100%. (IRTA abbreviates the attack of the paper "Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition"; DS abbreviates the attack of "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text", which attacks the DeepSpeech model.)

Datasets
Our scheme conducts experiments on the Mini-Librispeech (https://www.openslr.org/31/) and Music-sets datasets. (We build a carrier library of music-based samples containing 10,553 music clips; Appendix 1.1 contains all details about Music-sets.) Mini-Librispeech is a dialogue-based dataset on which some classic attack works rely, so we cannot ignore it [11,12,27]. As for Music-sets, music is available at large scale in most situations, and its accessibility and popularity make it a natural carrier candidate in attacks; many strong attacks use music as the carrier for producing AEs [3,8-10,28]. Defense and evaluation of AEs on musical carriers are therefore inevitable and important.
Evaluation Metric

• DSR. To evaluate the effectiveness of our approach in defending against query-based attacks, we employ the detection success rate (DSR) and the First-Signal-to-Noise Ratio (FSNR) as metrics. The DSR is the most intuitive metric for the detection results and is calculated as

DSR = (d_n × k) / a_n, (13)

where d_n is the number of detections, a_n is the number of queries, and k is the length of the memory X. The DSR value is clearly below 1 because a_n > d_n × k; a detection occurs only after at least one query has been performed. For our purposes, it measures the probability of finding adversarial behavior; a higher DSR is preferable.
• FSNR. The First-Signal-to-Noise Ratio (FSNR) is defined as the SNR at the moment an attack is first detected, i.e., how large the SNR still is when we can detect the attack:

FSNR = 20 log_10 (A_x / FA_δ), (14)

where x is the original sound, δ is the perturbation, A_x is the amplitude of the original sound, and FA_δ is the amplitude of the perturbation when the attack is first detected. This is a metric of the relative distortion of the AE versus the original sound; a higher FSNR means the query is flagged as suspect under a smaller perturbation. A sketch of both metrics follows.
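A minimal Python sketch of both metrics, assuming the amplitude in Eq. 14 is the peak absolute sample value (our assumption; the paper does not spell out the amplitude measure):

```python
import numpy as np

def dsr(d_n, a_n, k=75):
    """Detection success rate (Eq. 13): fraction of the a_n queries covered
    by the d_n detections, each standing for one memory window of length k."""
    return d_n * k / a_n

def fsnr(x, delta_first):
    """First-Signal-to-Noise Ratio (Eq. 14, amplitude form, sketch): SNR
    between the carrier x and the perturbation at the first detection."""
    a_x = np.max(np.abs(x))             # amplitude of the original sound
    fa_d = np.max(np.abs(delta_first))  # perturbation amplitude at first detection
    return 20 * np.log10(a_x / fa_d)    # in dB
```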

Non-adaptive Attack Evaluation
We evaluate the four classes of attacks introduced above. CS is the representative attack employing music as the carrier, and subsequent work [3,28] treats it as an indispensable baseline. DW is the typical instance of attacking commercial black-box APIs, and much later work [12,13] on black-box attacks likewise tests on APIs. IRTA, based on the psychoacoustic hiding model, is an outstanding work of its period, and several studies [10,46] adopted its psychoacoustic masking effect. DS is the earliest version of the voice attack; it opened the gateway to voice attacks and provided a reliable infrastructure for subsequent works.

Tab. 1: Non-adaptive attack evaluation. SRoA denotes the success rate of attack; higher DSR and FSNR values are better. Normally, a detection is made every k (k = 75) queries; if there are fewer than k queries, at least one detection is performed over all n queries, and the ratio n/k gives the number of detections.
• N1. CS attack evaluation. CS is a white-box attack that injects target commands into a song. It set the precedent of producing AEs with music as a carrier and achieved a 100% success rate of attack (SRoA) on the Kaldi speech recognition system; it has had a profound influence, and many follow-up works treat it as an indispensable reference. For a defense based on our approach, music contains few blanks and an abundant spectrum, so its fingerprints are often more reliable than those of dialogue. Tab. 1 shows that CS examples spend on average about 300 visits to the target model, and our security architecture accurately detects such attacks with a DSR up to 98%. However, the FSNR value is only 7.38 dB, revealing that the AEs were already very noisy by the time we suspected the queries of being an attack. The primary reasons are that small perturbations are not well suited to the CS attack and the perturbation is constrained only to a very broad range, so the amount of added noise is significant. Apart from that, the audio length affects the SRoA of an AE: to ensure validity, the audio should be no shorter than 4 s. The longer the audio, the richer the fingerprint, which helps detection; shorter audio, in turn, makes it harder for the adversary to generate AEs successfully.
• N2. DW attack evaluation. DW first accomplished a black-box attack on commercial speech recognition APIs (including Google Assistant, Google Home, Amazon Echo, and Microsoft Cortana). Since then, attacks on APIs have gradually become a necessary option for black-box attacks and the most intuitive indicator of an attack algorithm. Tab. 1 shows that DW also works on the music dataset, with about 50% of CS's average number of queries to the target model and an SRoA close to 98%. On defense, our approach achieves a DSR of 84.74% under the DW attack. DW employs a local substitute model to approximate the target model behind the API ASR system, which reduces the number of queries and the likelihood of triggering detection, so the DSR loses some ground. The FSNR value is 18.41 dB, about 2.5 times that of CS: by reducing the number of visits to the model, DW naturally decreases the perturbation and raises the FSNR. DW adopts a noise model to augment the physical robustness of its AEs, but the SRoA then depends heavily on the environment and the device. Against the noise model, combining our scheme with some straightforward measures (e.g., down-sampling, filtering) can raise the difficulty of a physical attack.
• N3. IRTA attack evaluation. IRTA is a two-stage attack algorithm on Lingvo that conceals target commands in a space the human ear cannot hear via a psychoacoustic masking model. The IRTA examples are based on the open-source Librispeech dataset. This type of dialogue audio contains a large number of silent fragments, so its fingerprints are inferior to those of music. Encouragingly, our approach still maintains robust attack detection, with the DSR reaching 84%. This can be attributed to the time cost of this type of attack (producing one successful adversarial example costs 24.8 h), which leads to a remarkable number of queries, and such massive querying easily provokes the inspection of the defense system. Moreover, the perturbation is very small, and the FSNR can reach 40.97 dB, in which the psychoacoustic masking model plays an important role. Still, the perturbation is reflected in the frequency domain, and fingerprint extraction happens in the frequency domain. We can therefore presume that it will be costly for adversaries emphasizing perturbations hidden via psychoacoustic masking to bypass our defense. Nevertheless, this also exposes a critical concern: in the regions that humans cannot hear, is there any necessity for the machine to listen? AI researchers aim to narrow the gap between humans and machines, so machines should also behave human-like in regions beyond human perception. Blocking such attacks implies that the machine does nothing in regions humans cannot perceive; the attack would then completely dissolve.

Tab. 2: An overview of the query-based attacks against ASR. Note: in the table, "GD", "GA", "GE", and "SGE" represent Gradient Descent, Genetic Algorithm, Gradient Estimation, and Selective Gradient Estimation; "Alt-M", "Psy-M", "Co-E", "PSO", and "Mul-Obj GO" represent Alternative Models, Psychoacoustic Masking, Co-evolutionary algorithm, Particle Swarm Optimization, and Multi-Objective Genetic Optimization. "M or D" indicates Music-carrier or Dialogue-carrier, "-" denotes values the authors did not report, and "*" denotes that the authors reported the WER of the attacked model on AEs increased to 980%.
• N4. DS attack evaluation. DS is the attack first implemented on DeepSpeech; its core is optimizing the CTC loss function. Compared with IRTA, DS is relatively heavily perturbed, perhaps because it does not apply psychoacoustic masking, and thus has a relatively poorer FSNR, but its DSR of 82.5% is close to IRTA's. Compared with the CS and DW attacks, DS and IRTA are implemented on Librispeech, which carries sparse fingerprint information, so their DSR is inferior to CS and DW; nevertheless, their overall FSNR is superior, showing the method's ability to detect attacks with small perturbations. Separate works deploy genetic algorithms and gradient estimation to generate adversarial examples. However, gradient estimation relies on sampling theory, and biological evolutionary algorithms demand substantial expense without the guidance of the gradient: the literature [27] requires query numbers up to 1000+, and [3] reaches a stunning 30000+. As Tab. 1 shows, detection rates are markedly higher for query numbers above 1000+. A large number of queries is an obvious disadvantage of evolutionary algorithms; unless this shortcoming is remedied, they cannot expect to evade our inspection.
We also investigated the perturbation level of the AEs so that they can easily be compared with the FSNR, as shown in Tab. 4.
• N5. Other query-based attacks. The majority of other query-based attacks are based on the four attacks above, as characterized earlier. In addition, the literature [47,48] uses biological evolutionary algorithms to perform attacks and to optimize the number of queries. Since our defense framework is process-based, we were unable to evaluate attacks without open-source code, but we surveyed them; more details are provided in Tab. 2.
From the above we learn that a music carrier is quite advantageous for detection, and detection is also strong when the number of queries is large; the critical factor is that music fingerprints are more robust to perturbations, whereas conversational ones are not. Regarding fingerprint extraction, Fig. 9 in Appendix 2 supports similar results. Below, we build a more robust defense system by combining our scheme with other methods, raising the average DSR beyond 90% and substantially strengthening the defense; Tab. 5 shows the results, and the details are given in the Robust Defense section. For adversaries, unless these shortcomings are overcome, they should not expect to evade our inspection.
Adaptive Attack Evaluation

Whereas our defense framework can effectively detect existing attacks, the above only holds in "zero-knowledge" attack scenarios, where the attacker is unaware of the existence of the defense framework. To reliably deploy our framework in practice, we must assess adaptive adversaries who understand the defense details entirely and deploy strategies to bypass the defense mechanism. Following the guidelines of [49], we designed adaptive attacks to evaluate our defense's resilience. Based on the defense details, we consider two adaptive attacks: the random noise attack and the proportion-of-fake-queries attack.
Fig. 6 shows the effect of p_fake on the DSR.
• A1. Random noise attack. We conceive an adaptive attack that corrupts fingerprint extraction: randomly insert noise with different SNRs into the audio during querying, forcing x* to bypass the defense while still successfully attacking the ASR, with the perturbation not easily perceived by a human. In Fig. 7, according to audio quality theory, audio with an SNR above 70 is of high-fidelity quality; at SNR = 0 the noise has the same energy as the original audio, so below 0 the original audio is almost flooded with noise. A sketch of this noise injection is given below.
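A minimal Python sketch of the noise injection, assuming white Gaussian noise scaled to a target SNR (the attacker's noise distribution is not specified in detail, so Gaussian is our assumption):

```python
import numpy as np

def add_noise_at_snr(audio, snr_db):
    """Inject white noise scaled to a target SNR (sketch of attack A1:
    the adversary perturbs each query to corrupt fingerprint extraction)."""
    sig_power = np.mean(audio ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10.0))  # SNR = 10 log10(Ps / Pn)
    noise = np.random.randn(len(audio)) * np.sqrt(noise_power)
    return audio + noise
```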
As shown in Fig. 7, for the CS and DW attacks, when the noise SNR is below 0 the SRoA is also nearly 0: the malicious queries almost never succeed in attacking the ASR system, which is unacceptable for the adversary. When the noise SNR is above 0, the SRoA and DSR rapidly recover to their maximum values and remain there; in other words, the DSR stays closely consistent with the SRoA. Although heavy noise decreases the DSR, it also decreases the SRoA, which diverts from the adversary's goal, so it is impossible to achieve a high SRoA while breaking our defense. For the IRTA and DS attacks, as the noise SNR gradually increases, the SRoA also recovers rapidly to its maximum and stays there (IRTA recovers more slowly), while the DSR rises sharply, then drops gently until it levels off. Since Mini-Librispeech is a dialogue-based dataset containing many blank frames, inserted noise fills the blanks and actually helps fingerprint extraction. We can deduce that adding appropriate noise improves the robustness of our method: noisy queries do not undermine our defense; on the contrary, they make the defense system more sensitive and robust.
Tab. 3: DSR as a function of p_fake: the effect of different fake query ratios on the detection success rate.

• A2. Proportion of fake queries attack. We noted above that some adversaries use "fake queries" to fabricate a fake query history. In this section, we evaluate the impact of different proportions of fake queries (p_fake) on the defense system. Tab. 3 lists the results, which can also be read intuitively from Fig. 6. As observed, there is a critical threshold of p_fake for the defender: once p_fake exceeds it, the DSR drops dramatically. For these attacks, if p_fake ≥ 60%, the DSR drops to approximately 10% or 0%. For the CS and DW attacks, the DSR drops linearly when p_fake ∈ [25, 50]; for the other two attacks this does not happen. An intuitive explanation is as follows: p_fake mainly affects the estimation of the queries of interest for the defense, and since the priority of our defense is to distinguish the authenticity of queries, p_fake tends to have a larger impact on our proposal. The adversary's strategy to evade detection would probably be to set p_fake to a sufficiently high value (e.g., p_fake ≥ 60%), but this would dramatically raise the cost of the attack and the number of queries, overwhelming the attacker with no certainty of obtaining AEs that successfully attack the target model.

• A3. Other adaptive attacks. EOT is a well-known attack on images [50]. However, after testing EOT transformations on audio (including waveform shifting, volume up/down, pitch shifting, frequency masking [51], SpecSwap, etc.), we found that they can enlarge datasets but have no significant effect on ASR results. We suspect this is due to the time-series correlation within audio data, so simple transformations cannot affect ASR, and it is therefore difficult to perform an adaptive attack similar to those on images. We evaluated the two likeliest adaptive attacks; of course, other attacks could be designed from the details of the defense, but probably without significant impact.

Tab. 5: Robust defense: we add noise at different SNRs; the lower the SNR, the heavier the added noise.

Robust Defense
In the random noise adaptive attack (Fig. 7), we found that an appropriate level of noise can help us build a more robust defense system, so we studied this subtle relationship further. In Tab. 5, we set up six different noise levels. When SNR > 75 the audio is of high-fidelity quality and the noise is extremely slight. Once the noise rises to SNR = 75, our defense system achieves more than a 90% detection success rate for all attacks; when the noise rises to SNR = 50, the detection success rate reaches its maximum (93.69% on average). When SNR < 25, the noise becomes significant, exceeds the tolerable level, and the detection success rate drops. So, with noise SNR ∈ [25, 75], we can build a more robust defense system and achieve a detection ratio of over 90%. Besides, our experiments also confirm that small input noise has a defensive effect [52].

Conclusion and Discussion
In this work, we analyze adversary behavior during AE generation and detect potential attacks based on the correlation between successive queries. Our focus is on detecting the AE generation process, which provides a novel, process-based approach to defense. Our approach achieves an average detection success rate of over 90%. It is a lightweight framework, both quick and efficient, that can be closely combined with other defenses to build the foundation of a structured defense system.
However, as attack research progresses, single-step AE generation attacks are growing, which imposes higher requirements on defenses. From another perspective, our scheme increases the attacker's cost, but an attacker with very large resources could still fool it. Fingerprint fraud techniques could also create vulnerabilities in our approach. In addition, some adversaries may give up attacking the target system and instead attack the defense system, which also warrants attention.

Music-sets
We contacted the authors of CommanderSong [9] and Devil's Whisper [28] to consult them on how they designed the music-based carriers for the adversarial examples (AEs) used in their experiments, and obtained a copy of the original music dataset they applied. To evaluate the threshold, we created a music carrier dataset for making AEs based on this original music dataset. We have released the processed dataset; you can get our data from: https://drive.google.com/file/d/1wPVK9S8TyB0aaXqXFKEebYKuKshmBvDc/view.
The original music dataset is a raw collection of 100 songs gathered from YouTube, covering pop, classical, rock, and light music, and spanning multiple languages, including Korean, English, Japanese, Chinese, Russian, and Arabic. Each song is about 5 minutes long.
In our experiments, we studied the impact of audio length on AEs and found that it affects the generation of adversarial examples: overly short audio decreases the attack success rate, and overly long audio increases the cost of producing AEs. Only audio of suitable length is a good AE candidate. We use the Word Error Rate (WER) to study this issue.
The WER is defined as

WER = (S + D + I) / N, (15)

where S is the number of characters replaced, D the number of characters deleted, I the number of characters inserted, and N the total number of characters (a sketch is given after this paragraph). From Fig. 8 it can be seen how the WER changes with the length of the audio. If the audio is shorter than 3.19 s, the attack success rate of the AEs decreases as the audio length shrinks (the WER of the target command increases); above this value, the attack success rate reaches 100% and the WER falls to 0%. However, the time cost of producing an AE increases linearly with the audio length: the longer the audio, the higher the cost of producing AEs. At an audio length of 3-4 s, the best performance is obtained and the ratio of time cost to WER is lowest, so the recommended audio length is 3 s or 4 s, balancing time and word error rate. When producing our dataset, we accordingly cut each audio clip to 3 s or 4 s to balance the attack success rate and the cost.
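A minimal Python sketch of Eq. 15 via the classic edit-distance recurrence; it operates on whitespace-split tokens for readability, whereas Eq. 15 as stated counts characters (splitting into characters gives the character-level variant):

```python
def wer(reference, hypothesis):
    """Error rate via edit distance: (S + D + I) / N, Eq. 15 (sketch)."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference tokens
    # into the first j hypothesis tokens (Levenshtein recurrence).
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])        # substitution
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion / insertion
    return dp[len(r)][len(h)] / max(len(r), 1)                     # (S + D + I) / N
```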
To simulate disturbances and improve the noise immunity of the audio, we insert some noise into the clean dataset. Our experiments showed that when music serves as the carrier and the randomly inserted noise is kept within 8000, the similarity distribution lies in [0.36, 1]; the noise does not influence human auditory perception, and the primary information of the audio remains accessible. So we keep the randomly inserted noise below 8000. When clipping the music, the length of each slice is limited to 4 s and slices are cut at random positions. For each song we segment 25 slices at a time, 5 times in total, obtaining 5 × 25 × 100 = 12,500 slices. After that, noise is randomly inserted into some of these slices by randomly displacing the sequence. After testing each slice, 10,553 qualified slices were obtained in total, occupying nearly 1.3 GB of storage.
Currently, in the field of audio adversarial attacks, no publicly available AE-carrier dataset is based on music; the existing ones are dialogue-based. Yet music is becoming a necessary candidate carrier for attacks due to its advantages, and proper datasets are lacking. To alleviate this problem, we are happy to share our data with the research community so that more research on music-based attacks and defenses can be developed, and we welcome interested researchers to extend the dataset with us.

Mini LibriSpeech
For the Mini-LibriSpeech dataset, we used FFmpeg (https://github.com/FFmpeg/FFmpeg) to convert the files from FLAC to WAV. Following Fig. 8, we removed samples that were either overly short or overly long, and we suggest recalculating the threshold to ensure that detection is not affected whenever the dataset is modified. The training data set can be downloaded from https://www.openslr.org/resources/31/train-clean-5.tar.gz.

Benign examples and AEs Audio Fingerprint
As shown in Fig. 9, when AEs are generated by adding perturbations (i.e., noise) to clean carrier audio, the music-based carriers yield relatively more and richer fingerprints than the dialogue-based ones, which confirms that music-based AEs are easier for our scheme to detect. We also observed that the fingerprint difference between AEs and their carriers is small: when a carrier is being used to generate AEs, the fingerprint of each query is similar and the calculated similarity between the queries is very high. This further proves the viability of our scheme.

Societal Impacts
For attacks that require querying the ASR model, much previous defense work concentrated on processing individual inputs to achieve the defense purpose. Considering only the examination of individual inputs loses the procedural information, and the results are often unreliable. Our scheme, on the other hand, considers the totality and continuity of inputs and captures this neglected information, which helps us track adversary behavior and make an accurate diagnosis; such a strategy is also more consistent with sociology. Meanwhile, dialogue-based carriers have many limitations in practical applications and are hard to reproduce in real attack scenarios, so researchers are gradually abandoning them; music-based AEs are gradually becoming the mainstream of attacks. Music is easily reproduced in actual attack scenarios, and the danger is very significant if music is hijacked as AEs, which researchers cannot ignore. However, existing evaluations of defense work still focus on public dialogue datasets and lack evaluation on music-based datasets. In this paper, we comprehensively evaluate AEs with music-based carriers, which has a large social impact and lays a solid foundation for related works in the future.

Fig. 1: The correct transcription of x is "My friend, how are you", and the adversary's purpose is to add a careful perturbation δ to x so that the resulting x* is transcribed as the target "Call my wife".

Fig. 2: Query-based attack: set a target; initially x* = x. If x* is transcribed as the target, the AE succeeds; otherwise, adjust δ carefully and perform the next query. Repeat this process until x* is transcribed as the target.

Fig. 5: k and δ: the mean similarity value at the 0.1% percentile of the datasets as a function of k.

Fig. 7: Adaptive attack: noise at different SNRs disturbs the extraction of fingerprints. Noise-SNR indicates the SNR of the added noise; a smaller Noise-SNR means a higher noise level.

Fig. 8: Audio length impacts the production time of AEs and the integrity of the command.

Fig. 9: Fingerprints of clean audio and AEs for dialogue carriers and music carriers.