Towards the transferable audio adversarial attack via ensemble methods

Cybersecurity

Table 2 Transfer rate (TR) of adversarial samples on commercial cloud speech-to-text APIs

API service	Audio length	DS	ITRA	CS	DW	Occam	DC	RGE	DGWE
Aliyun-API	4s	2/50	0/50	9/50	15/50	50/50	–	35/50	20/50
Xfyun-API	4s	0/50	0/50	20/50	35/50	50/50	34.5/50	30/50	35/50
Baiduyun-API	4s	4/50	0/50	4/50	–	–	37.25/50	38/50	33/50
Tencent-API	4s	2/50	2/50	5/50	20/50	–	41.2/50	–	–
Microsoft-API	4s	15/50	7/50	15/50	40/50	50/50	–	–	–
Average	4s	4.6/50	1.4/50	10.6/50	27.5/50	50/50	37.7/50	34.3/50	28.3/50
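
Each entry above is a count of successfully transferred samples out of 50, and the Average row appears to be taken only over the APIs on which a method was tested. The following is a minimal sketch of that arithmetic, not the authors' code: the function names are hypothetical, and the 69% rate and the 3-out-of-10 count are illustrative inputs rather than figures from the cited papers. Only the DW column values are taken from the table.

```python
TARGET_SAMPLES = 50  # denominator used for every entry in the table


def rate_to_count(success_rate_percent, target=TARGET_SAMPLES):
    """Express a success rate given in percent as a count out of `target`."""
    return success_rate_percent * target / 100


def rescale_count(successes, evaluated, target=TARGET_SAMPLES):
    """Rescale `successes` out of `evaluated` samples to a count out of `target`."""
    return successes * target / evaluated


def average_over_tested(entries):
    """Mean of a column, skipping untested APIs (None stands for the "–" cells)."""
    tested = [e for e in entries if e is not None]
    return sum(tested) / len(tested)


# Illustrative inputs (assumptions, not values reported by the cited papers).
print(rate_to_count(69.0))     # a 69% success rate on the x/50 scale
print(rescale_count(3, 10))    # 3 successes out of 10 samples on the x/50 scale

# DW column from the table: Aliyun, Xfyun, Baiduyun (untested), Tencent, Microsoft.
dw_column = [15, 35, None, 20, 40]
print(average_over_tested(dw_column))
```

Skipping the untested cells, rather than counting them as zero, is what reproduces the 27.5/50 average listed for DW.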

The abbreviations DS, ITRA, CS, and DW refer to the attack methods presented in Carlini and Wagner (2018), Qin et al. (2019), Yuan et al. (2018), and Chen et al. (2020), respectively. Occam and DC refer to the attack methods proposed in Zheng et al. (2021) and Xu et al. (2022), respectively. A “–” indicates that the attack method was not tested on that API. For DW (Chen et al. 2020), the authors evaluated 10 AEs; to match the format of this work, we rescaled their counts to a denominator of 50. For DC (Xu et al. 2022), the authors reported an attack success rate, which was converted by calculation to the format