Efficient electro-magnetic analysis of a GPU bitsliced AES implementation

The advent of CUDA-enabled GPUs makes it possible to provide cloud applications with high-performance data security services. Unfortunately, recent studies have shown that GPU-based applications are also susceptible to side-channel attacks. These published works studied the side-channel vulnerabilities of GPU-based AES implementations by taking advantage of the cache sharing among multiple threads or the high parallelism of GPUs. Therefore, for GPU-based bitsliced cryptographic implementations, which are immune to the cache-based attacks referred to above, only a power analysis method based on the high parallelism of GPUs may be effective. However, the leakage model used in that power analysis is not efficient in practice. In light of this, we investigate electro-magnetic (EM) side-channel vulnerabilities of a GPU-based bitsliced AES implementation from the perspective of bit-level parallelism and thread-level parallelism, in order to make the best of the localization effect of EM leakage with parallelism. Specifically, we propose efficient multi-bit and multi-thread combinational analysis techniques based on the intrinsic properties of bitsliced ciphers and the effect of multi-thread parallelism of GPUs, respectively. The experimental results show that the proposed combinational analysis methods perform better than non-combinational and intuitive ones. Our research suggests that multi-thread leakages can be used to improve attacks if they are not synchronous in the time domain.


Introduction
Nowadays, as the most widely used parallel computing platform, the Graphics Processing Unit (GPU) has evolved from special-purpose hardware for graphics rendering into a general-purpose computing device for various applications such as biomedical analysis, signal processing and scientific computing. A GPU executes programs in a Single-Instruction, Multiple-Thread (SIMT) fashion, so it is well suited for cryptographic applications deployed in cloud computing environments to provide Security-as-a-Service (SECaaS). Unfortunately, GPU-based applications are vulnerable to many known attacks, as shown in Di et al. (2016); Naghibijouybari et al. (2018); Jiang et al. (2016). Among those published vulnerabilities of GPUs, side-channel vulnerabilities are the most serious ones due to their non-invasiveness to target devices. In recent years, the study of side-channel attacks against cryptographic implementations has been a research hotspot of cryptanalysis beyond algebraic analysis methods. As the most popular block cipher, AES has been widely deployed on a variety of hardware platforms. Side-channel attacks against CPU-based AES software implementations and FPGA-based hardware implementations have been investigated in depth. Only very recently has the literature reported that GPU-based cryptographic implementations are also susceptible to side-channel attacks through electro-magnetic (EM) emanation, power consumption (Luo et al. 2015) or execution time leakages (Jiang et al. 2016). Thanks to the bitsliced cipher proposed by Biham (1997) as well as the efficient GPU-based bitsliced AES implementations proposed by Lim et al. (2016) and Nishikawa et al. (2017), we are capable of deploying GPU-based AES cryptosystems that are resistant to cache-timing attacks (Jiang et al. 2016) and cache-based EM attacks. However, this does not mean that a GPU-based AES implementation is invulnerable to other side-channel attacks, for example the power analysis attack proposed in (Luo et al. 2015), though that attack is not efficient in practice. In light of this, we study more efficient side-channel attacks against a GPU-based bitsliced AES implementation in order to give a deeper insight into the side-channel vulnerabilities of GPU-based cryptographic implementations from the perspective of parallelism.

*Correspondence: zhouyongbin@iie.ac.cn
1 State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

Related work
Luo et al. proposed the first power analysis attack against a GPU-based AES implementation in (Luo et al. 2015). They inserted a resistor in series with the power supply in order to measure the power consumption of the GPU card. They targeted a T-table-based GPU AES implementation and built a simplified leakage model to avoid the synchronization of power traces in the time domain. They employed correlation power analysis (CPA) to recover the 16-byte secret key of AES with 160,000 power traces. Their attack is performed in a chosen-thread mode, which requires the adversary to be capable of encrypting the same plaintexts for all block threads. In fact, it is almost impossible to conduct side-channel attacks successfully in known-plaintext or highly-occupied scenarios against GPU-based cryptographic implementations. After that, Jiang et al. proposed two cache-based timing attacks against a T-table-based GPU AES implementation, based on the time differences induced by L1 cache line access serialization (Jiang et al. 2016) and shared memory bank conflicts (Jiang et al. 2017). They recovered the 16-byte secret key of a GPU-based AES implementation by correlation timing analysis and differential timing analysis, respectively. Very recently, Gao et al. proposed electro-magnetic analysis attacks against a GPU-based AES implementation based on cache line access coalescing, which proved to be highly efficient.

Contributions
To the best of our knowledge, this is the first work that investigates EM side-channel vulnerabilities of GPU-based bitsliced cryptographic implementations from the perspective of both bit-level parallelism and thread-level parallelism. The contributions are as follows:
First, we study the vulnerabilities of bit-level parallelism in a GPU-based bitsliced AES implementation. With the help of multi-bit EM leakage features extracted from a non-profiled t-test, we construct two special multi-bit combinational analysis methods, namely multi-bit feature combinational analysis and multi-bit decision combinational analysis, which take full advantage of the bit-level parallelism of bitsliced ciphers and are experimentally shown to be more efficient than traditional single-bit CEMA.
Second, we also study the vulnerabilities of thread-level parallelism in the same AES implementation. In this study, a profiled correlation-based leakage detection test is employed to extract a multi-thread EM leakage feature, which is later used to construct special multi-thread combinational analysis methods. The proposed methods make the best of the thread-level parallelism of GPUs, and our experimental results show that the proposed combinational methods outperform traditional non-combinational ones.

Organization
The rest of this paper is organized as follows. In "Preliminary" section, we give a brief introduction of CUDA-enabled GPUs, a GPU-based bitsliced AES implementation, and definitions and notations involved in this paper. In "EM measurement" section, we present special techniques for leakage acquisition and preprocessing. In "The vulnerabilities of bit-level parallelism" and "The vulnerabilities of thread-level parallelism" sections, we investigate the side-channel vulnerabilities of bit-level parallelism and thread-level parallelism of the GPU-based bitsliced AES implementation, respectively. Finally, conclusions are given in "Conclusions" section.

Preliminary
In this section, we give a brief introduction to the architecture of CUDA-enabled GPUs, the features of GPU-based bitsliced AES implementation as well as the definitions and notations involved in this paper.

CUDA-enabled GPU
Compute Unified Device Architecture (CUDA) is a general-purpose parallel computing framework and programming model developed by NVIDIA for its GPUs. In a physical view, a CUDA-enabled GPU is composed of M Streaming Multiprocessors (SMs) and a global memory. Each SM has N Scalar Processors (SPs), a shared memory, several 32-bit registers, and a shared instruction unit. In an abstract view, CUDA defines the threading model, calling conventions and memory hierarchy for programmers.
Warps are the basic unit of execution in an SM. When a grid of thread blocks is launched, the thread blocks in the grid are distributed among SMs. Once a thread block is scheduled to an SM, the threads in the thread block are further partitioned into warps. A warp consists of 32 consecutive threads, and all threads in a warp are executed in SIMT fashion; that is, all threads execute the same instruction, and each thread carries out that operation on its own private data.

GPU-based bitsliced AES implementation
The term bitsliced cipher was first proposed by Biham et al. (1998) referring to the AES candidate Serpent. Strictly speaking, bitslicing is a concept about cryptographic implementation rather than about the cryptographic algorithm or scheme itself.
A bitsliced AES implementation can process more than one 128-bit plaintext in parallel. The degree of parallelism is determined by the word length of the processor. For 32-bit processors, 32 128-bit plaintexts can be encrypted in parallel, which is referred to as bit-level parallelism. The first step of a bitsliced AES implementation is to transpose multiple plaintexts bit by bit in order to fit the bitsliced execution fashion. As shown in Fig. 1, 32 128-bit plaintexts are arranged by row, and each plaintext is written to or read from four 32-bit registers within the GPU. The 32×128 matrix is transposed before the first encryption round, and the inverse transposition is performed after the final encryption round. Obviously, only one forward transposition and one inverse transposition are needed to finish one bitsliced AES encryption on a single GPU thread. For multi-thread execution on a GPU, each thread executes the above process independently, which is referred to as thread-level parallelism. In short, a GPU-based bitsliced AES implementation achieves parallelism in two dimensions, namely bit-level parallelism and thread-level parallelism.
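The transposition described above can be sketched in a few lines of Python. This is a minimal illustration only: the bit ordering (word i collects bit i of every block, and bit j of a word belongs to slice j) is an assumption and need not match the exact register layout of Fig. 1, and the names `bitslice` and `unbitslice` are hypothetical.

```python
import random

WORD_BITS = 32    # word length of the processor, hence 32 slices
BLOCK_BITS = 128  # AES block size in bits

def bitslice(blocks):
    """Transpose 32 128-bit blocks (ints) into 128 32-bit words.

    Assumed layout: word i collects bit i of every block, and
    bit j of that word belongs to block (slice) j.
    """
    assert len(blocks) == WORD_BITS
    words = []
    for i in range(BLOCK_BITS):
        w = 0
        for j, blk in enumerate(blocks):
            w |= ((blk >> i) & 1) << j
        words.append(w)
    return words

def unbitslice(words):
    """Inverse transposition: 128 32-bit words back to 32 blocks."""
    assert len(words) == BLOCK_BITS
    blocks = []
    for j in range(WORD_BITS):
        b = 0
        for i, w in enumerate(words):
            b |= ((w >> j) & 1) << i
        blocks.append(b)
    return blocks

rng = random.Random(0)
plaintexts = [rng.getrandbits(BLOCK_BITS) for _ in range(WORD_BITS)]
sliced = bitslice(plaintexts)     # forward transposition, done once
restored = unbitslice(sliced)     # inverse transposition, done once

# Feeding the same plaintext to all 32 slices makes every word
# either all zeros or all ones.
same_plaintext_words = bitslice([plaintexts[0]] * WORD_BITS)
```

The last line shows why feeding identical plaintexts to all slices is attractive for an attacker: each register then carries a single bit of information replicated 32 times.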

Definitions and notations
Definition 1 For a bitsliced AES implementation on a 32-bit processor, 32 16-byte standard AES states are mapped, as in Fig. 1, to an (8 × 16)-sized matrix.

Definition 3 For a CTSCA against an (m × n)-thread parallel cryptographic implementation, if an attacker is able to assign the same random plaintext to every m consecutive threads, the CTSCA is called an n-group multi-thread CTSCA, or TnG CTSCA for short, and the chosen-thread encryption is called n-group multi-thread encryption, or TnG encryption for short.

Definition 4 For a side-channel attack against a bitsliced cipher implemented on a processor with an (m × n)-bit word length, if an attacker is able to give the same plaintext to every m consecutive slices, the attack is called an n-group multi-slice CSSCA, or SnG CSSCA for short, and the chosen-slice encryption is called n-group multi-slice encryption, or SnG encryption for short.

Fig. 1 Transposition in bitsliced AES implementation. The left side shows the normal byte ordering of the 32 128-bit plaintext blocks in registers. The right side shows the bit-oriented, transposed bitsliced ordering in registers. The transformation from left to right and its reverse are made only once, at the beginning and at the end of encryption, respectively
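The grouping in Definitions 3 and 4 can be illustrated with a small sketch. It assigns one random plaintext to each run of m consecutive threads, so that n groups result; the function name `tng_plaintexts` is hypothetical, and the same logic applies to slices for SnG encryption.

```python
import random

def tng_plaintexts(num_threads, n_groups, rng):
    """Plaintext assignment for TnG encryption.

    num_threads = m * n_groups; every m consecutive threads
    receive the same random 128-bit plaintext.
    """
    if num_threads % n_groups:
        raise ValueError("num_threads must be a multiple of n_groups")
    m = num_threads // n_groups
    group_plaintexts = [rng.getrandbits(128) for _ in range(n_groups)]
    return [group_plaintexts[t // m] for t in range(num_threads)]

rng = random.Random(0)
t1g = tng_plaintexts(32, 1, rng)    # all 32 threads share one plaintext
t2g = tng_plaintexts(32, 2, rng)    # threads 1-16 and 17-32 form two groups
t32g = tng_plaintexts(32, 32, rng)  # every thread has its own plaintext
```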

Notations.
In this paper, · and denote the quantifiers "all" and "any", respectively. For example, if M is a 5 × 10 matrix, then M[ 2, ·] denotes the 10 elements of the second row of M, which is also a 10-dimensional vector, and M[ , 3] denotes any one of the 5 elements in the third column of M, which is also regarded as a set composed of 5 elements.

EM measurement
Electro-magnetic emanation around electronic devices can be captured without difficulty, but measuring useful signals is not so easy. Compared with power analysis, EM analysis enables us to take advantage of localization effects, which makes EM attacks more efficient than power analysis attacks. We use two small magnetic probes, Rohde Schwarz RF B 3-3 and Rohde Schwarz RS H 2.5-2, instead of larger ones in order to probe localized leakages from near-field emanation (Agrawal et al. 2002). Theoretically, the region located less than 1/2π of a wavelength away from the source is called the near field. All probing in this work is conducted in this region.
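As a worked illustration of the near-field criterion, the boundary distance follows directly from the signal frequency. The frequency of 100 MHz used here is an assumed example value, not a measured property of the target device.

```latex
\lambda = \frac{c}{f}, \qquad r_{\text{near}} < \frac{\lambda}{2\pi}
\quad\Longrightarrow\quad
f = 100\ \text{MHz}:\;
\lambda \approx \frac{3.0 \times 10^{8}\ \text{m/s}}{1.0 \times 10^{8}\ \text{Hz}} = 3.0\ \text{m},
\qquad
\frac{\lambda}{2\pi} \approx 0.48\ \text{m}.
```

Under this assumption, a probe held a few millimetres above a chip or capacitor lies well inside the near field.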

Set-ups
The testbed in this work is set up with the following configurations:
- We target an NVIDIA GeForce GT 620 graphics card connected to the host via a PCI-e bus. Though the device is of low performance, it is sufficient to show the vulnerability of NVIDIA GPUs to EM attacks. Specifically, the device has one streaming multiprocessor with 48 SPs and an L2 cache of 64KiB, and it is equipped with an off-chip device memory of 454MiB. The device runs at 1.27 GHz.
- We port a bitsliced AES implementation from an open source community (Patrick) to the GPU. Since it is a table-free implementation, we do not need to consider the efficiency of table look-ups with respect to different types of memory. The device memory of our GPU is used to store the plaintexts to be encrypted as well as the ciphertexts to be produced.
- We employ an Agilent DSO9104A digital oscilloscope, which is capable of measuring signals with a sampling rate of up to 20 GSa/s. We set our sampling rate to 200 MSa/s, which turns out to be enough for our experiments.
Our testbed is set up in a client/server mode, which is widely used for internet applications. Specifically, in a cloud computing environment, cloud devices that provide SECaaS work as servers, and inside attackers (Duncan et al. 2015) are authorized to encrypt any plaintexts P and then obtain the corresponding ciphertexts C and measured EM traces T. With a sufficient number of triples (P, C, T), the attackers attempt to recover the preset secret key of our GPU bitsliced AES implementation.
Locate Signals. A printed circuit board such as a GPU card is usually composed of hundreds of electronic components such as chips, resistors, capacitors and inductors. However, it is not necessary to check all of them to locate target signals. Generally speaking, only the area right above the GPU chip and the capacitors on the back of the GPU chip need to be checked, because these positions or components tend to produce useful leakages, which is also confirmed afterwards in our experiments. More specifically, we start the CUDA program and run the encryption procedure in a loop. We move the EM probe over the candidate components within their near-field zones until we find a position at which the oscilloscope captures a periodic signal. If some pattern within the signal repeats nine to ten times, which matches the ten rounds of AES-128, a leakage position is found. A repeated signal from our experiments is shown in Fig. 2. We call it the target signal.
Collect Signals. Although the target signal is identified, it is still not easy to capture it without external triggers. In fact, it is impractical to provide an external trigger controlled from within the program, so we design a delicate trigger with another magnetic probe. As shown in Fig. 2, two signals measured at different probing positions look similar, and the amplitude of the upper one is generally smaller than that of the lower one. However, the two signals share a pattern of the same high voltage, marked as Trigger A and Trigger B. Because the trigger pattern stands out more clearly from the rest of the signal in the upper channel, Trigger A is the better choice of trigger for capturing the target signal.
Align Signals. With our delicate trigger we obtain almost aligned EM traces, but this is still not enough to perform a successful attack. More accurate trace alignment techniques are necessary. Zooming in on the first encryption round of the lower signal in Fig. 2 reveals more details, which are shown in Fig. 3. First, we inspect the special patterns in the signal and find a two-trough pattern (C in Fig. 3) that is shared by all traces. This pattern is very likely an ideal reference for aligning all traces. Second, we match the pattern among several traces and find that the patterns in different traces are strongly correlated (Pearson Correlation Coefficient, PCC > 0.70). Third, we fix one trace and slide each of the others within a small range to find the position at which its pattern has the maximum PCC with the pattern in the fixed trace. We exclude the traces whose maximum PCC is less than 0.70. The traces with a maximum PCC of no less than 0.70 are then aligned properly.

Fig. 3 Overview of the EM measurement of the first round AES encryption
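The sliding search in the third alignment step can be sketched as follows. This is a simplified illustration on synthetic data: the two-trough shape, the noise level, the trace length and the search range are all assumed values, and `best_shift` is a hypothetical helper, not the tooling used in the experiments.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def best_shift(trace, pattern, pos, max_shift, threshold=0.70):
    """Slide a window around `pos` and return (shift, pcc) of the best
    match with `pattern`, or None if the best PCC is below the threshold."""
    best, best_pcc = 0, -1.0
    for s in range(-max_shift, max_shift + 1):
        start = pos + s
        if start < 0 or start + len(pattern) > len(trace):
            continue
        pcc = pearson(trace[start:start + len(pattern)], pattern)
        if pcc > best_pcc:
            best, best_pcc = s, pcc
    return None if best_pcc < threshold else (best, best_pcc)

rng = random.Random(1)
TWO_TROUGHS = [0, -1, -2, -1, 0, -1, -2, -1, 0]  # assumed pattern shape

def make_trace(offset, length=100, sigma=0.05):
    """Synthetic trace: Gaussian noise plus the pattern at `offset`."""
    trace = [rng.gauss(0, sigma) for _ in range(length)]
    for i, v in enumerate(TWO_TROUGHS):
        trace[offset + i] += v
    return trace

reference = make_trace(50)                       # the fixed trace
ref_pattern = reference[50:50 + len(TWO_TROUGHS)]
shifted = make_trace(53)                         # pattern arrives 3 samples late
result = best_shift(shifted, ref_pattern, 50, max_shift=5)
rejected = best_shift([0.0] * 100, ref_pattern, 50, max_shift=5)
```

A trace for which `best_shift` returns a shift s is aligned by moving it s samples to the left; a trace for which it returns None is discarded, mirroring the 0.70 exclusion rule.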

The vulnerabilities of bit-level parallelism
A bitsliced implementation features bit-level parallelism, which means that the bits located at the same position of multiple plaintexts are processed simultaneously at any moment. As is known, the 16-byte state of a standard AES implementation is processed byte by byte, so its EM leakage at any moment is a function of one byte. However, in a bitsliced AES implementation the bits at the same position of the standard states from multiple slices are gathered into a single register of the processor, so the EM leakage at any moment is a function of several bits from the standard states of multiple slices. Therefore, in the bitsliced version each secret key byte is likely leaked at eight different moments at least, which makes it possible for specific multi-bit combinational analysis methods to be effective. In this section, we use combinational analysis techniques to investigate the side-channel vulnerabilities of a GPU-based bitsliced AES implementation from the perspective of the bit-level parallelism provided by the intrinsic properties of bitsliced ciphers.
We analyze the output of SubByte in the first AES round of a bitsliced AES implementation (Patrick). The C code snippet of the SubByte of the implementation is shown in Fig. 4, where U and S are the input and output of the function, respectively, and Tx and Lxx are temporary variables. The function processes the SubByte of one byte for multiple slices. A word_t is 32 bits long on the target GPU, so 32 slices are processed simultaneously in any single GPU thread.
We note that for each byte in the standard AES state the eight bits are processed independently, and the same bit of multiple slices is processed simultaneously. This fact is of great importance for the research in this section.

Non-profiled leakage detection test
Since the 128 bits in the standard AES state are processed at different instants of time, the leakages from the 128 bits do not overlap in the time domain. Therefore, it is possible to apply a non-profiled leakage detection test to each of the 128 bits to evaluate the amount of EM leakage from each individual bit. For other specific bitsliced AES implementations, the test results may differ considerably from ours, but the method itself works as well.
Specifically, Welch's t-test (Goodwill et al. 2011) is employed to detect the EM leakages from each of the 128 bits. A non-profiled leakage detection test requires specifying the intermediates to be tested. In our test, we take the 128 bits after the SubByte operation as the target intermediates.
Because of the multi-slice plaintexts used in bitsliced AES, we take the CSSCA and CTSCA modes into account when performing a test. As shown in Fig. 1, the state of 32-slice bitsliced AES encryption can be formalized as a 128 × 32 binary matrix:

$$I = \begin{pmatrix} b_0^{1} & b_0^{2} & \cdots & b_0^{32} \\ b_1^{1} & b_1^{2} & \cdots & b_1^{32} \\ \vdots & \vdots & \ddots & \vdots \\ b_{127}^{1} & b_{127}^{2} & \cdots & b_{127}^{32} \end{pmatrix} \qquad (2)$$

where $b_i^j$ corresponds to the (i + 1)-th bit in the standard AES state of the j-th slice. The 32 bits in each row of I are stored in a separate register, so 128 registers are required to hold the (128 × 32)-bit bitsliced AES state. To reduce noise, our test is conducted in S1G encryption mode, in which all 32 slices are fed with the same plaintext, so that each of the 128 registers holds either 32 zeros or 32 ones instead of one of $2^{32}$ possible values. The rationale is that distinguishing one of two values from the other is much easier than distinguishing one of $2^{32}$ values from the others. To further reduce noise, the 32 threads in a warp also run the same plaintexts, which is referred to as T1G encryption mode.

Fig. 4 The C code snippet of the SubBytes of a bitsliced AES

Test Method
The procedure of the t-test consists of three steps. First, for our target implementation, the t-test is performed on each target bit: the measured traces are partitioned into two sets $G_0$ and $G_1$ according to the value of that bit. Then, the t-statistic at each sampling point is computed:

$$t(\tau) = \frac{\mu_0(\tau) - \mu_1(\tau)}{\sqrt{\dfrac{s_0^2(\tau)}{n_0} + \dfrac{s_1^2(\tau)}{n_1}}}$$

where $\mu_0(\tau)$ and $\mu_1(\tau)$ are the means of $G_0$ and $G_1$ at time τ (that is, at the τ-th sample point, so τ ∈ [1, M] ∩ Z), $s_0(\tau)$ and $s_1(\tau)$ are the standard deviations of $G_0$ and $G_1$ at time τ, and $n_0$ and $n_1$ are the cardinalities of $G_0$ and $G_1$. Last, we determine whether the two sets $G_0$ and $G_1$ are sampled from the same population or not. Generally speaking, two sets are assumed to be sampled from two distinct populations if the statistic satisfies |t(τ)| > 4.5 at some τ (Schneider and Moradi 2015). We also follow this convention in our test.
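The three steps can be sketched on simulated traces as follows. The data are synthetic and purely illustrative: the trace length, the leaking sample index, the leakage amplitude and the noise level are assumed values, and the partition by the value of one target bit is modelled by a random bit per trace.

```python
import math
import random

def welch_t(g0, g1):
    """Welch's t-statistic at every sample point for two trace sets."""
    n0, n1 = len(g0), len(g1)
    stats = []
    for tau in range(len(g0[0])):
        c0 = [trace[tau] for trace in g0]
        c1 = [trace[tau] for trace in g1]
        mu0, mu1 = sum(c0) / n0, sum(c1) / n1
        var0 = sum((x - mu0) ** 2 for x in c0) / (n0 - 1)  # sample variance
        var1 = sum((x - mu1) ** 2 for x in c1) / (n1 - 1)
        stats.append((mu0 - mu1) / math.sqrt(var0 / n0 + var1 / n1))
    return stats

rng = random.Random(2)
M = 20        # samples per trace (assumed)
LEAK_AT = 7   # sample index at which the target bit leaks (assumed)

g0, g1 = [], []
for _ in range(2000):
    bit = rng.getrandbits(1)                 # value of the target bit
    trace = [rng.gauss(0, 1) for _ in range(M)]
    trace[LEAK_AT] += 0.5 * bit              # simulated bit-dependent leakage
    (g1 if bit else g0).append(trace)        # step 1: partition by bit value

t = welch_t(g0, g1)                                      # step 2: t-statistic
leaky = [tau for tau, v in enumerate(t) if abs(v) > 4.5] # step 3: threshold
```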
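The t-statistics are collected in a matrix T with one row per target bit, where the j-th element of any row corresponds to t(j). As a matter of fact, the values of the t-statistic in the time domain are not that important, because we intend to determine whether, rather than when, any of the 128 intermediate bits is leaked. Hence, the maximum of the values in each row of T is sufficient, and we denote the maximum of the row belonging to the (i + 1)-th bit by $t_i$.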

Test Results and Discussions
Each group of 8 consecutive values in $(t_0, t_1, t_2, \ldots, t_{127})^T$ (Eq. 5), for example $t_0, t_1, \ldots, t_7$, corresponds to one byte of the standard AES state. As is known from the standard AES algorithm, these 8 values all depend on the same key byte, so each key byte is leaked through at least 8 intermediate bits that undergo the same operation. This is the essence of the distinctive leakage feature provided by bit-level parallelism. In this study, the distinctive leakage feature is formally defined as a (16 × 8) matrix:

$$\begin{pmatrix} t_0 & t_1 & t_2 & \cdots & t_7 \\ t_8 & t_9 & t_{10} & \cdots & t_{15} \\ t_{16} & t_{17} & t_{18} & \cdots & t_{23} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_{120} & t_{121} & t_{122} & \cdots & t_{127} \end{pmatrix} \qquad (6)$$

A simple single-bit correlation analysis
We learn from the above that all 16 secret key bytes are leaked from the most significant bit (MSB) of the respective intermediate bytes, because all elements in the 8th column of F are equal to 1. Therefore, it is possible to recover all 16 secret key bytes if an MSB-based correlation EM analysis (Brier et al. 2004) (MSB-CEMA) is employed. In the leakage model of the MSB-CEMA, $L_n$ denotes the predicted EM leakage of the n-th intermediate byte, I[ ·, ·] denotes the intermediate matrix in Eq. 2, a is a scale factor whose value is insignificant, and $N_{noise}$ is Gaussian noise. The single-bit CEMA described above makes use of the leakage from the MSB of each intermediate byte only. In fact, each secret key byte is leaked from more than one bit of the relevant intermediate byte. For example, the first secret key byte is leaked from the 1st, 2nd, 6th and 8th bit, and the second secret key byte is leaked from the 1st and 4th bit. Therefore, it is possible to take full advantage of the multi-bit leakages of a given key byte to improve performance.
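A single-bit CEMA of this kind can be sketched on simulated data. The sketch is an illustration under stated assumptions rather than the attack code used in the experiments: each trace is reduced to one leakage sample, the leakage is simulated as the Hamming weight of a register holding 32 copies of the MSB (S1G mode) plus Gaussian noise, and the key byte, trace count and noise level are made-up values. The S-box is generated from its standard definition instead of being hard-coded.

```python
import math
import random

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        y = inv
        for s in range(1, 5):             # XOR of four left rotations
            y ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox

SBOX = build_sbox()

def msb(x):
    return (x >> 7) & 1

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def msb_cema(plaintext_bytes, leakages):
    """Return the key guess whose MSB prediction correlates best."""
    best_guess, best_corr = None, -1.0
    for guess in range(256):
        predicted = [msb(SBOX[p ^ guess]) for p in plaintext_bytes]
        corr = abs(pearson(predicted, leakages))
        if corr > best_corr:
            best_guess, best_corr = guess, corr
    return best_guess, best_corr

rng = random.Random(3)
TRUE_KEY = 0x3C                            # hypothetical key byte
plaintexts = [rng.randrange(256) for _ in range(1000)]
# Simulated leakage: 32 identical bits in a register, plus noise.
leakages = [32 * msb(SBOX[p ^ TRUE_KEY]) + rng.gauss(0, 32) for p in plaintexts]
recovered, corr = msb_cema(plaintexts, leakages)
```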

Multi-bit combinational analysis
The first combinational method we propose is multi-bit feature combinational analysis (MB-FCA). As its name suggests, MB-FCA makes the best of the matrix shown in Eq. 6 to determine which of the eight bits is used to compute the predicted leakage. Specifically, for the n-th intermediate byte the bit $b_n$ is selected as the one whose entry in the n-th row of that matrix indicates the maximum amount of EM leakage. The rationale is that the intermediate bit with the maximum amount of EM leakage is the best candidate to predict the measured EM leakages.
Let $B := [b_1, b_2, b_3, \ldots, b_{16}]$, which we call the combined multi-bit leakage feature (cMBLF). B is the essence of the MB-FCA and is used to construct its leakage model. With the leakage model derived from the cMBLF, a simple CEMA suffices to recover the 16-byte secret key.
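The feature extraction can be sketched as follows. Two points are assumptions made for this sketch only: the binary matrix F is taken to mark the entries whose absolute t-value exceeds 4.5, and "maximum amount of leakage" is taken to mean the largest absolute t-value. The t-values themselves are made up, and bits are numbered 1 to 8 with the 8th being the MSB.

```python
THRESHOLD = 4.5  # detection threshold of the t-test

def leakage_features(t_values):
    """Derive F, the per-byte leaky-bit lists J and the cMBLF B
    from the 128 per-bit t-values (layout of the 16 x 8 matrix)."""
    assert len(t_values) == 128
    rows = [t_values[8 * n: 8 * n + 8] for n in range(16)]
    # Assumed definition: F marks entries above the threshold.
    F = [[1 if abs(t) > THRESHOLD else 0 for t in row] for row in rows]
    # J[n]: 1-based indices of the leaky bits of byte n.
    J = [[i + 1 for i, flag in enumerate(row) if flag] for row in F]
    # B[n]: leaky bit with the largest |t|, or None if no bit leaks.
    B = [max(j_n, key=lambda i, row=row: abs(row[i - 1])) if j_n else None
         for row, j_n in zip(rows, J)]
    return F, J, B

# Made-up t-values: three hand-written rows, then 13 rows in which
# only the 8th bit exceeds the threshold.
t_values = (
    [5.0, 6.2, 1.0, -0.3, 2.1, 4.9, 0.5, 9.0]
    + [0.2] * 8
    + [-6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0]
    + ([0.1] * 7 + [7.0]) * 13
)
F, J, B = leakage_features(t_values)
```

In this toy input the second byte has no entry above the threshold, so its element of B is None and a complete key recovery based on these features would not be possible.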
The second combinational method we propose is multi-bit decision combinational analysis (MB-DCA). Compared with MB-FCA, MB-DCA does not select an intermediate bit before modeling leakages, but tries every leaked bit and then makes a decision based on the results of the respective analyses.
The first step of MB-DCA is to perform an analysis on every leaked intermediate bit based on the MBLF. In the corresponding leakage model (Eq. 12), $J_n\{i\}$ denotes the i-th element in $J_n$. Obviously, it is feasible to recover the 16 secret key bytes of AES. More details about the MB-DCA are shown in Algorithm 1.

Experimental results and discussion
Since the number of multi-slice groups and multi-thread groups for the GPU-based bitsliced AES encryption of a (32 × 32 × 16)-byte plaintext within a GPU warp has no bearing on MB-FCA and MB-DCA, our experiments are performed in a simplified mode, namely T1G-S1G encryption mode. Therefore, only one 128-bit plaintext is required in order to obtain one EM trace. As mentioned above, an MBLF and a cMBLF must be extracted from F and the matrix in Eq. 6, respectively, before the MB-FCA or MB-DCA is applied. Finally, we obtain an MBLF J and a cMBLF B in our setting: [ , 8, 7, 8, 1, 8, 8, 8, 1, 8, 7, 8, 1, 8, 7, 8]

We compare the performance of MB-FCA and MB-DCA with MSB-CEMA in our experiments. The experimental results show that a complete key-recovery attack with MSB-CEMA requires at least 900 EM traces (Fig. 7), while MB-FCA and MB-DCA are almost equivalent and require at least 500 EM traces. In fact, any scalar of B is very likely to be null if all values of the corresponding row of the matrix in Eq. 6 are within [ −4.5, 4.5]. In this case, the 16 key bytes cannot be recovered completely. Formally, a complete key recovery with MB-FCA or MB-DCA is feasible only if every row of F contains at least one nonzero element.

We have to note that MB-FCA and MB-DCA cannot work if no prior leakage detection test is available, because both methods are based on prior knowledge of the leakage features. That is to say, MB-FCA and MB-DCA are essentially profiled side-channel analysis techniques like template attacks (Chari et al. 2002). However, our methods are more practical than template attacks: for devices of a certain architectural model, the profiling is done only once to extract the leakage features of the architecture before attacking any device of this architecture, whereas profiling and attacking in template attacks usually have to target an identical device. In addition, template attacks require a very low noise level, so they are almost ineffective against GPU-based cryptographic implementations.

The vulnerabilities of thread-level parallelism
A GPU-based bitsliced implementation achieves thread-level parallelism because of the SIMT execution fashion of GPUs, which means that multiple threads execute the same program in parallel. For simplicity, we consider the parallelism within a GPU warp, because the threads in a warp execute the same instruction at any moment if warp divergence does not happen. In other words, the threads in a warp are fully synchronized in the time domain. However, full synchronization among multiple thread executions does not necessarily imply full synchronization of their respective EM leakages. In this section, we investigate the vulnerabilities of a GPU-based bitsliced AES implementation from the perspective of thread-level parallelism.

Developing a simple attack
As mentioned above, multiple thread executions within a GPU warp are synchronous, so their leakages are usually assumed to be synchronous as well. Suppose the Hamming weight leakages of multiple thread executions are summable, that is,

$$E\left(\sum_{n=1}^{N} \mathrm{HW}(I_n)\right) = \sum_{n=1}^{N} E\big(\mathrm{HW}(I_n)\big),$$

where $I_1, I_2, I_3, \ldots, I_N$ are some intermediates within N threads, and E denotes an ideal EM leakage model without any noise, say E(x) = a · x for a positive a. Under this assumption we construct a simple synchronous model (SSM) for multi-thread leakages in a warp, in which $L_n^m$ denotes the predicted leakages based on the m-th intermediate bit.

The above SSM rests on another assumption, namely that the Hamming weight leakages in different threads are of the same scale. Otherwise, the model has to give different weights $a_1, a_2, \ldots, a_{32}$ to the predicted leakages from the 32 threads, which should be more accurate than the SSM. Nevertheless, we will not study any analysis methods based on such a weighted model, because the ratio $a_1 : a_2 : a_3 : \ldots : a_{32}$ has to be known in advance, which we think is not practical short of exhaustive search.

The SSM assumes that the EM leakages of multiple threads are synchronous. If not all of the multi-thread leakages are synchronous, we arrive at another leakage model called the Partial Synchronous Model (PSM, Eq. 18), which is restricted to a thread subset I ⊂ {1, 2, ..., 32} when the EM leakages of 32 threads are considered. Obviously, the key problem of constructing a PSM is to obtain the set I in some way.
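The difference between the two models can be illustrated with a small simulation. It is a sketch under assumptions chosen for illustration: every thread processes its own random bit, a register in S1G mode has Hamming weight 0 or 32, only a hypothetical subset of threads actually leaks at the observed sample, and the noise level is made up.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

THREADS = 32

def ssm_prediction(hw_per_thread):
    """SSM: all threads of the warp are assumed to leak together."""
    return sum(hw_per_thread)

def psm_prediction(hw_per_thread, subset):
    """PSM: only the threads in `subset` (1-based) contribute."""
    return sum(hw_per_thread[i - 1] for i in subset)

rng = random.Random(4)
LEAKY = [1, 2, 17, 18]   # hypothetical subset that leaks at this sample

pred_ssm, pred_psm, measured = [], [], []
for _ in range(1000):
    # Hamming weight of one S1G register per thread: 0 or 32.
    hw = [32 * rng.getrandbits(1) for _ in range(THREADS)]
    pred_ssm.append(ssm_prediction(hw))
    pred_psm.append(psm_prediction(hw, LEAKY))
    measured.append(psm_prediction(hw, LEAKY) + rng.gauss(0, 32))

rho_ssm = pearson(pred_ssm, measured)
rho_psm = pearson(pred_psm, measured)
```

When only part of the warp leaks at a given sample, the SSM prediction mixes in 28 unrelated threads, so its correlation with the measurement drops well below that of the PSM restricted to the leaky subset.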

Evaluations and Discussions
In order to evaluate the performance of SSM-based CEMA (SSM-CEMA) against the GPU-based bitsliced AES implementation, we set up experiments in T2G-S1G, T8G-S1G and T32G-S1G encryption modes. As shown in Fig. 8, the 16 secret key bytes of AES can be recovered with about 1,500 EM traces in T2G-S1G mode, while none of the 16 secret key bytes is recovered with up to 4,000 EM traces in T8G-S1G or T32G-S1G mode. This suggests that the performance of SSM-CEMA degrades as the number of thread groups grows, because more noise is introduced and the SSM becomes less accurate. The latter reason is verified in the next experiment. In order to verify the accuracy of the SSM in T2G-S1G encryption mode, we compare the SSM with a PSM that selects half of the 32 threads in a warp. As shown in Fig. 9, the PSM-CEMA with the first half of the 32 threads performs better than the SSM-CEMA, which suggests that the PSM is more accurate than the SSM. Therefore, the EM leakages in a warp may not be synchronous, and a further study is needed.

Fig. 8 The experimental results of SSM-CEMA in T2G-S1G, T8G-S1G and T32G-S1G encryption mode, respectively

Profiled correlation-based leakage detection test
To further understand the nature of the above parallel EM leakage, we employ a profiled leakage detection test to analyze the individual leakages of multiple threads in a warp. With profiled methods, we do not have to make any assumptions about the leakage model of the target implementation, which lowers the prerequisites for an attack and simplifies the procedure. The profiled ρ-test method we use is originally due to Durvaux and Standaert (2016). The method takes advantage of the cross-validation techniques introduced in Durvaux et al. (2014) and is applied to the leakage detection of all threads in a warp. For the leakage test of one thread, the leakages from all other threads are treated as random noise. Specifically, the ρ-test is carried out in three steps: First, N EM traces with random plaintext inputs are sampled. For k-fold cross-validation, the set of acquired traces T is split into k (we set k = 10) non-overlapping subsets $T^{(1)}, T^{(2)}, \ldots, T^{(k)}$ of (approximately) the same size. For i = 1, 2, 3, ..., k, we define the profiling set $T_p^{(i)}$ as the union of all subsets except $T^{(i)}$, which is held out for testing. For each target plaintext byte variable $X_{m,n}$ with m ∈ {1, 2, 3, ..., 32}, n ∈ {1, 2, 3, ..., 16} and for each cross-validation set j with j ∈ {1, 2, 3, ..., k}, a model of $X_{m,n}$ is estimated from the profiling set. For 8-bit plaintext bytes, this model corresponds to the sample means of the leakage sample τ corresponding to each value of the plaintext byte.
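A cross-validated ρ-test at a single sample point can be sketched as follows. Only the first step (splitting and model estimation) is spelled out in the text above; the remaining steps in this sketch, correlating the model's predictions with the held-out leakages and averaging over the k folds, follow the usual form of such tests and should be read as an assumption. The simulated leakage (Hamming weight of a byte plus Gaussian noise) is illustrative only.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def rho_test(values, samples, k=10):
    """Cross-validated correlation between a mean-per-value model
    and held-out leakages at one sample point."""
    n = len(values)
    folds = [list(range(j, n, k)) for j in range(k)]  # non-overlapping
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        sums, counts = {}, {}
        for i in range(n):                 # profiling set: all other folds
            if i in held:
                continue
            v = values[i]
            sums[v] = sums.get(v, 0.0) + samples[i]
            counts[v] = counts.get(v, 0) + 1
        overall = sum(sums.values()) / sum(counts.values())
        model = {v: sums[v] / counts[v] for v in sums}  # mean per byte value
        predicted = [model.get(values[i], overall) for i in held_out]
        observed = [samples[i] for i in held_out]
        total += pearson(predicted, observed)
    return total / k

rng = random.Random(5)
N = 5000
xs = [rng.randrange(256) for _ in range(N)]             # plaintext bytes
leaky_sample = [bin(x).count("1") + rng.gauss(0, 1) for x in xs]
silent_sample = [rng.gauss(0, 1) for _ in xs]           # no dependence on x

rho_leaky = rho_test(xs, leaky_sample)
rho_silent = rho_test(xs, silent_sample)
```

Repeating this for every sample point τ and every thread's plaintext byte yields one ρ curve per thread, which is the kind of plot discussed next.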
In our experiment, the ρ-statistics of the 32 individual tests at each time τ are evaluated and plotted within a single figure, Fig. 10. It shows that there are approximately five groups of leakage points, marked as POI 1, POI 2, POI 3, POI 4 and POI 5, respectively. Around each of the five groups multiple colors accumulate, which suggests that the leakages from the 32 threads happen at the same time. However, this is not the case when zooming in on any of the five groups of leakage points. As shown in Fig. 11, the details of POI 1, POI 2, POI 4 and POI 5 reveal that not all leakages from multiple threads are synchronous, contrary to what is usually assumed. This discovery is important because synchronous executions are normally expected to generate synchronous leakages rather than asynchronous ones. It is obvious that the executions of T 1 and T 2 almost overlap, as do those of T 17 and T 18. Since the absence of an indication in the figure does not rule out leakage, we still do not know whether the threads other than T 1, T 2, T 17 and T 18 leak or not. We do not know why the device leaks information in these particular threads, and we think it may have something to do with the hardware architecture of the CUDA-enabled GPU as well as the target software implementation.

Fig. 12 The experimental results of SSM-CEMA, wPSM-CEMA and MT-DCA in T32G-S1G encryption mode

Multi-thread combinational analysis
The above experiments have demonstrated that the EM leakages from multiple threads in a warp are not exactly synchronous. To make full use of the EM leakages from multiple asynchronous threads, we propose two combinational methods.
The first method we propose is Multi-Thread Decision Combinational Analysis (MT-DCA), which is based on the wMTLF. If the j-th bit of an intermediate byte is used to model the leakage of a single thread, then MT-DCA is based on the following leakage model: To recover the n-th secret key byte, | n | CEMAs should be performed before making a decision: In this way, all 16 secret key bytes of AES can be recovered.
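A decision-level combination of per-thread CEMAs can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: each leaky thread contributes one known plaintext byte per trace, the targeted intermediate is the first-round S-box output, the single-bit model is its MSB, and the decision rule sums each key guess's peak |ρ| over the leaky threads. The names `cema`, `mt_dca`, `msb_of_sbox` and `simulate` are hypothetical.

```python
import math
import random


def make_sbox():
    """Build the AES S-box from GF(2^8) inversion followed by the affine map."""
    exp, log = [0] * 255, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x ^= ((x << 1) & 0xFF) ^ (0x1B if x & 0x80 else 0)  # multiply by 3
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        s = inv
        for r in range(1, 5):                 # XOR with four left rotations
            s ^= ((inv << r) | (inv >> (8 - r))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()


def msb_of_sbox(p, k):
    """Predicted MSB of the S-box output for plaintext byte p and key guess k."""
    return SBOX[p ^ k] >> 7


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def cema(plains, traces, predict):
    """Single-bit correlation analysis: rho[k][t] for all 256 key guesses."""
    columns = list(zip(*traces))              # one tuple per time sample
    rho = []
    for k in range(256):
        model = [predict(p, k) for p in plains]
        rho.append([pearson(model, col) for col in columns])
    return rho


def mt_dca(plains_by_thread, traces, leaky, predict):
    """Run one CEMA per leaky thread, then decide on the summed peak |rho| (assumed rule)."""
    scores = [0.0] * 256
    for t in leaky:
        rho = cema(plains_by_thread[t], traces, predict)
        for k in range(256):
            scores[k] += max(abs(r) for r in rho[k])
    return max(range(256), key=scores.__getitem__)


def simulate(key, leak_time, n=300, n_samples=4, sigma=0.5, seed=1):
    """Toy traces (assumption): thread t leaks its MSB at sample leak_time[t]."""
    rng = random.Random(seed)
    plains = {t: [rng.randrange(256) for _ in range(n)] for t in leak_time}
    traces = []
    for i in range(n):
        trace = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
        for t, tau in leak_time.items():
            trace[tau] += msb_of_sbox(plains[t][i], key)
        traces.append(trace)
    return plains, traces
```

Because each thread is analyzed separately and only the per-thread outcomes are merged, this kind of combination does not require the threads to leak at the same time sample.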
The above MT-DCA depends on the wMTLF, which indicates which of the 32 threads in a warp are leaky but not whether their leakages are synchronous. To make full use of the synchronization of leakages, we propose a second combinational method, called Multi-Thread Hybrid Combinational Analysis (MT-HCA), which is based on the sMTLF. If the j-th bit of an intermediate byte is used to model the leakage of a single thread, then MT-HCA is based on the following leakage model: where w ∈ {1, 2, ..., | n {w, ·}|} and $L_n^w$ denotes the predicted leakage of the j-th bit of the n-th intermediate in the w-th group of threads whose leakages are considered synchronous.
As in MT-DCA, | n | CEMAs should be performed before making a decision: where $\rho_n^w$ is a matrix of Pearson correlation coefficients obtained from the CEMAs based on the model $L_n^w$. In this way, all 16 secret key bytes of AES can be recovered. More details of MT-HCA are shown in Algorithm 2.
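The two hybrid-specific ingredients can be sketched separately: a model-level combination inside each synchronous group and a decision-level combination across groups. Both rules are assumptions made for illustration, since the exact leakage model and decision formula are not reproduced here: the group model sums the predicted bits of the threads in the group, and the decision sums each key guess's peak |ρ| over the groups. `group_model`, `hybrid_decision` and the `predict` callable are hypothetical names; the per-group correlation matrices are assumed to come from ordinary CEMAs.

```python
def group_model(plains_by_thread, group, key_guess, predict):
    """Predicted leakage of one synchronous group, per trace.

    Assumed rule: threads that leak at the same time add up, so the
    prediction is the sum of the predicted bits of all threads in the group.
    """
    n_traces = len(plains_by_thread[group[0]])
    return [sum(predict(plains_by_thread[t][i], key_guess) for t in group)
            for i in range(n_traces)]


def hybrid_decision(rho_by_group):
    """Pick a key guess from per-group correlation matrices.

    rho_by_group[w][k][t] is the correlation for group w, key guess k and
    time sample t. Assumed rule: sum the peak |rho| of each guess over groups.
    """
    n_keys = len(rho_by_group[0])
    scores = [sum(max(abs(r) for r in rho[k]) for rho in rho_by_group)
              for k in range(n_keys)]
    best = max(range(n_keys), key=scores.__getitem__)
    return best, scores
```

For example, with a stand-in predictor that returns the MSB of `p ^ k`, two threads holding the bytes `[0x80, 0x00]` and `[0x80, 0x80]` give the group model `[2, 1]` for key guess `0x00`.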

Experimental results and discussion
Since both MT-DCA and MT-HCA take advantage of the EM leakages from multiple threads, our experiments are performed in T32G-S1G encryption mode. MT-DCA and MT-HCA depend on the wMTLF and the sMTLF, respectively, so we extract the wMTLF and sMTLF of our experimental setup. We compare the performance of the five methods shown in Table 1. wPSM-CEMA computes predicted leakages from one thread in the wMTLF, and sPSM-CEMA computes predicted leakages from one group of threads in the sMTLF. Both wPSM-CEMA and sPSM-CEMA are therefore non-combinational methods.

Algorithm 2 Multi-Thread Hybrid Combinational Analysis (MT-HCA)
NOTE: The attack is based on the leakage of the MSB of each intermediate, so α = 8 in the following procedure.
Input: $[p^{(j,l)}_{i,n}]$ with $i \in \{1, 2, \ldots, N\}$, $j, l \in \{1, 2, \ldots, 32\}$, $n \in [0, 15]$
First, we compare the performance of the combinational method with that of the non-combinational ones when the wMTLF is available. The experimental results are shown in Fig. 12. wPSM-CEMA is a wMTLF-based non-combinational method, and it performs worse than the wMTLF-based combinational method MT-DCA.
Second, we compare the performance of the combinational method with that of the non-combinational ones when the sMTLF is available. The experimental results are shown in Fig. 13. sPSM-CEMA is an sMTLF-based non-combinational method, and it performs worse than the sMTLF-based combinational method MT-HCA.
Both MT-DCA and MT-HCA perform better than their non-combinational counterparts because they make use of the leakages of multiple threads instead of a single one. It is reasonable that more usable leakage yields better performance. In addition, the experimental results show that SSM-CEMA is not effective at all with up to 5,000 traces, because SSM-CEMA does not depend on any leakage feature, and thus more noise is introduced when modelling with SSM.

Conclusions
In this paper, we investigate efficient electro-magnetic analysis of a GPU bitsliced AES implementation in order to gain deeper insight into the vulnerabilities arising from bit-level parallelism and thread-level parallelism. We propose GPU-specific combinational analysis methods and show experimentally that they are more efficient than their non-combinational counterparts. Our research suggests that multi-thread leakages can be used to improve attacks if they are not synchronous in the time domain.