
# Efficient electro-magnetic analysis of a GPU bitsliced AES implementation

*Cybersecurity*
**volume 3**, Article number: 3 (2020)

## Abstract

The advent of CUDA-enabled GPUs makes it possible to provide cloud applications with high-performance data security services. Unfortunately, recent studies have shown that GPU-based applications are also susceptible to side-channel attacks. These published works studied the side-channel vulnerabilities of GPU-based AES implementations by taking advantage of cache sharing among multiple threads or of the high parallelism of GPUs. Therefore, for GPU-based bitsliced cryptographic implementations, which are immune to the cache-based attacks referred to above, only a power analysis method based on the high parallelism of GPUs may be effective. However, the leakage model used in that power analysis is inefficient in practice. In light of this, we investigate the electro-magnetic (EM) side-channel vulnerabilities of a GPU-based bitsliced AES implementation from the perspectives of bit-level parallelism and thread-level parallelism, in order to make the best of the localization effect of EM leakage under parallelism. Specifically, we propose efficient multi-bit and multi-thread combinational analysis techniques based on the intrinsic properties of bitsliced ciphers and the effect of multi-thread parallelism of GPUs, respectively. The experimental results show that the proposed combinational analysis methods perform better than non-combinational and intuitive ones. Our research suggests that multi-thread leakages can be used to improve attacks if they are not synchronous in the time domain.

## Introduction

Nowadays, as the most widely used parallel computing platform, the Graphics Processing Unit (GPU) has evolved from special-purpose hardware for graphics rendering into a general-purpose computing device for applications such as biomedical analysis, signal processing and scientific computing. A GPU executes programs in a Single-Instruction, Multiple-Thread (SIMT) fashion, so it is well suited to cryptographic applications deployed in cloud computing environments to provide Security-as-a-Service (SECaaS). Unfortunately, GPU-based applications are vulnerable to many known attacks, as shown in Di et al. (2016), Naghibijouybari et al. (2018) and Jiang et al. (2016). Among the published vulnerabilities of GPUs, side-channel vulnerabilities are the most serious ones because they are non-invasive to target devices. In recent years, side-channel attacks against cryptographic implementations have been a research hotspot of cryptanalysis beyond algebraic analysis methods. As the most popular block cipher, AES has been widely deployed on a variety of hardware platforms. Side-channel attacks against CPU-based AES software implementations and FPGA-based hardware implementations have been deeply investigated. Only very recently have some studies reported that GPU-based cryptographic implementations are also susceptible to side-channel attacks through electro-magnetic (EM) emanation (Gao et al. 2018; Gao et al. 2018), power consumption (Luo et al. 2015) or execution time leakages (Jiang et al. 2016; 2017). Thanks to the bitsliced cipher proposed by Biham (1997) as well as the efficient GPU-based bitsliced AES implementations proposed by Lim et al. (2016) and Nishikawa et al. (2017), we are capable of deploying GPU-based AES cryptosystems that are resistant to cache-timing attacks (Jiang et al. 2016; 2017) and cache-based EM attacks (Gao et al. 2018; Gao et al. 2018).
However, this does not mean that GPU-based bitsliced AES implementations are immune to other side-channel attacks, for example the power analysis attack proposed in (Luo et al. 2015), though that attack is inefficient in practice. In light of this, we study more efficient side-channel attacks against a GPU-based bitsliced AES implementation in order to give a deeper insight into the side-channel vulnerabilities of GPU-based cryptographic implementations from the perspective of parallelism.

### Related work

Luo et al. proposed the first power analysis attack against a GPU-based AES implementation in (Luo et al. 2015). They inserted a resistor in series with the power supply in order to measure the power consumption of the GPU card. They targeted a *T*-table-based GPU AES implementation and built a simplified leakage model to avoid the synchronization of power traces in the time domain. They employed correlation power analysis (CPA) to recover the 16-byte secret key of AES with 160,000 power traces. Their attack is performed in a chosen-thread mode, which requires the adversary to be capable of encrypting the same plaintexts for all block threads. In fact, it is almost impossible to conduct side-channel attacks successfully against GPU-based cryptographic implementations in known-plaintext or highly-occupied scenarios. After that, Jiang et al. proposed two cache-based timing attacks against a *T*-table-based GPU AES implementation, based on the time differences induced by L1 cache line access serialization (Jiang et al. 2016) and shared memory bank conflicts (Jiang et al. 2017). They recovered the 16-byte secret key of a GPU-based AES implementation by correlation timing analysis and differential timing analysis, respectively. Very recently, Gao et al. proposed electro-magnetic analysis attacks against a GPU-based AES implementation based on cache line access coalescing (Gao et al. 2018; Gao et al. 2018), which proved to be highly efficient.

### Contributions

To the best of our knowledge, this is the first work that investigates the EM side-channel vulnerabilities of GPU-based bitsliced cryptographic implementations from the perspective of both bit-level parallelism and thread-level parallelism. The contributions are as follows:

First, we study the vulnerabilities of bit-level parallelism in a GPU-based bitsliced AES implementation. With the help of a multi-bit EM leakage feature extracted by a non-profiled *t*-test, we construct two special multi-bit combinational analysis methods, namely multi-bit feature combinational analysis and multi-bit decision combinational analysis, which take full advantage of the bit-level parallelism of bitsliced ciphers and are experimentally shown to be more efficient than traditional single-bit CEMA.

Second, we study the vulnerabilities of thread-level parallelism in the same AES implementation. Here, a profiled correlation-based leakage detection test is employed to extract a multi-thread EM leakage feature, which is later used to construct special multi-thread combinational analysis methods. The proposed methods make the best of the thread-level parallelism of GPUs, and our experimental results show that they outperform traditional non-combinational ones.

### Organization

The rest of this paper is organized as follows. In the “Preliminary” section, we give a brief introduction to CUDA-enabled GPUs, a GPU-based bitsliced AES implementation, and the definitions and notations used in this paper. In the “EM measurement” section, we present special techniques for leakage acquisition and preprocessing. In the “The vulnerabilities of bit-level parallelism” and “The vulnerabilities of thread-level parallelism” sections, we investigate the side-channel vulnerabilities of bit-level parallelism and thread-level parallelism of the GPU-based bitsliced AES implementation, respectively. Finally, conclusions are given in the “Conclusions” section.

## Preliminary

In this section, we give a brief introduction to the architecture of CUDA-enabled GPUs, the features of GPU-based bitsliced AES implementation as well as the definitions and notations involved in this paper.

### CUDA-enabled GPU

Compute Unified Device Architecture (CUDA) is a general-purpose parallel computing framework and programming model developed by NVIDIA for its GPUs. In a physical view, a CUDA-enabled GPU is composed of *M* Streaming Multiprocessors (SMs) and a global memory. Each SM has *N* Scalar Processors (SPs), a shared memory, several 32-bit registers, and a shared instruction unit. In an abstract view, CUDA defines the threading model, calling conventions and memory hierarchy for programmers.

Warps are the basic unit of execution in an SM. When a grid of thread blocks is launched, the thread blocks in the grid are distributed among SMs. Once a thread block is scheduled to an SM, the threads in the thread block are further partitioned into warps. A warp consists of 32 consecutive threads, and all threads in a warp are executed in SIMT fashion; that is, all threads execute the same instruction, and each thread carries out that operation on its own private data.

### GPU-based bitsliced AES implementation

The term *bitsliced cipher* was first proposed by Biham et al. (1998) in reference to the AES candidate *Serpent*. Strictly speaking, a bitsliced cipher is a concept of cryptographic implementation rather than of the cryptographic algorithm or scheme itself.

A bitsliced AES implementation can process more than one 128-bit plaintext in parallel. The degree of parallelism is determined by the word length of the processor. For 32-bit processors, 32 128-bit plaintexts can be encrypted in parallel, which is referred to as *bit-level parallelism*. The first step of a bitsliced AES implementation is to transpose multiple plaintexts bit by bit in order to fit the bitsliced execution fashion. As shown in Fig. 1, 32 128-bit plaintexts are arranged by row, and each plaintext is written to or read from four 32-bit registers within the GPU. The 32×128 matrix is transposed before the first encryption round, and the inverse transposition is performed after the final encryption round. Thus, only one forward transposition and one inverse transposition are needed to finish one bitsliced AES encryption on a single GPU thread. For multi-thread execution on a GPU, each thread executes the above process independently, which is referred to as *thread-level parallelism*. In short, a GPU-based bitsliced AES implementation achieves parallelism in two dimensions, namely bit-level parallelism and thread-level parallelism.
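The forward and inverse transpositions can be sketched as follows. This is a minimal illustration, not the code of the target implementation: the function names are hypothetical, and the bit ordering (bit *i* of a plaintext counted from the least significant bit, slice *j* stored in bit *j* of each word) is an assumption, since the actual layout is fixed by Fig. 1.

```python
def bitslice_transpose(plaintexts):
    """Transpose 32 128-bit integers into 128 32-bit words.

    Assumed layout (illustrative only): word i collects bit i of every
    plaintext, and plaintext j occupies bit j of each word.
    """
    assert len(plaintexts) == 32
    words = [0] * 128
    for j, p in enumerate(plaintexts):
        for i in range(128):
            words[i] |= ((p >> i) & 1) << j
    return words


def bitslice_untranspose(words):
    """Inverse transposition: recover the 32 128-bit integers."""
    assert len(words) == 128
    plaintexts = [0] * 32
    for i, w in enumerate(words):
        for j in range(32):
            plaintexts[j] |= ((w >> j) & 1) << i
    return plaintexts
```

When all 32 slices carry the same plaintext, every word produced by `bitslice_transpose` is either all zeros or all ones, which is the property exploited later for noise reduction.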

### Definitions and notations

**Definition 1**

For a bitsliced AES implementation on a 32-bit processor, 32 16-byte standard AES states are mapped, as in Fig. 1, to an (8×16)-sized matrix:

where \(W_{i}^{j}\) is a 32-bit value. The matrix is called the *bitsliced AES state*, as opposed to the standard AES state. In addition, each bit of \(W_{i}^{j}\) is called one *slice*.

**Definition 2**

For a side-channel attack on a multi-thread cryptographic implementation, if an attacker is able to choose the same random plaintexts for some threads and collect the corresponding ciphertexts and side-channel leakages, then the attack is called a *Chosen-Thread Side-Channel Attack* (CTSCA). Note that attackers cannot choose a specific plaintext for any thread in CTSCA mode, which distinguishes it from a chosen-plaintext side-channel attack.

**Definition 3**

For a CTSCA on an **(m×n)**-thread parallel cryptographic implementation, if an attacker is able to assign the same random plaintext to every **m** consecutive threads, the CTSCA is called an **n**-group multi-thread CTSCA, or T**n**G CTSCA for short, and the chosen-thread encryption is called **n**-group multi-thread encryption, or T**n**G encryption for short.

**Definition 4**

For a side-channel attack on a bitsliced cipher implemented on a processor with a word length of **m×n** bits, if an attacker is able to supply the same plaintext to every **m** consecutive slices, the attack is called an **n**-group multi-slice chosen-slice side-channel attack (CSSCA), or S**n**G CSSCA for short, and the chosen-slice encryption is called **n**-group multi-slice encryption, or S**n**G encryption for short.
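The grouping in Definitions 3 and 4 can be illustrated with a short sketch. The helper name `grouped_plaintexts` and its `seed` parameter are hypothetical; the same routine is assumed to apply to threads (T*n*G) and to slices (S*n*G), since both definitions assign one random plaintext to each block of *m* consecutive units.

```python
import random


def grouped_plaintexts(num_units, n_groups, seed=None):
    """Return one 16-byte plaintext per unit (thread or slice).

    Every m = num_units // n_groups consecutive units share the same
    random plaintext, which yields n_groups distinct draws in total.
    """
    if n_groups <= 0 or num_units % n_groups != 0:
        raise ValueError("n_groups must be positive and divide num_units")
    rng = random.Random(seed)
    m = num_units // n_groups
    group_plaintexts = [
        bytes(rng.getrandbits(8) for _ in range(16)) for _ in range(n_groups)
    ]
    return [group_plaintexts[u // m] for u in range(num_units)]
```

For a warp of 32 threads, `grouped_plaintexts(32, 1)` corresponds to T1G encryption and `grouped_plaintexts(32, 2)` to T2G encryption.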

**Notations**.

In this paper, · and ⋆ denote the quantifiers *all* and *any*, respectively. For example, if \(\mathcal {M}\) is a 5×10 matrix, then \(\mathcal {M}[2,\cdot ]\) denotes the 10 elements of the second row of \(\mathcal {M}\), which is also a 10-dimensional vector, and \(\mathcal {M}[\star,3]\) denotes any one of the 5 elements in the third column of \(\mathcal {M}\), which is also regarded as a set composed of 5 elements.

\(\text {\tt {Bit}}_{j}(\mathcal {R})\): the function outputs the *j*-th bit, counted from the least significant bit starting at 1, of each component of \(\mathcal {R}\) in its original format. For example, if \(\mathcal {R}=[0b10010101,0b01011101]\), then \(\text {\tt {Bit}}_{1}(\mathcal {R})=[0b1,0b1]\) and \(\text {\tt {Bit}}_{4}(\mathcal {R})=[0b0,0b1]\).
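A small sketch of the \(\text{\tt{Bit}}_{j}\) function, under the assumption that the example values are binary strings and that bits are numbered from 1 starting at the least significant bit (the only reading under which both example outputs hold). The function name `bit` is chosen here for illustration.

```python
def bit(j, values):
    """Bit_j: the j-th bit (1 = least significant) of each component."""
    if j < 1:
        raise ValueError("bit positions are numbered from 1")
    return [(v >> (j - 1)) & 1 for v in values]


R = [0b10010101, 0b01011101]
print(bit(1, R))  # [1, 1]
print(bit(4, R))  # [0, 1]
```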

## EM measurement

Electro-magnetic emanation around electronic devices can be captured without difficulty, but measuring useful signals is not so easy. Compared with power analysis, EM analysis lets us take advantage of localization effects, which makes EM attacks more efficient than power analysis attacks. We use two small magnetic probes, *Rohde Schwarz RF B 3-3* and *Rohde Schwarz RS H 2.5-2*, instead of larger ones in order to probe localized leakages from near-field emanation (Agrawal et al. 2002). Theoretically, the region located less than 1/(2*π*) of a wavelength away from the source is called the *near-field*. All probing in this work is conducted in this region.

**Set-ups** The testbed in this work is set up with the following configurations:

We target an NVIDIA GeForce GT 620 graphics card connected to the host through a PCI-e bus. Though the device is of low performance, it is enough to show the vulnerability of NVIDIA GPUs to EM attacks. Specifically, the device has one streaming multiprocessor of 48 SPs and an L2 cache of 64 KiB, and it is equipped with an off-chip device memory of 454 MiB. The device runs at 1.27 GHz.

We port a bitsliced AES implementation from an open-source community (Patrick) to the GPU. Since it is a table-free implementation, we do not need to consider the efficiency of table look-ups with respect to different types of memory. The device memory of our GPU is used to store the plaintexts to be encrypted as well as the ciphertexts to be produced.

We employ an Agilent DSO9104A digital oscilloscope, which is capable of measuring signals with a sampling rate up to 20GHz (20GSa/s). We set our sampling rate as 200MSa/s, which turns out to be enough for our experiments.

Our testbed is set up in a client/server mode which is widely used for internet applications. Specifically, in cloud computing environment, cloud devices that provide SECaaS work as servers, and inside attackers (Duncan et al. 2015) are authorized to encrypt any plaintexts \(\mathcal {P}\)-s then obtain the corresponding ciphertexts \(\mathcal {C}\)-s and measured EM traces \(\mathcal {T}\)-s. With a sufficient number of triples \(\langle \mathcal {P},\mathcal {C},\mathcal {T}\rangle \), the attackers attempt to recover the preset secret key of our GPU bitsliced AES implementation.

**Locate Signals** A printed circuit board such as a GPU card is usually composed of hundreds of electronic components such as chips, resistors, capacitors and inductors. However, it is not necessary to check all of them to locate target signals. Generally speaking, only the area right above the GPU chip and the capacitors on the back of the GPU chip need to be checked, because these positions and components tend to produce useful leakages, which is confirmed afterwards in our experiments. More specifically, we start the CUDA program and run the encryption procedure in a loop. We move the EM probe over the candidate components within their near-field zones until we find a position at which the oscilloscope captures a periodic signal. If some patterns within the signal repeat nine to ten times, a leakage position has been found. A repeated signal from our experiments is shown in Fig. 2. We call it the *target signal*.

**Collect Signals** Although the target signal has been identified, it is still not easy to capture it without an external trigger. In fact, it is impractical to provide an external trigger controlled from within the program, so we derive a trigger from a second magnetic probe. As shown in Fig. 2, the two signals measured at different probing positions look similar, and the amplitude of the upper one is generally smaller than that of the lower one. However, the two signals share a pattern of the same high voltage, marked as *Trigger A* and *Trigger B*, so the more pronounced difference between the *Trigger* pattern and the rest of the signal in the upper channel makes *Trigger A* the better choice for triggering the capture of the target signal.

**Align Signals** The trigger described above yields EM traces that are almost aligned, but this is still not enough to perform a successful attack; more accurate trace alignment is necessary. Zooming in on the first encryption round of the lower signal in Fig. 2 reveals more detail, as shown in Fig. 3. First, we inspect the signal for distinctive patterns and find a *two-trough* pattern (*C* in Fig. 3) that is shared by all traces. This pattern is a natural reference for aligning all traces. Second, we match the pattern among several traces and find that its occurrences in different traces are strongly correlated (Pearson Correlation Coefficient, *PCC* > 0.70). Third, we fix one trace and slide each of the others within a small range to find the position at which its pattern attains the maximum *PCC* with the pattern in the fixed trace. We exclude the traces whose maximum *PCC* is less than 0.70; the remaining traces are then aligned at their best-matching positions.
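The sliding-window search can be sketched as follows. The function names, the window parameters `start`, `length` and `max_shift`, and the choice to return the shift rather than a resampled trace are assumptions made for illustration; only the 0.70 threshold comes from the text.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant window carries no pattern
    return sxy / math.sqrt(sxx * syy)


def find_shift(reference, trace, start, length, max_shift, threshold=0.70):
    """Locate the reference pattern reference[start:start+length] in trace.

    Returns (shift, best_pcc), or None when the best match stays below
    the threshold, in which case the trace would be discarded.
    """
    pattern = reference[start:start + length]
    best_shift, best = None, -1.0
    for shift in range(-max_shift, max_shift + 1):
        lo = start + shift
        if lo < 0 or lo + length > len(trace):
            continue
        r = pcc(pattern, trace[lo:lo + length])
        if r > best:
            best_shift, best = shift, r
    if best_shift is None or best < threshold:
        return None
    return best_shift, best
```

A trace with a reported shift of *s* would then be aligned by moving it *s* samples to the left (or right, for negative *s*) before any statistic is computed.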

## The vulnerabilities of bit-level parallelism

A bitsliced implementation features bit-level parallelism, which means that the bits located at the same position of multiple plaintexts are processed simultaneously. As is known, the 16-byte *state* of a standard AES implementation is processed byte by byte, so its EM leakage at any moment is a function of one byte. For a bitsliced AES implementation, however, the bits at the same position of the standard *states* of multiple slices are gathered into a single processor register, so the EM leakage at any moment is a function of several bits from the standard states of multiple slices. Therefore, in the bitsliced version each secret key byte is likely to be leaked at eight or more different moments, which makes it possible for specific multi-bit combinational analysis methods to be effective. In this section, we investigate the side-channel vulnerabilities of a GPU-based bitsliced AES implementation from the perspective of the bit-level parallelism provided by the intrinsic properties of bitsliced ciphers, using combinational analysis techniques.

We analyze the output of SubByte in the first AES round of a bitsliced AES implementation (Patrick). The *C* code snippet of the SubByte function of the implementation is shown in Fig. 4, where U and S are the input and output of the function, respectively, and T*x* and L*xx* are temporary variables. The function computes the SubByte of one byte for multiple slices. A word_t is 32 bits long on the target GPU, so 32 slices are processed simultaneously in any single GPU thread.

We know from S[0], S[1], S[2], S[3], S[4], S[5], S[6] and S[7] that, for each byte in the standard AES state, the eight bits are processed independently, and the same bit of multiple slices is processed simultaneously. This fact is of great importance for the research in this section.

### Non-profiled leakage detection test

Since the 128 bits in the standard AES state are processed at different instants of time, the leakages from the 128 bits do not overlap in the time domain. Therefore, a non-profiled leakage detection test can be applied to each of the 128 bits to evaluate the amount of EM leakage from each individual bit. For other bitsliced AES implementations, the test results may differ considerably from ours, but the method itself still applies.

Specifically, Welch’s *t*-test (Goodwill et al. 2011) is employed to detect the EM leakages from each of the 128 bits. A non-profiled leakage detection test requires the intermediates under test to be specified. In our test, we take the 128 bits after the SubByte operation as the target intermediates.

Because of the multi-slice plaintexts used in bitsliced AES, we take the CSSCA and CTSCA modes into account when performing a test. As shown in Fig. 1, the *state* of a 32-slice bitsliced AES encryption can be formalized as a 128×32 binary matrix:

\[
\mathcal{I}:=\begin{bmatrix}
b_{0}^{1} & b_{0}^{2} & \cdots & b_{0}^{32}\\
b_{1}^{1} & b_{1}^{2} & \cdots & b_{1}^{32}\\
\vdots & \vdots & \ddots & \vdots\\
b_{127}^{1} & b_{127}^{2} & \cdots & b_{127}^{32}
\end{bmatrix} \tag{2}
\]

where \(b_{i}^{j}\) corresponds to the (*i*+1)-th bit in the standard AES state of the *j*-th slice. The 32 bits in each row of \(\mathcal {I}\) are stored in a separate register, so 128 registers are required to hold the (128×32)-bit *bitsliced AES state*. To reduce noise, our test is conducted in *S*1*G* encryption mode, in which all 32 slices are fed with the same plaintext, so that each of the 128 registers holds either 32 zeros or 32 ones instead of one of \(2^{32}\) possible values. The rationale is that distinguishing between two values is much easier than distinguishing among \(2^{32}\) values. To further reduce noise, the 32 threads in a warp also run the same plaintext, which is referred to as *T*1*G* encryption mode.

**Test Method** The *t*-test procedure consists of three steps. For our target implementation, the *t*-test is performed on each of the 128 intermediate bytes, say \(b_{0},b_{1},b_{2},\ldots,b_{127}\), where \(b_{i}:=b_{i}^{1}b_{i}^{2}b_{i}^{3}\ldots b_{i}^{32}\). Take the first intermediate byte \(b_{0}\) as an example.

First, *N* EM traces of *M* sampling points each are partitioned into two groups, \(G_{0}\) and \(G_{1}\), according to whether the corresponding intermediate byte is \(b_{0}=0\) or \(b_{0}=2^{32}-1\).

Then, the *t*-statistic at each sampling point is computed:

\[
t(\tau)=\frac{\mu_{0}(\tau)-\mu_{1}(\tau)}{\sqrt{\dfrac{s_{0}^{2}(\tau)}{n_{0}}+\dfrac{s_{1}^{2}(\tau)}{n_{1}}}} \tag{3}
\]

where \(\mu_{0}(\tau)\) and \(\mu_{1}(\tau)\) are the means of \(G_{0}\) and \(G_{1}\) at time \(\tau\) (that is, at the \(\tau\)-th sample point, so \(\tau \in [1,M]\cap \mathbb {Z}\)), \(s_{0}(\tau)\) and \(s_{1}(\tau)\) are the standard deviations of \(G_{0}\) and \(G_{1}\) at time \(\tau\), and \(n_{0}\) and \(n_{1}\) are the cardinalities of \(G_{0}\) and \(G_{1}\).

Last, we determine whether the two sets \(G_{0}\) and \(G_{1}\) are sampled from the same population. Generally speaking, two sets are assumed to be sampled from two distinct populations if \(|t(\tau)|>4.5\) at some \(\tau\) (Schneider and Moradi 2015). We follow this convention in our test.
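The three steps above can be sketched in a few lines. This is a minimal illustration with hypothetical function names; it assumes traces are given as equal-length lists of samples and that each trace is labelled with the 32-bit value of the intermediate under test, which in *S*1*G* mode is either 0 or \(2^{32}-1\).

```python
import math


def partition(traces, words):
    """Step 1: split traces by the value of the tested intermediate."""
    g0 = [tr for tr, w in zip(traces, words) if w == 0]
    g1 = [tr for tr, w in zip(traces, words) if w == 0xFFFFFFFF]
    return g0, g1


def welch_t(g0, g1):
    """Step 2: Welch's t-statistic at every sample point."""
    n0, n1 = len(g0), len(g1)
    stats = []
    for tau in range(len(g0[0])):
        x0 = [tr[tau] for tr in g0]
        x1 = [tr[tau] for tr in g1]
        mu0, mu1 = sum(x0) / n0, sum(x1) / n1
        var0 = sum((x - mu0) ** 2 for x in x0) / (n0 - 1)  # sample variance
        var1 = sum((x - mu1) ** 2 for x in x1) / (n1 - 1)
        denom = math.sqrt(var0 / n0 + var1 / n1)
        stats.append((mu0 - mu1) / denom if denom > 0 else 0.0)
    return stats


def leaks(stats, threshold=4.5):
    """Step 3: leakage is flagged if |t| exceeds the threshold anywhere."""
    return any(abs(t) > threshold for t in stats)
```

Running `welch_t` once per intermediate and stacking the results row by row gives the 128×*M* matrix of statistics used in the remainder of this section.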

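The same tests are carried out for the leakages from the other 127 intermediate bytes, say \(b_{1},b_{2},\ldots,b_{127}\). As a result, 128×*M* *t*-statistics are obtained:

\[
T:=\begin{bmatrix}
t_{0,1} & t_{0,2} & \cdots & t_{0,M}\\
t_{1,1} & t_{1,2} & \cdots & t_{1,M}\\
\vdots & \vdots & \ddots & \vdots\\
t_{127,1} & t_{127,2} & \cdots & t_{127,M}
\end{bmatrix} \tag{4}
\]

where \(t_{\star,j}\) corresponds to \(t(j)\).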

**Test Results and Discussions** As mentioned above, there are 128 registers that hold the (128×32)-bit *state* of the bitsliced AES encryption. We perform the *t*-test on each of the 128 32-bit intermediates stored in the 128 registers. The test results are shown in Fig. 5. We observe that 16 peaks are clearly visible, and they lie far beyond the preset thresholds [−4.5,4.5] for deciding whether the value in a register is leaked. The 16 peaks in different colors represent the leakages of the 16 SubByte outputs in the first encryption round of the bitsliced implementation.

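As a matter of fact, the values of the *t*-statistic in the time domain are not that important, because we intend to determine whether, rather than when, any of the 128 intermediate bytes is leaked. Hence, the maximum of the values in each row of *T* is sufficient, so we define

\[
(t_{0},t_{1},t_{2},\ldots,t_{127})^{\text{\tt{T}}}:=\left(\max T[0,\cdot],\ \max T[1,\cdot],\ \max T[2,\cdot],\ \ldots,\ \max T[127,\cdot]\right)^{\text{\tt{T}}} \tag{5}
\]

Each run of 8 consecutive values in \((t_{0},t_{1},t_{2},\ldots,t_{127})^{\text{\tt{T}}}\) (Eq. 5), for example \(t_{0},t_{1},\ldots,t_{7}\), corresponds to one byte of the standard AES *state*. As is known from the standard AES algorithm, the 8 values all depend on the same key byte, so we say that each key byte is leaked through at least 8 intermediate bytes that undergo the same operation. This is also the essence of the distinctive leakage feature provided by bit-level parallelism. In this study, the distinctive leakage feature is formally defined as a (16×8) matrix:

\[
\Gamma:=\begin{bmatrix}
t_{0} & t_{1} & \cdots & t_{7}\\
t_{8} & t_{9} & \cdots & t_{15}\\
\vdots & \vdots & \ddots & \vdots\\
t_{120} & t_{121} & \cdots & t_{127}
\end{bmatrix} \tag{6}
\]

The matrix describes the leakages on each of the 128 intermediate bytes. The experimental results for *Γ* are shown in Fig. 6. *Γ* corresponds to the histogram, but the *t*-s that are beyond [−4.5,4.5] in *Γ* are reduced to zeros in the histogram. We also define another matrix based on *Γ*:

\[
\mathcal{F}:=\begin{bmatrix}
\zeta_{0} & \zeta_{1} & \cdots & \zeta_{7}\\
\zeta_{8} & \zeta_{9} & \cdots & \zeta_{15}\\
\vdots & \vdots & \ddots & \vdots\\
\zeta_{120} & \zeta_{121} & \cdots & \zeta_{127}
\end{bmatrix} \tag{7}
\]

where for any *i*∈{0,1,2,...,127}, the variable \(\zeta_{i}=1\) if \(t_{i}\notin[-4.5,4.5]\); otherwise \(\zeta_{i}=0\). In addition, we define a Multi-Bit Leakage Feature (MBLF):

\[
\mathcal{J}:=\left[\mathcal{J}_{1},\mathcal{J}_{2},\ldots,\mathcal{J}_{16}\right] \tag{8}
\]

where \(\mathcal {J}_{n}\subset \{1,2,...,8\}\) denotes the set of indices of the entries that are equal to 1 in \(\mathcal {F}[n,\cdot ]\).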
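The construction of *Γ*, \(\mathcal{F}\) and the MBLF from the 128 per-intermediate statistics can be sketched as follows. The function name is hypothetical, the index sets \(\mathcal{J}_{n}\) are returned as sorted lists with 1-based bit indices, and the input values in the usage note are made up for illustration rather than taken from the experiments.

```python
def leakage_features(t, threshold=4.5):
    """Build Gamma (16x8), F (16x8, 0/1) and the MBLF from 128 statistics.

    t[8*n : 8*n + 8] is taken to hold the eight statistics that belong
    to the (n+1)-th byte of the standard AES state.
    """
    if len(t) != 128:
        raise ValueError("expected one statistic per intermediate (128)")
    gamma = [list(t[8 * n: 8 * n + 8]) for n in range(16)]
    # zeta = 1 exactly when the statistic lies outside [-threshold, threshold]
    F = [[1 if abs(v) > threshold else 0 for v in row] for row in gamma]
    # J_n: 1-based positions of the ones in row n of F
    J = [[j + 1 for j, zeta in enumerate(row) if zeta] for row in F]
    return gamma, F, J
```

For instance, a row of *Γ* equal to `[10, -6, 0, 0, 0, 5, 0, 20]` yields the row `[1, 1, 0, 0, 0, 1, 0, 1]` in \(\mathcal{F}\) and the index set `[1, 2, 6, 8]`.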

### A simple single-bit correlation analysis

We learn from the above that all 16 secret key bytes are leaked from the most significant bit (MSB) of the respective intermediate bytes, because all elements in the 8th column of *F* are equal to 1. Therefore, it is possible to recover all 16 secret key bytes if an MSB-based correlation EM analysis (Brier et al. 2004) (MSB-CEMA) is employed. The leakage model of the MSB-CEMA is

where \(\mathcal {L}_{n}\) denotes the predicted EM leakage of the *n*-th intermediate byte, \(\mathcal {I}[\cdot,\cdot ]\) denotes the intermediate matrix in Eq. 2, and *a* is a scale factor whose value is insignificant. In addition, \(\mathcal {N}_{noise}\) is Gaussian noise.

The single-bit CEMA described above makes use of the leakage from the MSB of each intermediate byte only. In fact, each secret key byte is leaked from more than one bit of the relevant intermediate byte. For example, the first secret key byte is leaked from the 1st, 2nd, 6th and 8th bits, and the second secret key byte is leaked from the 1st and 4th bits. Therefore, it is possible to take full advantage of the multi-bit leakages of a given key byte to improve performance.
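A single-bit CEMA on one key byte can be sketched as follows. This is an illustration under stated assumptions, not the attack code used in the experiments: the traces are simulated, the leakage is assumed to be proportional to the Hamming weight of a register that holds 32 copies of the targeted bit (the *S*1*G* situation), only a single sample point is considered, and all function names are hypothetical. The S-box is computed from its definition to keep the block self-contained.

```python
import math
import random


def build_sbox():
    """AES S-box: inverse in GF(2^8) followed by the affine transformation."""
    def gmul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11B  # reduce modulo x^8 + x^4 + x^3 + x + 1
            b >>= 1
        return r

    inv = [0] * 256
    for a in range(1, 256):
        inv[a] = next(b for b in range(1, 256) if gmul(a, b) == 1)
    sbox = []
    for x in inv:
        y = x
        for s in range(1, 5):
            y ^= ((x << s) | (x >> (8 - s))) & 0xFF  # rotate left by s
        sbox.append(y ^ 0x63)
    return sbox


SBOX = build_sbox()


def pcc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def single_bit_cema(plaintext_bytes, leakages, bit=8):
    """Score every key guess by |PCC| between the predicted bit and the leakage."""
    scores = []
    for k in range(256):
        predicted = [(SBOX[p ^ k] >> (bit - 1)) & 1 for p in plaintext_bytes]
        scores.append(abs(pcc(predicted, leakages)))
    best = max(range(256), key=lambda k: scores[k])
    return best, scores


# Simulated measurements (hypothetical data, not the paper's traces).
rng = random.Random(2020)
true_key = 0x2B
plaintexts = [rng.randrange(256) for _ in range(500)]
leakages = [32 * ((SBOX[p ^ true_key] >> 7) & 1) + rng.gauss(0, 8)
            for p in plaintexts]
recovered, scores = single_bit_cema(plaintexts, leakages, bit=8)
```

Passing a different `bit` value targets another bit of the SubByte output, which is the building block of the multi-bit methods discussed next.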

### Multi-bit combinational analysis

The first combinational method is multi-bit feature combinational analysis (MB-FCA). As its name suggests, MB-FCA makes the best of *Γ* shown in Eq. 6 to determine which of the eight bits is used to compute the predicted leakage. Specifically, the bit is selected as follows:

The rationale is that the intermediate bit with the maximum amount of EM leakage is the best candidate to predict the measured EM leakages.
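A sketch of the bit selection, assuming that "the maximum amount of EM leakage" is measured by the largest absolute *t*-statistic in the corresponding row of *Γ*, and that a row without any value beyond the threshold yields no selection. Both points are assumptions made for illustration, and the function name is hypothetical.

```python
def select_bits(gamma, threshold=4.5):
    """For each row of Gamma, pick the 1-based index of the strongest bit.

    Returns None for a row in which no statistic lies outside
    [-threshold, threshold], since no leaking bit is available there.
    """
    selected = []
    for row in gamma:
        j = max(range(len(row)), key=lambda c: abs(row[c]))
        selected.append(j + 1 if abs(row[j]) > threshold else None)
    return selected
```

With the illustrative row `[10, -6, 0, 0, 0, 5, 0, 20]`, the selected bit is 8, so the most significant bit would be used to predict the leakage of that byte.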

Let \(\mathcal {B}:=[b_{1},b_{2},b_{3},...,b_{16}]\) be the vector of selected bits, which we call the combined multi-bit leakage feature (cMBLF). \(\mathcal {B}\) is the essence of MB-FCA and is used to construct the following leakage model:

With the leakage model derived from cMBLF, a simple CEMA will suffice to recover the 16-byte secret key.

The second combinational method is multi-bit decision combinational analysis (MB-DCA). Unlike MB-FCA, MB-DCA does not select an intermediate bit before modeling the leakage; instead, it tries every leaked bit and then makes a decision based on the results of the individual analyses.

The first step of MB-DCA is to perform an analysis on every leaked intermediate bit according to the MBLF, so the leakage model is:

where \(\mathcal {J}_{n}\{i\}\) denotes the *i*-th element in \(\mathcal {J}_{n}\), and \(\mathcal {L}_{n}^{\mathcal {J}_{n}\{i\}}\) denotes the predicted leakage of the \(\mathcal {J}_{n}\{i\}\)-th bit of the *n*-th intermediate.

To recover the *n*-th secret key byte, \(|\mathcal {J}_{n}|\) CEMAs should be performed before making a decision:

where \(\rho _{n}^{\mathcal {J}_{n}\{i\}}(\tilde {k})\) is a matrix of Pearson Correlation Coefficients obtained from the CEMAs based on the model \(\mathcal {L}_{n}^{\mathcal {J}_{n}\{i\}}\). In this way, it is feasible to recover the 16 secret key bytes of AES. More details about MB-DCA are shown in Algorithm 1.
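The decision step can be illustrated with a short sketch. The decision rule used here, taking the key guess with the largest absolute correlation over all leaked bits, is an assumption made for illustration; Algorithm 1 defines the actual procedure. The function name and the input format (one list of 256 scores per leaked bit) are likewise hypothetical.

```python
def mb_dca_decide(scores_per_bit):
    """Pick a key byte from the CEMA results of several leaked bits.

    scores_per_bit maps a bit index in J_n to a list of 256 values,
    the absolute correlation obtained for each key guess with that bit.
    Returns (key_guess, bit_index, score) for the overall maximum.
    """
    best = None
    for bit_index, scores in sorted(scores_per_bit.items()):
        for key_guess, score in enumerate(scores):
            if best is None or score > best[2]:
                best = (key_guess, bit_index, score)
    if best is None:
        raise ValueError("no leaked bit available for this key byte")
    return best
```

Each entry of `scores_per_bit` would be produced by one single-bit CEMA, so recovering the *n*-th key byte costs \(|\mathcal{J}_{n}|\) analyses before the decision is taken.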

### Experimental results and discussion

Since the numbers of multi-slice groups and multi-thread groups for the GPU-based bitsliced AES encryption of a (32×32×16)-byte plaintext within a GPU warp have no bearing on MB-FCA and MB-DCA, our experiments are performed in a simplified mode, namely *T*1*G*-*S*1*G* encryption mode. Therefore, only one 128-bit plaintext is required to obtain one EM trace. As mentioned above, an MBLF and a cMBLF must be extracted from \(\mathcal {F}\) and *Γ*, respectively, before MB-FCA or MB-DCA is applied. In our setting, we obtain the following MBLF \(\mathcal {J}\) and cMBLF \(\mathcal {B}\):

We compare the performance of MB-FCA and MB-DCA with that of MSB-CEMA in our experiments. The experimental results show that a complete key-recovery attack with MSB-CEMA requires at least 900 EM traces (Fig. 7), while MB-FCA and MB-DCA perform almost identically and require at least 500 EM traces. Note that a scalar of \(\mathcal {B}\) is very likely to be null if all values of the corresponding row in *Γ* are within [−4.5,4.5]. In this case, the 16 key bytes cannot be recovered completely. Formally, a complete key recovery with MB-FCA or MB-DCA is feasible only if \(\mathcal {F}\) satisfies

Note that MB-FCA and MB-DCA cannot work if no prior leakage detection test is available, because both methods rely on prior knowledge of *Γ*. That is to say, MB-FCA and MB-DCA are essentially profiled side-channel analysis techniques, like template attacks (Chari et al. 2002). However, our methods are more practical than template attacks: for devices of a given architectural model, the profiling is done only once to extract *Γ* for that architecture before attacking any device of the architecture, whereas both profiling and attacking in template attacks usually target identical devices. In addition, template attacks require a very low noise level, so they are almost ineffective against GPU-based cryptographic implementations.

## The vulnerabilities of thread-level parallelism

GPU-based bitsliced implementation achieves thread-level parallelism because of the SIMT execution fashion of GPUs, which means multiple threads execute the same program in parallel. For simplicity, we consider the parallelism within a GPU warp, because the threads in a warp execute the same instruction at any moment if warp divergence does not happen. In other words, the threads in a warp achieve a full synchronization in the time domain. However, the full synchronization among multiple thread executions does not necessarily imply a full synchronization of their respective EM leakages. In this section, we investigate the vulnerabilities of a GPU-based bitsliced implementation from the perspective of thread-level parallelism.

### Developing a simple attack

As mentioned above, multiple thread executions within a GPU warp are synchronous, so their leakages are usually assumed to be synchronous as well. Suppose that the Hamming weight leakages of multiple thread executions are additive, that is, \(E\left (\sum _{n=1}^{N}\text {\tt {HW}}(I_{n})\right)=\sum _{n=1}^{N}E(\text {\tt {HW}}(I_{n}))\), where \(I_{1},I_{2},I_{3},\ldots,I_{N}\) are some intermediates within *N* threads, and *E* denotes an ideal EM leakage model without any noise, say *E*(*x*)=*a*·*x* for a positive *a*. Then we construct a simple synchronous model (SSM) for multi-thread leakages in a warp:

\(\mathcal {I}'\) in the Equation is defined as follows:

where \(\mathcal {L}_{n}^{m}\) denotes the predicted leakages based on the *m*-th intermediate bit, \(B_{i}^{\star }:=b_{i}^{1}b_{i}^{2}...b_{i}^{32}\) is a 32-bit value, and \(\mathcal {I}'[\cdot,j]:=\left [B_{0}^{j},B_{1}^{j},B_{2}^{j},...,B_{127}^{j}\right ]^{\text {\tt {T}}}\) corresponds to the \(\mathcal {I}\) (Eq. 2) within the *j*-th thread of a warp.

The above SSM is based on another assumption, that is, the Hamming weight leakages in different threads are of the same scale. Otherwise, the model should be:

This model assigns different weights to the predicted leakages from the 32 threads and should therefore be more accurate than the previous SSM. Nevertheless, we do not study analysis methods based on this model, because the ratio *a*_{1}:*a*_{2}:*a*_{3}:...:*a*_{32} has to be known in advance, which we consider impractical short of an exhaustive search.

The SSM assumes that the EM leakages of multiple threads are synchronous. If not all of the multi-thread leakages are synchronous, we arrive at another leakage model, the Partial Synchronous Model (PSM):

where \(\mathcal {I}\subset \{1,2,...,32\}\) if the EM leakages of 32 threads are considered. The key problem in constructing a PSM is therefore how to obtain \(\mathcal {I}\).
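The three models above differ only in which threads contribute to the predicted leakage and with what weight. The following Python sketch illustrates the difference, assuming that each thread contributes one predicted bit of the targeted intermediate. The function names and the example bit vector are hypothetical and are not taken from the original implementation.

```python
def ssm_prediction(bits):
    """SSM: all threads leak at the same time and on the same scale, so the
    prediction is the Hamming weight of the word built from one bit per thread."""
    return sum(bits)


def weighted_prediction(bits, weights):
    """Weighted variant: thread n contributes with its own scale a_n.
    The weights would have to be known in advance."""
    return sum(a * b for a, b in zip(weights, bits))


def psm_prediction(bits, threads):
    """PSM: only the threads in the given subset (1-based indices) contribute."""
    return sum(bits[t - 1] for t in threads)


# Hypothetical predicted bits for the 32 threads of one warp.
bits = [1, 0, 1, 1] + [0] * 28

ssm = ssm_prediction(bits)                      # all 32 threads
psm = psm_prediction(bits, {1, 2, 17, 18})      # a chosen subset of threads
uniform = weighted_prediction(bits, [1] * 32)   # reduces to the SSM
```

With uniform weights the weighted model reduces to the SSM, and with the full index set {1,...,32} the PSM does as well, so the SSM can be seen as the special case of both.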

**Evaluations and Discussions** To evaluate the performance of SSM-based CEMA (SSM-CEMA) against the GPU-based bitsliced AES implementation, we set up experiments in the *T*2*G*- *S*1*G*, *T*8*G*- *S*1*G* and *T*32*G*- *S*1*G* encryption modes. As shown in Fig. 8, the 16 secret key bytes of AES can be recovered with about 1,500 EM traces in *T*2*G*- *S*1*G* mode, while none of the 16 secret key bytes is recovered with up to 4,000 EM traces in *T*8*G*- *S*1*G* or *T*32*G*- *S*1*G* mode. This suggests that the performance of SSM-CEMA degrades as more threads are grouped in the encryption mode, because more noise is introduced and the SSM becomes less accurate. The latter reason is verified in the next experiment.

To verify the accuracy of the SSM in *T*2*G*- *S*1*G* encryption mode, we compare the SSM with a PSM that selects half of the 32 threads in a warp. As shown in Fig. 9, the PSM-CEMA with the first half of the 32 threads performs better than the SSM-CEMA, which suggests that the PSM is more accurate than the SSM. Therefore, the EM leakages in a warp may not be synchronous, and further study is needed.

### Profiled correlation-based leakage detection test

To further understand the nature of the above parallel EM leakage, we employ a profiled leakage detection test to analyze the individual leakages of multiple threads in a warp. With profiled methods, we do not have to make any assumptions about the leakage model of the target implementation, which lowers the prerequisites for an attack and simplifies the procedure. The profiled *ρ*-test we use is due to Durvaux and Standaert (2016). It takes advantage of the cross-validation techniques introduced in Durvaux et al. (2014), and we apply it to the leakage detection of all threads in a warp. For the leakage test of one thread, the leakages from all other threads are treated as random noise. Specifically, the *ρ*-test is carried out in three steps:

*First*, *N* EM traces with random plaintext inputs are sampled. For *k*-fold cross-validation, the set of acquired traces \(\mathcal {T}\) is split into *k* (we set *k*=10) non-overlapping subsets \(\mathcal {T}^{(1)},\mathcal {T}^{(2)},...,\mathcal {T}^{(k)}\) of (approximately) the same size. For *j*=1,2,3,...,*k*, we define the profiling sets \(\mathcal {T}_{p}^{(j)}=\bigcup _{i\neq j}\mathcal {T}^{(i)}\) and the test sets \(\mathcal {T}_{t}^{(j)}=\mathcal {T}\backslash \mathcal {T}_{p}^{(j)}\). For each target plaintext byte variable *X*_{m,n} with *m*∈{1,2,3,...,32}, *n*∈{1,2,3,...,16} and for each cross-validation set *j* with *j*∈{1,2,3,...,*k*}, a model is estimated: \(\hat {model}_{\tau }^{(j)}(X_{m,n})\leftarrow \mathcal {T}_{p}^{(j)}(\tau,m,n)\). For 8-bit plaintext bytes, this model consists of the sample means of the leakage sample *τ* for each value of the plaintext byte.
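The profiling step can be sketched in a few lines of Python. This is a minimal illustration for a single time sample *τ* and a single plaintext byte of one thread. The interleaved fold assignment, the helper names `kfold_indices` and `mean_model`, and the toy data are assumptions made for the example.

```python
from collections import defaultdict


def kfold_indices(n_traces, k):
    """Split trace indices 0..n_traces-1 into k non-overlapping folds of
    (approximately) equal size. Interleaving is an arbitrary choice here."""
    return [list(range(i, n_traces, k)) for i in range(k)]


def mean_model(samples, plaintext_bytes, profiling_idx):
    """Estimate the model for one time sample: the mean leakage observed in
    the profiling set for each value of the targeted plaintext byte."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in profiling_idx:
        sums[plaintext_bytes[i]] += samples[i]
        counts[plaintext_bytes[i]] += 1
    return {value: sums[value] / counts[value] for value in sums}


# Toy data: leakage at one time sample and the plaintext byte of each trace.
samples = [1.0, 3.0, 5.0, 7.0]
plaintext_bytes = [0, 0, 1, 1]

folds = kfold_indices(len(samples), k=2)
test_fold = 0
# The profiling set is the union of all folds except the test fold.
profiling = [i for j, fold in enumerate(folds) if j != test_fold for i in fold]
model = mean_model(samples, plaintext_bytes, profiling)
```

In the full test, this estimation is repeated for every fold, every time sample, every thread index *m* and every byte index *n*.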

*Next*, we compute the Pearson correlation coefficient between this model and the leakage sample in the test sets \(\mathcal {T}_{t}^{(j)}(\tau,m,n)\):

where *m*∈{1,2,3,...,32} and *n*∈{1,2,3,...,16}.

*Last*, the normalized *ρ*-statistic, which is compared against the standard normal distribution, is evaluated:

where *N* is the number of EM traces. Since \(\hat {\rho }_{m,n}(\tau)\) approximately follows a standard normal distribution at any time *τ* when no leakage is present, \(|\hat {\rho }_{m,n}(\tau)|>4.5\) indicates the existence of leakage at *τ* with very high probability (Schneider and Moradi 2015).
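The last two steps can be sketched as follows. The sketch assumes the normalization used in the *ρ*-test of Durvaux and Standaert (2016), namely Fisher's z-transform of the correlation coefficient scaled by \(\sqrt {N-3}\). The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rho_statistic(r, n_traces):
    """Fisher z-transform of r, scaled by sqrt(N - 3) so that the result is
    approximately standard normal when there is no leakage. Requires |r| < 1."""
    return 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n_traces - 3)


def is_leaky(r, n_traces, threshold=4.5):
    """Leakage is declared when the normalized statistic exceeds the threshold."""
    return abs(rho_statistic(r, n_traces)) > threshold
```

For instance, under this normalization a correlation of 0.1 estimated from 4,000 traces lies well above the 4.5 threshold, whereas the same correlation estimated from 100 traces does not.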

For each *m*∈{1,2,3,...,32} and *n*∈{1,2,3,...,16}, we define \(\rho _{n}^{m}:=\max _{\tau } |\hat {\rho }_{m,n}(\tau)|\), so the following two matrices are obtained:

where for *i*=1,2,3,...,16 and *j*=1,2,3,...,32, \(\xi _{i}^{j}=1\) if \(\rho _{i}^{j}>4.5\), and \(\xi _{i}^{j}=0\) otherwise. In addition, the corresponding moments of occurrence are:

where \(\tau _{i}^{j}:=\mathop {argmax}_{\tau } |\hat {\rho }_{j,i}(\tau)|\). We define a weak Multi-Thread Leakage Feature (wMTLF) *Λ* based on \(\mathcal {F}'\):

where *Λ*_{n}⊆{1,2,...,32} denotes the set of indices of the entries equal to 1 in \(\mathcal {F}'[n,\cdot ]\). We also define a strong Multi-Thread Leakage Feature (sMTLF) *Ω* based on \(\mathcal {F}'\) and *Υ*^{′}:

where *Ω*_{n} denotes a partition of *Λ*_{n}. The notation *Ω*_{n}{*i*,*j*} denotes the *j*-th element in the *i*-th subset of *Ω*_{n}. For any *i*∈{1,2,...,|*Ω*_{n}|} and *x*,*y*∈*Ω*_{n}{*i*,·}, \(|\tau _{n}^{x}-\tau _{n}^{y}|<\delta \) holds, while for any *i*,*j*∈{1,2,...,|*Ω*_{n}|}, *i*≠*j*, *x*∈*Ω*_{n}{*i*,·} and *y*∈*Ω*_{n}{*j*,·}, \(|\tau _{n}^{x}-\tau _{n}^{y}|\geq \delta \) holds. Here *δ* is a threshold that determines whether the leakages of two threads are considered synchronous.
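Extracting both features for one key byte index *n* can be sketched as below. The greedy grouping over sorted leakage times is an assumption of this sketch: it guarantees that all threads within a group are less than *δ* apart, but it does not enforce the separation between groups, which the data must already satisfy. The statistics, time indices and value of *δ* in the example are invented for illustration.

```python
def weak_feature(rho_max, threshold=4.5):
    """wMTLF for one byte: threads whose maximal |rho|-statistic exceeds the
    threshold. rho_max maps a 1-based thread index to that maximum."""
    return sorted(t for t, r in rho_max.items() if r > threshold)


def strong_feature(leaky_threads, tau, delta):
    """sMTLF for one byte: group leaky threads whose leakage times are closer
    than delta. tau maps a thread index to the time of its maximal statistic."""
    groups = []
    for t in sorted(leaky_threads, key=lambda thread: tau[thread]):
        # Compare with the earliest thread of the current group, so every
        # pair inside a group is less than delta apart.
        if groups and tau[t] - tau[groups[-1][0]] < delta:
            groups[-1].append(t)
        else:
            groups.append([t])
    return groups


# Invented example values, not measurements.
rho_max = {1: 9.0, 2: 8.5, 3: 1.2, 17: 7.1, 18: 6.8}
tau = {1: 100, 2: 101, 17: 140, 18: 141}

lam = weak_feature(rho_max)
omega = strong_feature(lam, tau, delta=5)
```

The wMTLF only records which threads leak, while the sMTLF additionally records which of those threads leak at (nearly) the same time.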

In our experiment, the *ρ*-statistics of the 32 individual tests at each time *τ* are evaluated and plotted in a single figure (Fig. 10). It shows approximately five groups of leakage points, marked as POI_{1}, POI_{2}, POI_{3}, POI_{4} and POI_{5}, respectively. Around each of the five groups, multiple colors accumulate, which gives the impression that the leakages from the 32 threads occur at the same time. However, zooming in on any of the five groups shows otherwise. As shown in Fig. 11, the details of POI_{1}, POI_{2}, POI_{4} and POI_{5} reveal that not all leakages from multiple threads are synchronous, contrary to the usual assumption. This observation is important because synchronous executions are normally expected to generate synchronous rather than asynchronous leakages. It is evident that the leakages of *T*_{1} and *T*_{2} almost overlap, as do those of *T*_{17} and *T*_{18}. Since the absence of an indication in the figure does not rule out the existence of leakage, we still do not know whether the threads other than *T*_{1}, *T*_{2}, *T*_{17} and *T*_{18} leak. We also do not know why the device leaks information in these particular threads; we suspect that it is related to the hardware architecture of CUDA-enabled GPUs as well as to the target software implementation.

### Multi-thread combinational analysis

The above experiments have demonstrated that the EM leakages from multiple threads in a warp are not exactly synchronous. To make full use of the EM leakages from multiple asynchronous threads, we propose two combinational methods.

The first method we propose is Multi-Thread Decision Combinational Analysis (MT-DCA), which is based on the wMTLF. If the *j*-th bit of an intermediate byte is used to model the leakage of a single thread, then MT-DCA is based on the following leakage model:

where \(\mathcal {L}_{n}^{\Lambda _{n}\{i\}}\) denotes the predicted leakage of the *j*-th bit of the *n*-th intermediate in the *Λ*_{n}{*i*}-th thread.

To recover the *n*-th secret key byte, |*Λ*_{n}| CEMAs should be performed before making a decision:

where \(\rho _{n}^{\Lambda _{n}\{i\}}(\tilde {k})\) is a matrix of Pearson correlation coefficients obtained from the CEMAs based on the model \(\mathcal {L}_{n}^{\Lambda _{n}\{i\}}\). In this way, all 16 secret key bytes of AES can be recovered.
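The decision step can be illustrated with a short Python sketch. The combination rule used here, summing the peak absolute correlation of every per-thread CEMA for each key guess and taking the best guess, is an assumption made for illustration and may differ from the exact decision rule of MT-DCA. The function name and the tiny correlation tables are hypothetical.

```python
def mt_dca_decide(corr):
    """Combine several per-thread CEMA results into one key-byte decision.

    corr[i][k] is the list of correlation coefficients over time obtained by
    the i-th CEMA (one per leaky thread) for key guess k.
    Assumed rule: score(k) = sum over CEMAs of max over time of |corr|.
    """
    n_guesses = len(corr[0])
    scores = [
        sum(max(abs(c) for c in cema[k]) for cema in corr)
        for k in range(n_guesses)
    ]
    best = max(range(n_guesses), key=lambda k: scores[k])
    return best, scores


# Two CEMAs, three key guesses, two time samples each (invented numbers).
corr = [
    [[0.01, -0.02], [0.05, 0.30], [0.02, 0.01]],
    [[0.03, 0.01], [-0.25, 0.02], [0.04, -0.03]],
]
best_guess, scores = mt_dca_decide(corr)
```

Because each thread is analyzed separately and only the outcomes are combined, the threads do not need to leak at the same time, which is why the wMTLF alone is sufficient for this method.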

The above MT-DCA depends on the wMTLF, which indicates which of the 32 threads in a warp are leaky but not whether their leakages are synchronous. To make full use of the synchronization of leakages, we propose a second combinational method called Multi-Thread Hybrid Combinational Analysis (MT-HCA), which is based on the sMTLF. If the *j*-th bit of an intermediate byte is used to model the leakage of a single thread, then MT-HCA is based on the following leakage model:

where *w*∈{1,2,...,|*Ω*_{n}|} and \(\mathcal {L}_{n}^{w}\) denotes the predicted leakage of the *j*-th bit of the *n*-th intermediate in the *w*-th group of threads, whose leakages are considered synchronous.

As with MT-DCA, |*Ω*_{n}| CEMAs should be performed before making a decision:

where \(\rho _{n}^{w}\) is a matrix of Pearson correlation coefficients obtained from the CEMAs based on the model \(\mathcal {L}_{n}^{w}\). In this way, all 16 secret key bytes of AES can be recovered. More details of MT-HCA are given in Algorithm 2.
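The group-wise leakage model can be sketched as follows. The sketch assumes that the predictions of the threads inside one synchronous group are simply summed, in line with the summability assumption made for the SSM. It is not a transcription of Algorithm 2, and the function name and example bits are hypothetical.

```python
def mt_hca_models(bits_per_thread, omega):
    """Build one predicted-leakage vector per synchronous group.

    bits_per_thread maps a 1-based thread index to its predicted bit for each
    trace; omega is the sMTLF of one key byte, i.e. a list of thread groups.
    Assumption: threads in the same group leak together, so their predicted
    bits are summed trace by trace.
    """
    models = []
    for group in omega:
        per_trace = zip(*(bits_per_thread[t] for t in group))
        models.append([sum(bits) for bits in per_trace])
    return models


# Invented predicted bits for three traces and four leaky threads.
bits_per_thread = {
    1: [1, 0, 1],
    2: [1, 1, 0],
    17: [0, 0, 1],
    18: [1, 0, 1],
}
omega = [[1, 2], [17, 18]]
models = mt_hca_models(bits_per_thread, omega)
```

Each resulting vector feeds one CEMA, so this example would require two CEMAs per key byte instead of the four needed when every leaky thread is analyzed separately.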

### Experimental results and discussion

Since both MT-DCA and MT-HCA take advantage of EM leakages from multiple threads, our experiments are performed in *T*32*G*- *S*1*G* encryption mode. MT-DCA and MT-HCA depend on the wMTLF and the sMTLF, respectively, so we extract the wMTLF and sMTLF of our experimental setting with the *ρ*-test before attacking. The extracted wMTLF *Λ* and sMTLF *Ω* are as follows:

that is, *Λ*_{1}=*Λ*_{2}=...=*Λ*_{16}={1,2,17,18} and *Ω*_{1}=*Ω*_{2}=...=*Ω*_{16}={{1,2},{17,18}}.

We compare the performance of the five methods shown in Table 1. wPSM-CEMA computes predicted leakages from one thread in the wMTLF, and sPSM-CEMA computes predicted leakages from one group of threads in the sMTLF. Both wPSM-CEMA and sPSM-CEMA are non-combinational methods.

First, we compare the performance of the combinational method with that of the non-combinational one when the wMTLF is available. The experimental results are shown in Fig. 12. wPSM-CEMA is a wMTLF-based non-combinational method, and it performs worse than the wMTLF-based combinational method MT-DCA.

Second, we compare the performance of the combinational method with that of the non-combinational one when the sMTLF is available. The experimental results are shown in Fig. 13. sPSM-CEMA is an sMTLF-based non-combinational method, and it performs worse than the sMTLF-based combinational method MT-HCA.

Both MT-DCA and MT-HCA perform better than their non-combinational counterparts because they make use of the leakages of multiple threads instead of a single one. It is reasonable that more usable leakage yields better performance. In addition, the experimental results show that SSM-CEMA is not effective at all with up to 5,000 traces: because SSM-CEMA does not depend on any leakage feature, more noise is introduced when modelling with the SSM.

## Conclusions

In this paper, we investigate efficient electro-magnetic analysis of a GPU bitsliced AES implementation in order to gain deeper insight into the vulnerabilities of bit-level parallelism and thread-level parallelism. We propose efficient GPU-specific combinational analysis methods and show experimentally that they are more efficient than the non-combinational ones. Our research suggests that multi-thread leakages can be used to improve attacks if the multi-thread leakages are not synchronous in the time domain.

## Availability of data and materials

Not applicable.

## References

Agrawal, D, Archambeault B, Rao JR, Rohatgi P (2002) The EM side-channel(s) In: Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Revised Papers, 29–45, Redwood Shores. https://doi.org/10.1007/3-540-36400-5_4.

Biham, E (1997) A fast new DES implementation in software In: Fast Software Encryption, 4th International Workshop, FSE ’97, Proceedings, 260–272, Haifa. https://doi.org/10.1007/BFb0052352.

Biham, E, Anderson RJ, Knudsen LR (1998) Serpent: A new block cipher proposal. International workshop on fast software encryption. Springer, Berlin, Heidelberg.

Brier, E, Clavier C, Olivier F (2004) Correlation power analysis with a leakage model In: Cryptographic Hardware and Embedded Systems - CHES 2004: 6th International Workshop Cambridge, MA, USA, August 11-13 2004. Proceedings, 16–29. https://doi.org/10.1007/978-3-540-28632-5_2.

Chari, S, Rao JR, Rohatgi P (2002) Template attacks In: Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15 2002, Revised Papers, 13–28. https://doi.org/10.1007/3-540-36400-5_3.

Di, B, Sun J, Chen H (2016) A study of overflow vulnerabilities on GPUs In: Network and Parallel Computing - 13th IFIP WG 10.3 International Conference, NPC 2016, Xi’an, China, October 28-29 2016, Proceedings, 103–115. https://doi.org/10.1007/978-3-319-47099-3_9.

Duncan, AJ, Creese S, Goldsmith M (2015) An overview of insider attacks in cloud computing. Concurr Comput Pract Experience 27(12):2964–2981. https://doi.org/10.1002/cpe.3243.

Durvaux, F, Standaert F (2016) From improved leakage detection to the detection of points of interests in leakage traces In: Advances in Cryptology - EUROCRYPT 2016 - 35th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Vienna, Austria, May 8-12 2016, Proceedings, Part I, 240–262. https://doi.org/10.1007/978-3-662-49890-3_10.

Durvaux, F, Standaert F, Veyrat-Charvillon N (2014) How to certify the leakage of a chip? In: Advances in Cryptology - EUROCRYPT 2014 - 33rd Annual International Conference on the Theory and Applications of Cryptographic Techniques, Copenhagen, Denmark, May 11-15 2014. Proceedings, 459–476. https://doi.org/10.1007/978-3-642-55220-5_26.

Gao, Y, Cheng W, Zhang H, Zhou Y (2018) Cache-collision attacks on gpu-based AES implementation with electro-magnetic leakages In: 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering, TrustCom/BigDataSE 2018, New York, NY, USA, August 1-3 2018, 300–306. https://doi.org/10.1109/TrustCom/BigDataSE.2018.00053.

Gao, Y, Zhang H, Cheng W, Zhou Y, Cao Y (2018) Electro-magnetic analysis of GPU-based AES implementation In: Proceedings of the 55th Annual Design Automation Conference, DAC 2018, San Francisco, CA, USA, June 24-29 2018, 121:1–121:6. https://doi.org/10.1145/3195970.3196042.

Goodwill, JG, Jaffe J, Rohatgi P (2011) A testing methodology for side-channel resistance validation In: NIST non-invasive attack testing workshop, Vol. 7.

Jiang, ZH, Fei Y, Kaeli DR (2016) A complete key recovery timing attack on a GPU In: 2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 12-16 2016, 394–405. https://doi.org/10.1109/HPCA.2016.7446081.

Jiang, ZH, Fei Y, Kaeli DR (2017) A novel side-channel timing attack on GPUs In: Proceedings of the on Great Lakes Symposium on VLSI 2017, Banff, AB, Canada, May 10-12 2017, 167–172. https://doi.org/10.1145/3060403.3060462.

Lim, RK, Petzold LR, Koç ÇK (2016) Bitsliced high-performance AES-ECB on GPUs In: The New Codebreakers - Essays Dedicated to David Kahn on the Occasion of His 85th Birthday, 125–133. https://doi.org/10.1007/978-3-662-49301-4_8.

Luo, C, Fei Y, Luo P, Mukherjee S, Kaeli DR (2015) Side-channel power analysis of a GPU AES implementation In: 33rd IEEE International Conference on Computer Design, ICCD 2015, New York City, NY, USA, October 18-21 2015, 281–288. https://doi.org/10.1109/ICCD.2015.7357115.

Naghibijouybari, H, Neupane A, Qian Z, Abu-Ghazaleh NB (2018) Rendered insecure: GPU side channel attacks are practical In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS 2018, Toronto, ON, Canada, October 15-19 2018, 2139–2153. https://doi.org/10.1145/3243734.3243831.

Nishikawa, N, Amano H, Iwai K (2017) Implementation of bitsliced AES encryption on cuda-enabled GPU In: Network and System Security - 11th International Conference, NSS 2017, Helsinki, Finland, August 21-23 2017, Proceedings, 273–287. https://doi.org/10.1007/978-3-319-64701-2_20.

Patrick, C A bitsliced implementation of ECB and CTR AES. https://github.com/conorpp/bitsliced-aes. Accessed Mar 2016.

Schneider, T, Moradi A (2015) Leakage assessment methodology - A clear roadmap for side-channel evaluations In: Cryptographic Hardware and Embedded Systems - CHES 2015 - 17th International Workshop, Saint-Malo, France, September 13-16 2015, Proceedings, 495–513. https://doi.org/10.1007/978-3-662-48324-4_25.

## Acknowledgements

This work was supported in part by National Natural Science Foundation of China (No. 61632020, UI936209) and Beijing National Science Foundation (No. 4192067).

## Funding

Not applicable.

## Author information

### Contributions

YG provided the general idea of the work and wrote the whole manuscript; YZ provided valuable advice on the organization and expression of the paper; WC gave great help with the experimental setups and the optimization of the analysis methods. All authors read and approved the final manuscript.

## Additional information

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## About this article

### Cite this article

Gao, Y., Zhou, Y. & Cheng, W. Efficient electro-magnetic analysis of a GPU bitsliced AES implementation.
*Cybersecur* **3**, 3 (2020). https://doi.org/10.1186/s42400-020-0045-8

DOI: https://doi.org/10.1186/s42400-020-0045-8

### Keywords

- GPU-based cryptographic implementations
- Side-channel analysis (SCA)
- Electro-magnetic attacks (EMA)
- Micro-architectural vulnerabilities
- Combinational analysis