A secure and highly efficient first-order masking scheme for AES linear operations

Due to its provable security and remarkable device-independence, masking has been widely accepted as a noteworthy algorithmic-level countermeasure against side-channel attacks. However, relatively high cost of masking severely limits its applicability. Considering the high tackling complexity of non-linear operations, most masked AES implementations focus on the security and cost reduction of masked S-boxes. In this paper, we focus on linear operations, which seems to be underestimated, on the contrary. Specifically, we discover some security flaws and redundant processes in popular first-order masked AES linear operations, and pinpoint the underlying root causes. Then we propose a provably secure and highly efficient masking scheme for AES linear operations. In order to show its practical implications, we replace the linear operations of state-of-the-art first-order AES masking schemes with our proposal, while keeping their original non-linear operations unchanged. We implement four newly combined masking schemes on an Intel Core i7-4790 CPU, and the results show they are roughly 20% faster than those original ones. Then we select one masked implementation named RSMv2 due to its popularity, and investigate its security and efficiency on an AVR ATMega163 processor and four different FPGA devices. The results show that no exploitable first-order side-channel leakages are detected. Moreover, compared with original masked AES implementations, our combined approach is nearly 25% faster on the AVR processor, and at least 70% more efficient on four FPGA devices.


Introduction
Side-Channel Attacks (SCAs) exploit physical leakages (e.g. running time (Kocher 1996), power consumption (Kocher et al. 1999) and electromagnetic radiations (Quisquater and Samyde 2001)) of a cryptographic implementation to recover the corresponding secrets. During the past two decades, SCAs have been proved to be serious threats to practical security of cryptographic devices, like CPUs, GPUs, smart cards, ASICs and FPGAs. The recent Meltdown and Spectre (Prout et al. 2018) attacks are typical examples of utilizing cache running time to steal secret data. As a consequence, countermeasures against SCAs must be developed and applied to protect the secret information. In fact, the necessity of protection against SCAs has been reached a broad consensus in the industry. There have been international standards that require cryptographic modules to protect against SCAs. For example, FIPS 140-3 (FIPS Publication 140-3 2019) and ISO/IEC 17825:2016(JTC 2016, which both are security standards for cryptographic modules, claim that cryptographic modules with high level security should concern with the mitigation of SCAs. Obviously, designing secure (symmetrical) cryptography implementations against SCAs has become one of the most important hot spot in physical security.
Among different kinds of SCAs, first-order attacks, such as first-order differential power analysis (DPA) (Kocher et al. 1999) and first-order correlation power analysis (CPA) (Schneider and Krishnamoorthy 1996), are widely used due to their low costs. Thus, they are the first into considerations when guarding devices against SCAs.
ISO/IEC 17825:2016 specifies the use of Test Vector Leakage Assessment (TVLA) framework as a certain measure to assess whether an implementation of symmetrical cryptography is vulnerable to first-order attacks or not. According to the highest security level, namely ISO/IEC 17825 level-4, there should be at least 100,000 traces collected to test whether a specific implementation is resistant to them.
Due to its provable security and remarkable deviceindependence, masking (Rivain et al. 2009) has been one of the most wide-adopted countermeasures against SCAs (Duc et al. 2019). Specifically, masking countermeasure randomizes the subtle dependency between a sensitive intermediate values and its corresponding side-channel leakages by splitting the sensitive intermediate into several shares. Usually, it is called d th -order masking when the sensitive intermediate is split into d+1 shares to thwart d th -order SCAs (Ishai et al. 2003). Namely, a first-order masked implementation should be able to protect against first-order attacks. However, the main drawback of a masked implementation is that its cost is relatively high compared with the unprotected counterpart. In order to make masking countermeasure more practical, it is worthwhile to reduce its cost as much as possible.
Since relatively high complexity of non-linear operations, most studies focus on improving non-linear operations for masking schemes (Rivain and Prouff 2010). Specifically, for symmetric cryptographic algorithms such as AES, most studies focus on increasing the S-box efficiency by improving the look-up table (LUT) operations (Coron 2014;Coron et al. 2018) or mending polynomial computation over the finite field ). However, the security and efficiency of linear masking schemes (For sake of simplicity, the masking scheme of linear operations is called linear masking scheme in the rest of this paper) are widely ignored. In this paper, we focus on the security and efficiency of masked AES linear operations, and we find linear part should not be ignored. As for security, we find that there are serious flaws in MixColumns operations in some masking schemes because correlative masks are adopted in each round. As for efficiency, to figure out which part of operations account more in whole cryptographic computation, we evaluate the running time proportions of non-linear and linear operations in state-of-the-art masking schemes. Due to the primacy of first-order security, we evaluate those in four different first-order masking schemes firstly, and these masking schemes are described in details as follows.
SP. SP (Schramm and Paar 2006) scheme is designed to be secure at any order d. This scheme is now regarded to be first-order and second-order secure, as an third-order attack against it was shown in (Coron et al. 2007). Additionally, they claimed that it is sufficient to use a single 8-bit mask for the entire masked AES algorithm.
ASCAD. ASCAD (Prouff et al. 2018) is a public dataset for the study of side-channel analysis, which is also implemented by a first-order masking AES scheme. So we use the same name to represent this masking scheme in this paper.
RSMv1. RSM (Nassar et al. 2012) is a low entropy masking scheme, which aims at keeping performances and complexity close to unprotected AES design while being as robust against first-order SCAs as other masking in cryptographic implementations. In this paper, we use RSMv1 to denote the first vision of RSM. Note that this masking scheme was adopted by DPA Contest v4.1 as a study case for side-channel security. Moreover, it has been proven in (Veshchikov and Guilley 2017) that the linear part of RSMv1 has a first-order flaw, and similar flaw also happens in SP scheme.
RSMv2. RSM was revisited in (Bhasin et al. 2014), and we use RSMv2 to denote this improved RSM scheme. Actually, RSMv2 was adopted by DPA Contest v4.2, and is one of the most popular first-order masking schemes.
All benchmarkings have been done on an Intel Core i7-4790 CPU. The results are shown in Fig. 1.
Moreover, we evaluate the running time proportions of non-linear and linear operations in three higher order masking schemes (proposed in (Rivain and Prouff 2010;Coron 2014;Coron et al. 2018), and denoted by RP, Cor and CRZ respectively in this paper) with d=1 and d=2 in same experimental setting. The results are shown in Fig. 2.
It can be seen that linear operations account for at least 70% in these four first-order masking schemes, while they account for less than 20% in second-order masking schemes. Note that complexity of non-linear part grows much faster than that of linear part with increasing d, so linear operations would account for lower in whole computations when masking order d gets higher than two. Considering that first-order security is primacy in masking countermeasure and linear operations are relatively important in first-order masking, in this paper we mainly  focus on the optimization of linear operations to make first-order masking countermeasure secure and more efficient.
Our Contribution. In this paper, we find that AES linear operations are widely ignored in state-of-the-art first-order masking countermeasures. Specifically, these masking schemes are insecure when correlative masks are adopted in each round, and inefficient because some linear operations are called redundantly. Consequently, we propose a secure and highly efficient first-order linear masking scheme for AES, then heuristically prove its security and critically analyze its efficiency. In fact, this linear masking scheme can be combined with existing non-linear masking schemes, which not only maintains the original security level, but also further improves the implementation efficiency. As a concrete illustration, we apply it to four state-of-the-art AES masking schemes on an Intel Core i7-4790 CPU by replacing their linear operations with our proposal. The results demonstrate that by combining our proposal, these popular masking schemes are approximately 20% faster compared to the original ones. Furthermore, we combine a well-studied masking scheme named RSMv2 with our proposal, then evaluate its practical security, and carry out a detailed performance comparison on five different embedded devices.
The results of security evaluation verify that there is no available first-order leakage in both software and hardware implementations. In addition, the results of efficiency evaluation show that the implementations combined our proposal are nearly 25% faster than existing implementations on AVR ATMega163 Processor, and the efficiency is improved by more than 70% on four different FPGA devices.
The rest of our paper is organized as follows. Sec. 2 reviews the basic knowledge of masking schemes for AES, and Sec. 3 illustrates the issues of the masking schemes in their linear operations. In Sec. 4, we propose our new first-order masking scheme for AES linear operations and analyze its security and efficiency in theory. Then in Sec. 5 we evaluate the performance of our proposal on different implementations. Finally, Sec. 6 concludes the paper.

Preliminary
In this section, we review the existing masking schemes for AES. In masking schemes, each sensitive value is split into d + 1 shares x 0 , x 1 · · · x d , and their relation can be expressed as Eq.(1). x where x means the sensitive value while x i means i-th split share, and ⊥ denotes the mask operation. In the rest of the paper, we shall consider that ⊥ is the exclusiveor (XOR) operation denoted by ⊕. Usually, the d shares {x 1 , · · · , x d } called masks are randomly picked up and the {x 0 } called the masked value is processed such that it satisfied Eq. (1). Recently, most studies focus on designing the masked S-box with d + 1 shares, such as d th -order masking scheme based on LUT (Coron 2014), based on polynomial computation over the finite field (Coron et al. 2014) and so on. But the designing for linear operations is widely ignored. As illustrated above, the costs of linear operations account much more than that of non-linear operations in first-order masked implementations. Thus in this paper, we focus on first-order masking scheme, and accordingly each sensitive value is split into two shares (d=1). Recently, a common masking approach for linear operations is to compute each share successively (Schramm and Paar 2006;ParisTech 2015;Coron et al. 2014), as shown in Fig. 3. First of all, 16 S-boxes are processed respectively, and the input 8-bit value of each S-box is split into 2 shares, so each S-box is fed with 2*8 bits. Next, all 16*8 bits of each share are fed to two linear oper- Thus, it is obvious that this masking scheme for linear operations will cost 2 times as much as unprotected AES implementation. In this paper, to distinguish the two shares and sensitive value clearer, x 0 , x 1 and x are expressed as ST, M and V respectively. Then we consider the masked AES implementation, while a general masked AES-128 implementation is shown in Algorithm 1, with the following linear operations: AddRoundKey. AddRoundKey is a Boolean operation, which XOR the state data with round keys. As for Boolean masking schemes, the round keys need to do XOR with only one share (generally ST), so this operation costs the same in masked implementation with unprotected one.

Remask.
Remask is not a necessary operation for all first-order masking schemes. Specifically, remask operation refreshes the two shares ST and M without changing ST ⊕ M (it is different with mask refresh, since the new M may be not generated randomly). For example, the masking scheme in (Prouff et al. 2018) needs to remask before and after MSbox, which corresponds to the line 6, 8, 15 and 17 in Algorithm 1. And the masking schemes in (Nassar et al. 2012;Bhasin et al. 2014) need to remask after MixColumns, which corresponds to the line 13 in Algorithm 1.
ShiftRows. ShiftRows is necessary for AES. In the existing first-order masking schemes, ShiftRows operation executes once for each share in one round, as shown in line 9 and line 10 in Algorithm 1.
MixColumns. MixColumns is also necessary for masked AES schemes. MixColumns is usually expressed as several sequences of multiplications by 2 over F 2 8 and XOR operations. Specifically, it can be expressed as Eq. (2).
where i ∈ {0, 1, 2, 3}. Actually, MixColumns can be implemented as Eq. (2) directly, which is an original and trivial implementation but is still adopted by (Coron 2014;Coron et al. 2018). Then it was improved in (Fang 2009), which is shown as Algorithm 2. The GF256MUL2(·) operation denotes the multiplication by 2 over F 2 8 . Actually, this improved MixColumns module has been adopted in (Nassar et al. 2012;Bhasin et al. 2014;Prouff et al. 2018 for j = 0 to 3 do 4: end for 7: end for However, there are security flaws and redundant calls in the linear operations, which will be explained in next section.

Security and efficiency of linear operations
In this section, we illustrate the flaws of existing masked AES implementations in linear operations. Then we show a protected algorithm as an example, which satisfied firstorder secure but not efficient. This example could help to understand our proposal described in next section.
In first-order masking schemes, each sensitive intermediate should be randomized by at least one randomness. Certainly, the output of a linear operation (e.g. XOR) should also be randomized, so we have Eq. (3).
Two sensitive intermediates V 0 and V 1 are randomized by masks M 0 and M 1 separately, while masked values are denoted by ST 0 and ST 1 . Then, the output (V 0 ⊕ V 1 ) is randomized by (M 0 ⊕ M 1 ). If these masks are correlative (e.g. in SP scheme, M 0 = M 1 ), then (V 0 ⊕ V 1 ) are not randomized enough and this masking scheme cannot reach first-order security. Similarly, in a masked AES scheme, these linear operations cannot reach first-order security while the masks in each round are correlative. Specifically, in MixColumns module, the different masked values (ST i ) get XOR without any new random number introduced. Thus if these intermediates are protected by correlative masks, the new intermediate (their XOR results) may no longer be protected.
It can be verified by simulated experiments. We simulated the leakages of one loop in Algorithm 2 using a very popular leakage model (Ming et al. 2020), which can be expressed by L = HW (x) + N, where L denotes the leakages and HW (x) denotes Hamming weight of the intermediate x. N denotes Gaussian noise and we use σ to denote its standard deviation. σ is set to 2 in all simulated experiments. We simulated 200 points totally, and 19 points among them are corresponding to 3+4× 4=19 operations in one loop in Algorithm 2 while other points are simulated by Gaussian noise. We launch t-test (Gilbert Goodwill et al. 2011) to detect the first-order leakages on these simulated points, and the results are shown in Fig. 4a. It can be seen that first-order leakages are really obvious. Additionally, this flaw is also found in RSMv1. In this scheme, third bit of masks after XOR operation are totally the same, so the third bit of output is not randomized. Utilizing this unprotected bit, first-order leakages can also be detected, which are shown in Fig. 5b. Similar results were illustrated in (Veshchikov and Guilley 2017), which utilized this unprotected bit to launch first-order attack on DPA contest v4 dataset, and successfully recovered the secret key. However, they thought this attack worked due to improper mask set, but the implementation with their revisited mask sets is still first-order attackable (Ming et al. 2020).
We show one protected example to make our proposal easier to understand. Specially, we generate a new randomness s and introduce it to line 2 and line 4 in Algorithm 2 to protected sensitive intermediates, and the protected example is shown in Algorithm 3. It is easy to verify that the inputs and outputs of Algorithm 3 are the same as those of Algorithm 2, since randomness s is cancelled out before output.
It is also trivial to verify the security of Algorithm 3. We denote by I the set of intermediate variables that processed during an execution of Algorithm 3, and we denote by EI the set of extra variables due to accumulation in an intermediate variable. Table 1 lists these variables of Algorithm 3. In order to prove Algorithm 3 is secure against first-order SCA, we need to show that all variables are protected at least by one randomness. Specifically, I 1 and I 2 are straightforwardly secure. I 3 , I 4 and I 5 are protected by s  Since t (resp. tmp) is still protected by s (resp. GF256MUL2(s)) which is independent with ST, the two sets EI 1 and EI 2 are secure as well. Note that the output OT might be attackable since s is removed. However, if the 16 masks M[ 16] are the same, OT will be still protected well. Because the masks M are also removed in line 3 and line5, so GF256MUL2(tmp) and t are only protected by s, which is removed in line 6. Then OT will be under the same protection of the masks of ST, namely M.
In addition, we simulate the points for each operation in protected example with 16 same masks and detect their first-order leakages. There are 200 simulated points while 24 points among them are corresponding to 24 operations in one loop in Algorithm 3. The detection results for protected SP and protected RSMv1 are shown in Figs. 4b and 5b respectively. It can be seen that no first-order leakages detected after protection. And our proposal described in next section is an improvement from this example.
On the other hand, these linear operations usually become inefficient because some linear operations are called repeatedly and redundantly. Only considering the necessary operations, the ShiftRows and MixColumns are called for each shares, as shown in line 9, line 10 and line 11, line 12 in Algorithm 1. Clearly, calling linear operations repeatedly is surely satisfied with correctness for a linear function, since F(ST ⊕ M) equals to F(ST) ⊕ F(M) for a linear function F(·). However, the masks M are randomly picked up, and linear operations for M seem wasteful. Thus, it is important to design a secure linear algorithm without changing the masks M while ST is updated, then the linear operations for masks M can be saved.

Linear masking scheme
In this section, we propose a secure and highly efficient first-order linear masking scheme for AES, then analyze its efficiency and prove its security.

Core idea
In fact, designing two linear operations is a waste of resources for masked AES implementation. Therefore, we introduce the idea that: reduce two masked linear operations in Fig. 3 to one while extra randomness is induced to maintain the security level, as shown in Fig. 6. Consequently, the efficiency on linear operation can be considerably increased by a factor of two.
Following this core idea, we propose our first-order linear masking scheme.

First-order linear masking scheme
As shown above, there are 3 necessary modules of the linear operations in AES: AddRoundKey, ShiftRows and MixColumns. Among them, AddRoundKey and ShiftRows have been secure and efficient. As for AddRoundKey, only one share needs to do XOR with corresponding round key, which cost the same with unprotected AES. As for ShiftRows, we can just reorder the computation of 16 S-boxes (Bhasin et al. 2014) to save Fig. 6 The process of our proposed first-order masking scheme in one AES round half costs. However, designing the masked MixColumns module is the most complex and significant for improving the linear operation since it costs the highest in these modules.
The masked MixColumns is shown as Algorithm 4, which is called SEMixColumns in this paper to distinguish it from others. In order to combine two MixColumns into one while maintaining the same security, new generated random numbers (line 1) are needed to protect the intermediates, then part of masks can be removed. Finally, these random numbers s are removed in line 10 while this line is protected by the masks of ST 4i+j , and the outputs of Algorithm 4 (16 bytes OT[ 16]) are still protected by M, so the MixColumns for M in Algorithm 1 (line 12) can be saved.  Thus, line 11 and line 12 in Algorithm 1 are replaced by "ST ← SEMixColumns(ST, M)" in our masked AES algorithm, while other parts are all the same. Note that the ST should be cleared to zero before overwriting with OT, otherwise there will be extra leakages corresponding to the sensitive intermediate GUL2MUL(tmp) ⊕ t. One MixColumns module is saved by the optimization, although SEMixColumns costs more since new variables are introduced. It can be counted that there are totally 20 GF256MUL2 and 120 XOR in Algorithm 4. Since Algorithm 4 is only called once in masked AES implementation, it costs less than other algorithm in sum. In order to make clearer comparison of the costs on linear operations, we list the required number of basic operations (XOR, Shift and GF256MUL2) of these linear implementations in one AES round, as shown in Table 2.

Algorithm 4 Secure and efficient proposal of one Mix-Columns (called SEMixColumns in this paper
It can be seen in Table 2 that our proposal is advantaged particularly on the number of GF256MUL2. Our proposal only cost 20 GF256MUL2 in each AES round. So the implementation for GF256MUL2 certainly affects the efficiency gain though our proposal. Moreover, Table 2 is only a theoretical comparison, the practical improvements may be different more or less due to different implementation targets.

Security analysis
In this section, we prove that our proposed masking scheme can actually reach first-order security against SCAs.
First of all, other linear parts except MixColumns have been proven to reach first-order security. As for Mix-Columns, both original implementation (Coron et al. 2018) and efficient implementation (Veshchikov and Guilley 2017) will be threaten by first-order attacks if the masks in each round are correlative, which has been demonstrate in Sec. 3. So we need to prove that our proposed SEMixColumns (Algorithm 4) is able to protect against first-order attacks, namely each sensitive value is protected by a random number.
We heuristically prove the first-order security of Algorithm 4 with the approach similar to that in Sec. 3. We also denote by I the set of intermediate variables that processed during an execution of Algorithm 4, and we denote by EI the set of extra variables due to accumulation in an intermediate variable. Table 3 lists these variables of Algorithm 4. Specifically, I 1 , I 3 , I 4 , I 5 and I 7 are the same as the sets in Table 3, which have been proved to be secure. And I 2 and I 6 are also protected by s or GF256MUL2(s). Since random numbers s are removed in line 10 while ST is protected by the masks M, the outputs I 8 are still protected by M. In addition, if t (resp. tmp) is not cleared to zero before overwriting with ST[ 4i] (resp. ST[ 4i + j]), there will also be two set of extra variables, which are denoted by EI 1 and EI 2 in Table 3. Similarly, EI 1 and EI 2 are protected by s or GF256MUL2(s) as well.
In summary, all sensitive intermediates in linear operations are still under protection, which proves that our proposal for linear operations can theoretically achieve first-order security.

More efficient implementation in certain cases
Our proposal as shown in Algorithm 4 can be optimized further in certain cases. Specifically, if Eq. (4) is satisfied, the masks in each column after ShiftRows are equal (the same color in Fig. 7), then r 0 , r 1 , r 2 and r 3 in Algorithm 4 equal to 0. Therefore, there is no need for the XOR operations related with these intermediates, which can save an additional 40 XOR per round.
However, the optimized method cannot be adopted by all first-order masking schemes since: (1) this method requires the masks to be a regular sequence, which is not compatible with this more efficient implementation).
(2) after the Mixcolumns, it is necessary to remask to ensure continuous implementation process (as shown in Fig. 7). So the masking scheme should contain the remask operation after Mixcolumns (line 13 in Algorithm 1), or this optimized method would not work for the scheme duo to extra XOR introduced.

Experiment
In this section, we combine our linear masking algorithm with different first-order masking schemes, and evaluate their practical efficiency in a CPU implementation. Next, we choose the scheme named RSMv2, which have been studied well because of DPA Contest v4.2, as our main target. Specifically, RSMv2 is implemented in both software and hardware, then its side-channel security and practical efficiency are evaluated later.

Evaluation of running time on CPU
A masked AES implementation is consist of two parts: non-linear part and linear part, so the speedup of our proposal toward whole AES execution time is also affected by the implementations of non-linear part. Generally speaking, if the non-linear part is well-designed and efficient, the improvement of our proposal will get more remarkable. In this experiment, we combine our proposal with four state-of-the-art first-order masking schemes, which have been introduced in details. These schemes are implemented on an Intel Core i7-4790 CPU running at 3.60GHz. Considering that the function GF256MUL2 also affects speedup of our proposal, we implement GF256MUL2 by two methods: (1) Computation (slower but save memories). Rotate the 8-bits input to the left by 1 bit, then judge whether the most significant bit equals to 1. If so, then XOR this intermediate with 0x1b.
(2) LUT (faster but need extra memories). Look up the outputs in a stored table. The results of speedup in different first-order masking schemes are shown in Table 4. It can be seen that our proposal can get approximately 20% speedup on these masking schemes. Moreover, for each masking scheme, the speedup of our proposal is greater while GF256MUL2 is achieved by computation, because computation based GF256MUL2 requires more running time than LUT based one, which lead to more cost savings by using our proposal.
Moreover, we discuss about how the compiler treats these software implementations, and the details are shown in Appendix.

Evaluation of embedded implementations
Considering that RSMv2 has been studied well because of DPA Contest v4.2 and its software implementation has been published in (ParisTech 2015), we select this popular first-order masking scheme as our study object. And in our experiments, we replaced the linear operations by our more efficient proposal while keeping the non-linear operations unchanged.
As for software implementation, we adopt the similar RSMv2 implementation as used in DPA Contest v4.2. We programmed the hex-files on a FunCard with an Atmel ATmega 163 micro-processor, which is the suggested platform from DPA Contest v4.2. The implementation codes of RSMv2 scheme can be also downloaded from the DPA Contest v4.2 (ParisTech 2015), which uses C Language on connection between the processor and computer, and uses assembly language on cryptographic implementation. To take a fair comparison, we only use assembly language to change the linear operations. Note that GF256MUL2 is implemented by LUT in the published codes, and we implement this function in the same way.
Compared to the RSMv2 implementation, the code size of our proposal gets similar. Actually both of these two hex-files are 42KB. As for execution time, our proposal gets obviously faster. The FunCard is connected with computer by SASEBO-W board, and we measured the running time of RSMv2 and our proposal with a trigger in codes and an oscilloscope. Since ATmega 163 micro-processor is running at 8MHz, it is easy to get the results on the software implementation in cycles, which are show as Table 5. The results show that using our proposal the masked AES is nearly 25% faster than before on the software implementation.
As for hardware implementations, the RSMv2 scheme in (Bhasin et al. 2014) and our proposal are both implemented in Kintex-7, Virtex-6, Virtex-5 and Spartan-3E FPGA devices, the consumed resources for masked AES are shown in Table 6. Since these AES implementations take 128-bit inputs and outputs, the data size is 128 bit. And the latency of these implementations is 14 cycles. The function GF256MUL2 can be implemented by different methods, which affects the speedup of our proposal. 'Computation' in this line means that GF256MUL2 is implemented by shift and XOR operations, 'LUT' means it is implemented by look-up table.
While getting the speedup for each masking scheme, we compare our proposal with first-order secure and the most efficient implementation The delay column represent the longest path between registers. Note that non-linear part is implemented using fixed BRAM costs, which is a common and popular implementation (Nassar et al. 2012). As for GF256MUL2, it is implemented by computation on FPGA. Above all, it can be seen that our proposal greatly outperforms the RSMv2 scheme in efficiency on these four different FPGA devices, and the efficiency are all increased by more than 70%.
• On Kintex-7 device, which is built on 28-nm process technology, our proposal achieves better performance with 70.78% efficiency gain compared to existing implementation. • Virtex-6 is also built on 28-nm process technology, and the performance on it gets worse (more delays and LUTs) than that on Virtex-6 device. But our proposal still achieves better performance with 71.35% efficiency gain, which is similar to the result on Kintex-7 device. • Virtex-5 is for integration at 56-nm, but the performance on it is similar with that on Virtex-6, and our proposal achieves better performance with 80.60% efficiency gain, which is the most in these four devices. • Spartan-3E is built on 90-nm process technology. On Spartan-3E device, our proposal can also achieve better performance with 74.38% efficiency gain.
An interesting finding is that the delays and LUTs on Spartan-3E are much more than other three devices. We think it is because Kintex-7, Virtex-6 and Virtex-5 are 6input LUTs based FPGA, while Spartan-3E devices is 4input LUTs based one. Therefore, the implementation on Spartan-3E requires more LUTs (almost 50% than others), and accordingly needs more running time.
From the results of these experiments, our proposal for linear operations of AES masking schemes greatly improves the efficiency of masked AES on both hardware and software cryptographic implementations.

Side-channel security
To evaluate the practical security of our proposal, we launch t-tests (Gilbert Goodwill et al. 2011) and first-order CPA (Rivain 2008) to both software and hardware implementations respectively. The traces for software implementation are collected from an Atmel ATmega 163 micro-processor on SASEBO-W board with an Agilent DSO9104A oscilloscope. The sampling rate is set to be 20MS/s, and the collected traces are compressed with a step of 4 points. The traces for hardware implementation are collected from a Xilinx Kintex-7 family device XC7K160T-1FBG676 embeded on SAKURA-X board with also an Agilent DSO9104A oscilloscope. The sampling rate is set to be 1GS/s.
We use leakage detection test to evaluate whether any first-order side-channel leakages existing in our proposal. Specifically, We apply first-order t-test on both software implementation and hardware implementation. For software implementation, we totally collect 200,000 traces, each trace has 12,000 points and is compressed to 3,000 points. Considering that FPGA implementation is faster and gets more noise, we totally collect 560,000 traces, each trace has 750 points. The results of first-order t-tests are shown in Fig. 8. It is obvious that the first-order leakages of two implementations are always lower than ± 4.5, which means our proposal has no exploitable first-order leakages on these two implementations.
In Fig. 8a, the leakages near No.400 point get more obvious than other points. Actually, it is the operations for data  First-order attacks on software RSMv2 implementation after combining our proposal input (reading plaintexts), so these points cannot be used to recover the secret key, and the following attack results also demonstrate it. We used 200,000 traces to launch a first-order attack on the software implementation, and used 560,000 traces to launch first-order attack on hardware implementations. Figures 9 and 10 show the success rate and guessing entropy results (Rivain 2008) for these two implementations. We can see that the success rate always equals to zero in both implementations, and there is no downward trend in the guess entropy results.
Above all, all experimental results show that our proposal can actually provide first-order security to the software and hardware implementations.

Application in higher order masking schemes
Our proposal performs well in first-order masking schemes in different implementations. However, we do not think it can achieve similar performance in higher order masking implementations.
We have shown the running time proportion of linear operations in different AES higher order masking implementations in Fig. 2, and we find that the significance of the linear part progressively decreases with the increase of the masking order. Take CRZ (Coron et al. 2018) higher order masking scheme as an example, which is state-ofthe-art provably secure masked implementation for AES at any order d. When the order of CRZ masking scheme exceeds two, linear operations account for less than 3% of the total AES implementation. In fact, the evaluations on other higher order masking schemes get similar results. Therefore, we think that it is of little meaning to improve the efficiency of AES higher order masking schemes by optimizing their linear operations. Namely, it should be still focused on non-linear operations to improve the higher order masking schemes.

Conclusion
In this paper, we point out that there are some design defects in the linear operation of state-of-the-art masked AES implementations. These defects not only decrease the security level of some popular first-order masking schemes, but also reduce their implementation efficiency. In order to remedy these defects, we propose a first-order linear masking scheme with inspired performance, then Fig. 10 First-order attacks on hardware RSMv2 implementation after combining our proposal prove its security and analyze its efficiency. Specifically, this masking scheme is designed for AES linear operations, so it can be combined with any existing masked AES scheme to further improve the implementation efficiency. To demonstrate its practical security and efficiency, we combine our proposal with four different state-of-theart first-order masking schemes on an Intel Core i7-4790 CPU. The results show that the four masking schemes can get roughly 20% faster than before by using our proposal. Moreover, we select a well-studied masking scheme named RSMv2, and implement it on five different embedded devices. The results show that by using our proposal, the masking implementations achieve much higher efficiency without exploitable first-order leakages.

Appendix
To figure out how the compiler treats the software implementation, we optimize the implementations with the help of "-O3" option. Then we measure the performance of all popular first-order masking schemes on an Intel Core i7-4790 CPU again. The results are shown in Table 7. It can be seen that the speedup is not impressive under this environment.
We additionally take ARM as embedded software target. Specifically, we evaluate the performance of unprotected AES and four masking schemes on a STM32F4 MCU based on ARM Cortex-M4. Thanks to the simulator in Keil uVision5, the performance results of all the schemes are straightforwardly in cycles, as shown in Table 8. It can  be seen that our proposal is nearly 15% faster on the typical 32-bit processor, which is not as much as the benefis on 64-bit Intel processor. Namely, the performance gains are different due to implemented compiler.