In-depth Correlation Power Analysis Attacks on a Hardware Implementation of CRYSTALS-Dilithium

During the standardisation process of post-quantum cryptography, NIST encourages research on side-channel analysis for candidate schemes. As the recommended lattice signature scheme, CRYSTALS-Dilithium has seen limited side-channel research on its hardware implementations, and current attacks are incomplete or require a substantial number of traces. We therefore conduct a more complete analysis of the leakage of an FPGA implementation of CRYSTALS-Dilithium using Correlation Power Analysis (CPA), with which partial private key coefficients can be recovered from a minimum of 70,000 traces. Furthermore, we optimise the attack by extracting Points of Interest using known information that leaks due to parallelism (named CPA-PoI) and by iteratively utilising parallel leakages (named CPA-ITR). Our experimental results show that, compared with the CPA method, CPA-PoI reduces the number of traces by up to 16.67% and CPA-ITR by up to 25%, and that they increase the number of recovered key coefficients by up to 55.17% and 93.10%, respectively, when using the same number of traces. These results suggest that the FPGA implementation of CRYSTALS-Dilithium is more vulnerable to side-channel analysis than previously thought.


Introduction
With quantum computers, conventional public-key cryptosystems such as RSA and DSA can be broken efficiently by Shor's algorithm (Shor 1994). In response to this threat, the National Institute of Standards and Technology (NIST) initiated a post-quantum cryptography (PQC) standardisation process in December 2016 (Moody 2016), and announced four algorithms for standardisation in July 2022: CRYSTALS-Kyber (Avanzi et al. 2019) was selected as a post-quantum Key Encapsulation Mechanism (KEM), and CRYSTALS-Dilithium (Dilithium for short) (Ducas et al. 2018), Falcon (Fouque et al. 2018), and SPHINCS+ (Bernstein et al. 2015) were selected as post-quantum signature schemes. Amongst the signature schemes, NIST recommends Dilithium as the primary algorithm for digital signatures.
Dilithium ensures Strong existential Unforgeability under Chosen Message Attacks (SUF-CMA), and its security rests on lattice-based hard problems. In theory, the security of a cryptographic algorithm relies on its mathematical structure, but in practice implementations are also threatened by side-channel attacks (SCAs). The importance of side-channel security for PQC is also emphasised in the NIST PQC standardisation process. Existing side-channel attacks on implementations of Dilithium typically target the random number generation, the Number-Theoretic Transform (NTT), or polynomial multiplication. Side-channel leakage from these operations has been investigated using various methods, including Simple Power Analysis (SPA), Correlation Power Analysis (CPA) (Brier et al. 2004), and profiled attacks (Chari et al. 2002). However, this research mostly focuses on software implementations; hardware implementations of Dilithium have received little attention because they are harder to attack (Steffen et al. 2022). In light of this, we study the security of hardware implementations of Dilithium under side-channel attacks.
In this paper, we investigate the vulnerabilities of Dilithium and propose practical side-channel attacks that recover the private key by analysing a typical Field-Programmable Gate Array (FPGA) implementation of Dilithium.

Related work
Typical implementations of Dilithium are based on ARM processors or FPGAs. For ARM-based implementations, Ravi et al. (2018) proposed a non-profiled side-channel attack targeting polynomial multiplication, by which part of the private key was extracted and used to forge signatures. However, they believe that the attack exceeds polynomial time. Subsequently, Chen et al. (2021) proposed a conservative CPA method to reduce the key guessing space and a fast two-stage method to further reduce the guessing space when attacking polynomial multiplication. As a result, they were able to fully recover the private key using the conservative CPA method with only 157 power traces. The hybrid method, combining the fast two-stage and conservative CPA methods, reduced the attack's execution time by 87%. Qiao et al. (2023) proposed a new non-profiled attack method, called Public Template Attack (PTA), on both unprotected and protected implementations of Dilithium (Migliore et al. 2019), targeting the random polynomial y and successfully recovering the private key. Non-profiled attacks focus primarily on polynomial multiplication, while profiled attacks focus primarily on NTT operations. Primas et al. (2017) proposed a generic method targeting NTT operations that requires establishing templates for all possible multiplications in butterfly operations, which is costly. Han et al. (2021) employed a machine learning-based method to attack NTT operations, recovering keys using 60,000 power traces. Berzati et al. (2023) reconstructed a given coefficient in a predicted vector to determine whether it is zero, thus recovering the private key using linear algebra methods with 700,000 power traces.
For FPGA-based implementations, Steffen et al. (2022) proposed electromagnetic analysis targeting polynomial multiplication in Dilithium. They conducted two main attacks. Firstly, a profiled SPA method was used to attack the Decode and first-stage NTT operations. Because multiple key coefficients are processed in parallel in the implementation, the authors assume that the adversary can fix all key coefficients but one, which is too strong an adversarial capability. Secondly, a CPA on the least significant bit (LSB) was used to attack the polynomial multiplication. In theory, a multi-bit model depicts the information leakage of a hardware implementation more accurately, and the Hamming distance model is better suited to attacking hardware. Although the attack targets both ct_0 and cs_1, ct_0 does not affect the security of the Dilithium scheme. Only one key coefficient of cs_1 was recovered using 1,000,000 electromagnetic traces, and the Pearson correlation coefficient for the correct key guess was not significant.
At present, of the attacks targeting polynomial multiplication in Dilithium, the attack of Chen et al. (2021) performs best. However, because of the pre-charging mechanism of CMOS circuits on the ARM platform, it uses the Hamming weight model as the power consumption model, whereas for an FPGA implementation the model must be selected according to the specific implementation and is generally the Hamming distance. In addition, due to the parallelisation of the hardware implementation, the signal-to-noise ratio is lower than that of the ARM implementation. Ma et al. (2022) attacked two different Kyber implementations: one with three multiplications in parallel and the other with a single multiplication. Their results show that attacking the implementation with three parallel multiplications requires four times as many power traces as the other one. From the work of Steffen et al. (2022), it can be seen that the FPGA implementation of Dilithium has more parallel operations than the ARM implementation, such as the keccak operation and the parallel computation of multiple key coefficients, and that it takes more time to attack, so a complete analysis is difficult. The fast two-stage scheme of Chen et al. (2021) reduces the attack time, but achieves its best key recovery only when the number of power traces is 63 times that of the conservative CPA scheme. Since recovering a single key coefficient of cs_1 in the attack of Steffen et al. (2022) already used 1,000,000 electromagnetic traces, the dataset required by the fast two-stage scheme would grow very significantly, placing high demands on computer memory. Hence, it is not practical to trade more traces for less time. Among the mentioned works, only Steffen et al. (2022) attacked an FPGA-based implementation of Dilithium, and they did not analyse more than one key coefficient of cs_1.
In terms of attack methods, profiled attacks demand more of the adversary, as they require access to an identical device during both the modelling phase and the attack phase, with complete control over the device during the modelling phase. Furthermore, in the Dilithium implementation, NTT operations can be pre-computed and therefore cannot be subjected to profiled attacks. For profiled attacks on polynomial multiplication, the large key guessing space requires a substantial amount of data for modelling, and the lower signal-to-noise ratio of the FPGA implementation makes profiled attacks challenging. Therefore, this paper focuses only on non-profiled attacks.

Contributions
The contributions of this paper are summarised as follows:
• We provide a more comprehensive and feasible analysis using power leakages from a hardware implementation of Dilithium. We analyse its characteristics and then attack it with the CPA method. This analysis demonstrates that partial key coefficients can be recovered with a minimum of 70,000 power traces.
• We precisely extract Points of Interest (PoIs) from power traces using operations that are executed in parallel and are independent of the key coefficient. This method is referred to as CPA-PoI. Compared to the CPA attack, it reduces the scale of the Pearson calculation and also decreases the number of power traces required for the attack by up to 16.67%. Its average Guessing Entropy is lower than that of the CPA method. When attacking with the same number of power traces, the number of recovered key coefficients is increased by up to 55.17%.
• We propose a better method to reduce noise, called CPA-ITR, which uses the leakage of the operations on other key coefficients processed in parallel. Compared to the CPA method, this method reduces the number of power traces used by up to 25%. At the same time, when attacking with the same number of power traces, the number of recovered key coefficients can be improved by up to 93.10%.

Organisation
The structure of this paper is as follows: Sect. 2 provides the notations used in this paper and gives a brief introduction to the Dilithium v3.1 signature scheme. It also introduces the target FPGA implementation and Correlation Power Analysis. Sect. 3 provides a detailed description of the target operation and the methods proposed in this paper. Sect. 4 introduces the experimental setup and analyses the experimental results of our attacks. Finally, conclusions are given in Sect. 5.

Notations
Let n and q be two integers, where n = 256 and q = 8380417 = 2^23 − 2^13 + 1. We use R_q to denote the polynomial ring Z_q[x]/(x^n + 1). The infinity norm ||x||_∞ denotes the maximum absolute value among all coefficients of a polynomial x; for a polynomial vector, this norm is defined as the maximum infinity norm of all polynomials in the vector. S_b denotes the set of polynomials in R_q with infinity norm at most b, while S̃_b denotes the same set but excluding coefficients with value −b. Additionally, the set of polynomials in R_q with exactly τ nonzero coefficients and infinity norm equal to 1 is denoted as B_τ. We use bold lowercase letters to denote vectors (e.g. v), and bold uppercase letters to denote matrices (e.g. A). Polynomials in the NTT domain are denoted with a hat (e.g. ĉ = NTT(c)). This notation is transitive, so ŝ denotes each polynomial in s being individually transformed into the NTT domain.

Algorithm 1 Key Generation

CRYSTALS-Dilithium
Dilithium consists of three algorithms: key generation, signature generation, and signature verification. Because signature verification is unrelated to the attacks in this paper, it is not introduced further here.

Key Generation
The key generation process generates a private key for signature generation and a public key for verification, as shown in Algo. 1. From this, it can be seen that finding the private key from the public key is essentially equivalent to solving the M-LWE problem. Additionally, once an attacker obtains either s_1 or s_2, they can derive the other value directly, since A and t are public values. The Power2Round function is used to split the M-LWE problem instance t into high and low bits in order to compress the size of the public key.

Signature Generation
The signature generation process is shown in Algo. 2. It begins by recomputing A and hashing the message M along with the hash value of the public key tr. Each iteration of the loop starts by generating the noise y ∈ S^ℓ_(γ_1 − 1) using the ExpandMask function. The product w = Ay is then compressed to w_1 using the HighBits function. The hint h allows the verifier to recompute w_1. The hash function H instantiates the random oracle required in the proof. It returns a sparse ternary polynomial c ∈ B_60, which has a Hamming weight of 60 and all non-zero coefficients equal to +1 or −1. The Decompose function returns both the HighBits and the LowBits of its input. Finally, a check determines whether the current signature is rejected; if so, it is recomputed. Otherwise, the signature σ = (c, z, h) is output.

Algorithm 2 Signature Generation
Dilithium Parameters
The first submission (Ducas et al. 2017) to the PQC competition underwent several parameter modifications during the NIST PQC standardisation process. The current version can be found in Bai et al. (2021). Compared to the previous submission (Ducas et al. 2019), the main adjustments were made to the dimension parameters k and ℓ in order to better comply with NIST's security levels. The parameters of the current version 3.1 are listed in Table 1.

Target FPGA implementation of Dilithium
Our attacks are conducted on the hardware implementation of Dilithium by Beckwith et al. (2021), an FPGA implementation known for its high speed. The signature generation is shown in Fig. 1 and is divided into a pre-computation phase and a rejection loop phase. During the pre-computation phase, the key is decoded and transformed into the NTT domain. At the same time, w and y, which are required in the rejection loop (Lines 5 to 22 in Algo. 2), are computed in advance, and the results are provided directly to the rejection loop to generate the signature. The calculation in the rejection loop is divided into a two-stage pipeline, Stage-0 and Stage-1. Stage-0 prepares for the next Stage-1 calculation. If the generated signature σ passes the checks in Stage-1, it is output as the signature without performing the calculations within the red dashed line. If it fails, the loop continues until a valid signature is generated.
In the implementation, polynomial multiplication (including NTT), addition, and subtraction are implemented through polynomial arithmetic units, which use four butterfly units to process four coefficients in parallel for all operations. For multiplication, the computed result must be reduced modulo q. The hardware implementation uses Barrett reduction, which can be implemented using only shift and addition operations.
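As a rough illustration of the reduction step, the following Python sketch shows a textbook Barrett reduction modulo q. It is not the cited hardware design: the precision K and the name barrett_reduce are assumptions made for this example, and the multiplication by the precomputed constant M stands in for the shift-and-add network that the special form of q permits in hardware.

```python
Q = 8380417          # Dilithium modulus, 2**23 - 2**13 + 1
K = 46               # assumed precision: inputs are products of two values below 2**23
M = (1 << K) // Q    # precomputed floor(2**K / Q)


def barrett_reduce(x: int) -> int:
    """Return x mod Q for 0 <= x < 2**K with one multiply, one shift, one correction."""
    t = (x * M) >> K     # estimate of floor(x / Q): never too large, at most 1 too small
    r = x - t * Q        # therefore r lies in [0, 2Q)
    if r >= Q:           # a single conditional subtraction completes the reduction
        r -= Q
    return r
```

A single correction suffices here because the quotient estimate is off by at most one for inputs below 2**K.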
Consequently, both the rejection loop and the polynomial arithmetic units exhibit a high degree of parallelism. When the implementation of polynomial multiplication is analysed, this parallelism produces a large amount of algorithmic noise during side-channel analysis, which affects the attack results.

Correlation Power Analysis
Side-channel analysis utilises the power consumption or electromagnetic emanation generated by the execution of a cryptographic algorithm on a device in order to extract sensitive information. Correlation power analysis (CPA) is one of the methods used in side-channel analysis. It is essentially an improvement of Differential Power Analysis (DPA). In practical attacks, the classical CPA typically involves five steps. The index of the maximum value in the resulting correlation coefficient matrix R corresponds to the leakage point and the key used in the targeted operation.
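The final correlation step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the exact procedure of the paper: the function names cpa_correlation and best_guess are hypothetical, the hypothetical power values are assumed to be precomputed for every key guess, and the maximum is taken over absolute correlations.

```python
import numpy as np


def cpa_correlation(hyp: np.ndarray, traces: np.ndarray) -> np.ndarray:
    """Pearson correlation between every key hypothesis and every sample point.

    hyp:    (n_traces, n_keys) hypothetical power values, one column per key guess
    traces: (n_traces, n_samples) measured power traces
    returns the matrix R with shape (n_keys, n_samples)
    """
    h = hyp - hyp.mean(axis=0)
    t = traces - traces.mean(axis=0)
    num = h.T @ t
    den = np.outer(np.sqrt((h ** 2).sum(axis=0)), np.sqrt((t ** 2).sum(axis=0)))
    # Constant columns have zero variance; report a correlation of 0 for them.
    return num / np.where(den == 0, np.inf, den)


def best_guess(R: np.ndarray) -> tuple:
    """Return (key index, sample index) of the largest absolute correlation in R."""
    key_idx, sample_idx = np.unravel_index(np.argmax(np.abs(R)), R.shape)
    return int(key_idx), int(sample_idx)
```

The cost of building R grows with the product of the number of traces, sample points, and key guesses, which is why the later sections try to shrink the number of sample points.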

Vulnerable region
In the process of Dilithium signature generation, polynomial multiplications are computed many times. The calculations of cs_1 and cs_2 involve the public variable c, while s_1 and s_2 are both parts of the private key, making them suitable targets for CPA. This paper focuses on attacking the hardware implementation of cs_1. In the context of Dilithium, c is a polynomial with n = 256 coefficients, each ranging from −1 to 1, and s_1 is a polynomial vector consisting of ℓ polynomials, with a total of ℓ × n coefficients, each ranging from −η to η. To speed up polynomial multiplication, Dilithium uses the NTT. Before multiplication, the polynomials are mapped to the NTT domain and then multiplied point-wise, as shown in Eq. 2.
cs_1 = NTT^(−1)(ĉ • ŝ_1) = NTT^(−1)(NTT(c) • NTT(s_1))    (2)
In the FPGA implementation (Beckwith et al. 2021), the polynomial arithmetic unit is responsible for polynomial multiplication (including NTT), addition, and subtraction. The design utilises a 2 × 2 butterfly structure, which is applied to all polynomial arithmetic operations, enabling parallel processing of four coefficients. Barrett reduction is used for multiplication, which involves fewer multiplications than Montgomery reduction.
To avoid leaking the private key, Dilithium employs rejection sampling (Lines 5 to 22 in Algo. 2), with an expected 4.25 loop repetitions. The parallelised pipeline design accelerates signature generation. However, it also introduces additional noise during side-channel analysis, because operations such as y = ExpandMask(ρ′, κ), which involves the keccak function, are executed simultaneously with the calculation of (ĉ • ŝ_1). In Fig. 2, different colours indicate the different states during the execution of keccak. This noise can affect the effectiveness and efficiency of the attack.
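The equivalence between point-wise multiplication in the NTT domain and multiplication in the ring can be checked numerically. The sketch below is an illustration only: it uses a naive quadratic-time transform instead of butterflies, its output order differs from the bit-reversed order of real implementations, and the function names are hypothetical. ZETA = 1753 is assumed to be the 512-th root of unity modulo q used by Dilithium; any zeta with zeta^n ≡ −1 (mod q) works.

```python
Q, N, ZETA = 8380417, 256, 1753   # modulus, degree, assumed 2N-th root of unity


def ntt(a, q=Q, zeta=ZETA):
    """Naive negacyclic NTT: evaluate a(x) at the odd powers of zeta."""
    n = len(a)
    return [sum(a[j] * pow(zeta, (2 * k + 1) * j, q) for j in range(n)) % q
            for k in range(n)]


def intt(a_hat, q=Q, zeta=ZETA):
    """Inverse of ntt(); requires Python 3.8+ for modular inverses via pow()."""
    n = len(a_hat)
    n_inv, zeta_inv = pow(n, -1, q), pow(zeta, -1, q)
    return [n_inv * sum(a_hat[k] * pow(zeta_inv, (2 * k + 1) * j, q)
                        for k in range(n)) % q
            for j in range(n)]


def mul_ntt(c, s, q=Q, zeta=ZETA):
    """c * s in Z_q[x]/(x^n + 1) via point-wise multiplication in the NTT domain."""
    c_hat, s_hat = ntt(c, q, zeta), ntt(s, q, zeta)
    return intt([x * y % q for x, y in zip(c_hat, s_hat)], q, zeta)


def mul_schoolbook(c, s, q=Q):
    """Reference product in Z_q[x]/(x^n + 1), using x^n = -1."""
    n = len(c)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                out[i + j] = (out[i + j] + c[i] * s[j]) % q
            else:
                out[i + j - n] = (out[i + j - n] - c[i] * s[j]) % q
    return out
```

For a quick check, a toy ring such as q = 17, n = 8, zeta = 3 keeps the numbers small while exercising the same identity.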

CPA on polynomial multiplication
In the C reference implementation of Dilithium, the NTT operation omits the reduction operation to reduce the computational load. After 8 levels of recursive butterfly calculations, a 256-dimensional polynomial yields coefficients in the NTT domain in the range [−η − 8(q − 1), η + 8(q − 1)]. At security level 2, where η = 2 and q = 2^23 − 2^13 + 1, the size of the key guessing space in a CPA is close to 2^27. This leads to a relatively large computational complexity when performing CPA on the Dilithium implementation.

Fig. 1 The architecture of the target FPGA implementation of Dilithium

However, in the target implementation, Barrett reduction is applied after the butterfly multiplication, which limits the range of polynomial coefficients in the NTT domain to [0, q). Therefore, in the attack on Dilithium, the key guessing space used by the attacker is [0, q).
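The two guessing-space sizes can be reproduced with a few lines of arithmetic. The sketch below simply evaluates the bounds stated above; the function name guessing_space_sizes is hypothetical.

```python
import math

Q = 8380417      # Dilithium modulus
ETA = 2          # secret-key coefficient bound at security level 2
LEVELS = 8       # butterfly levels of the 256-point NTT


def guessing_space_sizes(q=Q, eta=ETA, levels=LEVELS):
    """Return (unreduced, reduced) candidate counts for one NTT-domain coefficient."""
    bound = eta + levels * (q - 1)   # coefficients lie in [-bound, bound] without reduction
    unreduced = 2 * bound + 1
    reduced = q                      # coefficients lie in [0, q) after Barrett reduction
    return unreduced, reduced


unreduced, reduced = guessing_space_sizes()
print(f"unreduced: about 2^{math.log2(unreduced):.2f} candidates")
print(f"reduced:   about 2^{math.log2(reduced):.2f} candidates")
```

The reduction therefore shrinks the guessing space by a factor of roughly 16, from about 27 bits to about 23 bits.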
For software implementations, it is difficult to analyse the continuous numerical processing on the bus, so the Hamming weight of the targeted variable is often used as a leakage model (Chen et al. 2021; Qiao et al. 2023). For hardware implementations, the previous reference value of the operation can be determined by examining registers, hence the Hamming distance model is commonly used to model the power consumption generated by the targeted operation (Steffen et al. 2022). The actual power consumption L can be expressed in terms of the Hamming distance between the register values before and after the register is assigned a value, as shown in Eq. 3:
L = α · H_i + N    (3)
where H_i is the Hamming distance between the register values before and after the assignment for coefficient index i, α denotes the scaling factor, and N refers to the noise.
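The Hamming distance leakage model can be simulated in a few lines. This is a sketch under assumptions: the Gaussian noise, the default parameter values, and the function names are chosen for illustration and are not taken from the target implementation.

```python
import random


def hamming_weight(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def hamming_distance(before: int, after: int) -> int:
    """Number of register bits that toggle when 'before' is overwritten by 'after'."""
    return hamming_weight(before ^ after)


def simulated_leakage(before, after, alpha=1.0, sigma=1.0, rng=random):
    """One leakage sample: alpha * HD(before, after) plus assumed Gaussian noise."""
    return alpha * hamming_distance(before, after) + rng.gauss(0.0, sigma)
```

With sigma set to 0 the sample equals the scaled Hamming distance exactly, which is convenient for checking a hypothesis generator before adding noise.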
From Sect. 2.4, we know that the number of points in a power trace directly affects the cost of calculating the correlation coefficient matrix. In a hardware implementation, the value of a register can change only once per clock cycle. Therefore, assuming no clock delay, changes in register values are aligned with the clock cycles of the power trace. Specifically, the number of sample points within one clock cycle is the sampling rate divided by the clock frequency. During an attack, the power trace can be sliced to locate leakage points more accurately: the location of a leakage point is obtained by adding the current sample point index to the product of the number of sample points per clock cycle and the number of clock cycles until the register value changes. These leakage points can then be used in place of the entire segment of the power trace when constructing the correlation coefficient matrix.
The above description assumes ideal conditions. In practice, the data often arrive with some delay. Therefore, each selected leakage point is extended by ±15 sample points around its nominal position to ensure that the actual leakage falls within the sample points selected for the attack.
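The slicing rule above can be sketched as follows. The function names and the example start index are hypothetical; the ±15 margin is the value stated in the text.

```python
def samples_per_cycle(sampling_rate_hz: float, clock_hz: float) -> float:
    """Number of oscilloscope samples that fall into one clock cycle."""
    return sampling_rate_hz / clock_hz


def leakage_window(start_sample: int, cycle_offset: int, spc: float, margin: int = 15):
    """Sample indices kept for the attack around one expected register update.

    start_sample: index of the current sample point (hypothetical reference)
    cycle_offset: number of clock cycles until the targeted register changes
    spc:          samples per clock cycle
    margin:       tolerance for delay, in samples on each side
    """
    centre = start_sample + round(cycle_offset * spc)
    return range(centre - margin, centre + margin + 1)


# Example with a 250 MS/s sampling rate and a 4 MHz clock.
spc = samples_per_cycle(250e6, 4e6)
window = leakage_window(start_sample=1000, cycle_offset=2, spc=spc)
print(spc, len(window))
```

A margin of 15 on each side keeps 31 sample points per targeted register update.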

CPA-PoI on polynomial multiplication
In the attack on cs_1 by Steffen et al. (2022), a million electromagnetic traces were used. With a million power traces, the computational scale of the Pearson correlation is still enormous, even with the aforementioned CPA method. In CPA, if PoIs can be accurately selected, this not only reduces the computational scale of the correlation coefficient matrix but also enables a more direct determination of the location of information leakage. Therefore, we first select points by locating the leakage point and then proceed with the CPA method. We refer to this attack method as CPA-with-PoI (abbr. CPA-PoI).
In side-channel analysis, any points in the power trace that exhibit significant differences are considered PoIs. In profiled attacks, the number of selected PoIs directly affects the size of the templates, so the PoIs must be identified in order to reduce the template size. Similarly, in CPA, accurately selected PoIs reduce the computational scale of the correlation coefficient matrix and locate the leakage more directly. In the same clock cycle, the timing of value changes for different registers may overlap. This overlap can result in partial repetition or overlap of PoIs in the power traces, so PoIs of the target operation may coincide with PoIs of other operations. By observing the implemented architecture, we find that during the computation of (ĉ • ŝ_1)[i], the ĉ coefficients needed for subsequent calculations are fetched in advance, which generates power consumption. Additionally, due to the 2 × 2 butterfly structure, four ĉ coefficients with consecutive indices can be processed in parallel. The results of calculating the Pearson correlation between ĉ, (ĉ • ŝ_1)[i], and the power traces are shown in Fig. 3. It can be seen that the PoIs of the challenge ĉ coincide with the PoIs of the key in the power trace, and that the correlations with ĉ are high. Since ĉ is known, it can be used to extract PoIs.
Based on this idea, after applying the power trace segmentation method of the aforementioned CPA, we calculate the Pearson correlation coefficient between the Hamming distance of the ĉ coefficients and the power traces. We then select the few sample points whose correlation coefficients have the largest absolute values and perform CPA on those points only. In this way, PoIs can be extracted more accurately.
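The PoI selection step can be sketched as follows. This is an illustration under assumptions: the Hamming distances of the known ĉ coefficients are assumed to be precomputed, the function names are hypothetical, and the default of three points mirrors the number used later in the experiments.

```python
import numpy as np


def column_corr(x: np.ndarray, traces: np.ndarray) -> np.ndarray:
    """Pearson correlation of the vector x with every sample point of the traces."""
    xc = x - x.mean()
    tc = traces - traces.mean(axis=0)
    den = np.sqrt((xc ** 2).sum() * (tc ** 2).sum(axis=0))
    return (xc @ tc) / np.where(den == 0, np.inf, den)


def select_pois(hd_c_hat: np.ndarray, traces: np.ndarray, n_pois: int = 3) -> np.ndarray:
    """Indices of the n_pois sample points most correlated with the known leakage.

    hd_c_hat: (n_traces,) Hamming distances of the known c_hat coefficients
    traces:   (n_traces, n_samples) sliced power traces
    """
    r = np.abs(column_corr(hd_c_hat, traces))
    return np.sort(np.argsort(r)[::-1][:n_pois])
```

The subsequent key-recovery CPA then runs on traces[:, select_pois(...)] only, so no secret information is needed to choose the points.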

CPA-ITR on polynomial multiplication
In CPA, the Hamming distance of a coefficient's target operation is used as the assumed power consumption. In the FPGA implementation of Dilithium, polynomial multiplication uses 2 × 2 butterfly structures operating simultaneously, as shown in Fig. 4. The values within the square brackets in the figure denote the coefficient indices of ĉ and ŝ_1. A good power consumption model helps an attacker accurately identify, extract, or infer sensitive information from the target device. According to Eq. 3, for the actual power consumption L of a given power trace, the smaller the noise N, the closer the assumed power consumption of (ĉ • ŝ_1)[i] is to the actual power consumption, and the better the attack performs. The assumed power consumption of a single coefficient may not be sufficiently close to the actual power consumption.
At any given moment, the coefficients being processed have indices i, i + 1, i + 2, and i + 3. If only the operation with coefficient index i is attacked, the other coefficients are treated as part of the noise N, resulting in higher noise levels. If instead all four coefficients are attacked jointly, the size of the key guessing space grows from q to q^4, making the attack more difficult.
In practical attacks, parallel computations occurring at the same moment are usually attacked in the order of their indices. This means that when attacking the operation with index i + 1, the value related to the operation with index i is already known. Therefore, after recovering the key coefficient with index i using the CPA method (using Eq. 4 as the assumed power consumption), Eq. 5 can be used as the assumed power consumption to recover the key coefficient with index i + 1, since Eq. 4 is then known. This process can be repeated sequentially, modifying the assumed power consumption model (Eqs. 6 and 7), to better model the leakage and reduce the noise level. In other words, because the target operation is parallel, we store the information obtained from previous attacks and combine it with the assumed power consumption model for subsequent attacks in order to reduce noise. We refer to this method of dynamically overlaying the assumed power consumption model in CPA attacks as CPA-with-Iteration (abbr. CPA-ITR).
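The iterative overlay can be sketched as follows. This is a simplified illustration, not the exact models of Eqs. 4 to 7: it assumes that each register holds zero before the update, so the Hamming distance reduces to the Hamming weight of the product, it assumes a single leakage sample per trace in which all parallel lanes leak together, and it enumerates the whole guessing space in memory, which is only practical for a toy modulus. All function names are hypothetical.

```python
import numpy as np


def hw(x, bits=32):
    """Element-wise Hamming weight of non-negative integers below 2**bits."""
    x = np.array(x, dtype=np.int64)          # copy, so the caller's data is untouched
    count = np.zeros(x.shape, dtype=np.int64)
    for _ in range(bits):
        count += x & 1
        x >>= 1
    return count


def corr_columns(hyp, leak):
    """Pearson correlation of every column of hyp with the vector leak."""
    h = hyp - hyp.mean(axis=0)
    l = leak - leak.mean()
    den = np.sqrt((h ** 2).sum(axis=0) * (l ** 2).sum())
    return (l @ h) / np.where(den == 0, np.inf, den)


def cpa_itr(c_hat, leak, q):
    """Recover the key coefficients of one parallel group, lane by lane.

    c_hat: (n_traces, lanes) known NTT-domain challenge coefficients
    leak:  (n_traces,) power samples at the point where all lanes leak together
    """
    c_hat = np.asarray(c_hat, dtype=np.int64)
    leak = np.asarray(leak, dtype=float)
    guesses = np.arange(q, dtype=np.int64)
    known = np.zeros(len(leak))               # model contribution of recovered lanes
    recovered = []
    for lane in range(c_hat.shape[1]):
        products = (c_hat[:, lane, None] * guesses[None, :]) % q
        hyp = hw(products) + known[:, None]   # overlay known lanes on every hypothesis
        best = int(np.argmax(corr_columns(hyp, leak)))
        recovered.append(best)
        known += hw((c_hat[:, lane] * best) % q)
    return recovered
```

For the real modulus the hypothesis matrix would have to be processed in chunks of key guesses. Note also that an error in an early lane propagates into the model used for the later lanes.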

Experimental results
Setup. The power traces of the polynomial multiplication in Dilithium are collected on the SAKURA-X development board for side-channel evaluation. The setup is shown in Fig. 5 and consists of a ROHDE & SCHWARZ PA303 30 dB pre-amplifier, a PicoScope 3206D oscilloscope, and a SAKURA-X FPGA development board with a Xilinx Kintex-7 XC7K160T chip. The chip runs at 4 MHz. The oscilloscope can sample two channels simultaneously with a 4 ns interval (i.e. a sampling rate of 250 MS/s). One channel is used for collecting traces of the instantaneous power consumption, and the other is used to trigger the sampling.
Target FPGA Implementation. Our target implementation is the FPGA implementation of Dilithium v3.1 by Beckwith et al. (2021). For the purpose of analysis, the security level is set to 2 (increasing the security level increases the number of key coefficients but does not make the attack more difficult). The implementation is written in both Verilog and VHDL. It is a delay-optimised FPGA design that covers the three security levels of Dilithium. We programme the signature generation of Dilithium onto the SAKURA-X target board and send the message and its length from the PC to the target board for signing. We use the ISE 14.7 Design Suite to programme the Kintex-7 XC7K160T chip.
Some Settings. The power traces collected for the polynomial multiplication cs_1 are shown in Fig. 6, where a clear periodicity can be observed. Using these traces, an attacker can analyse the key coefficients sequentially. Due to the parallel execution of the keccak function during the polynomial multiplication, as shown in Fig. 2, each state generates different algorithmic noise, which affects the attacks differently. In order to evaluate and compare the results of the attack experiments, we obtained the state of the keccak function for each clock cycle through timing simulation and conducted separate attack experiments accordingly. Fig. 7 shows the correlation trend of the 16,761 (i.e. ⌈8380417/500⌉) most likely key guesses under the six states of the keccak function. The correct key guess is marked in red and stands out as more traces are used. From the graph, it can be observed that for the shake_process and process_last_block states a significant correlation only appears after 700,000 power traces, which leads to a longer attack time. Therefore, not all key coefficients were targeted in our experiments; only a selection was chosen based on the target operations present in each keccak state. For the reset and finalization_SHAKE states, there are a total of 8 and 24 parallel target operations, respectively, so all coefficients were attacked. For the shake_output_wait and idle states, which have a larger number of parallel target operations, 64 were selected as a reference. Due to the excessively long attack time, only 32 were chosen as a reference for the shake_process and process_last_block states.

Results of the CPA method
First, the CPA method is executed, using the power trace segmentation method for the attack. The calculation scale at this stage is N × P × q, where N denotes the number of power traces used and P is the number of sampling points selected per power trace, which is set to 31. The experimental results of attacking with different numbers of power traces are shown in Table 2. The threshold is the critical number of power traces required to fully recover the key coefficients, and the average Guessing Entropy (GE) is the mean of the initial Guessing Entropy over the key coefficients that were not recovered correctly.

Table 2 The results of key recovery using different methods in different keccak states

For the shake_output_wait state, recovering 64 key coefficients requires 120,000 power traces. For the finalization_SHAKE state, recovering 24 key coefficients requires 80,000 power traces. For the reset state, recovering the group of 8 key coefficients requires 70,000 power traces. For the idle state, recovering 64 key coefficients requires 110,000 power traces. Lastly, for the shake_process and process_last_block states, recovering 32 key coefficients requires 980,000 and 880,000 power traces, respectively. The number of power traces required to recover the key coefficients is consistent with the trend shown in Fig. 7, with a drastic increase for the shake_process and process_last_block states, indicating a higher level of algorithmic noise during these states. By contrast, Chen et al. (2021) used the CPA method on the ARM implementation of Dilithium to recover the private key with only 157 power traces. This clearly demonstrates a significant difference in difficulty between attacking software and hardware implementations.
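One plausible way to compute the average GE is sketched below. It reflects an interpretation rather than the exact procedure of the paper: the GE of a coefficient is taken as the rank of the correct key among all guesses sorted by decreasing score, with rank 0 meaning the key was recovered, and the function names are hypothetical.

```python
import numpy as np


def key_rank(scores, correct_key: int) -> int:
    """Number of key guesses that score strictly higher than the correct key."""
    scores = np.asarray(scores)
    return int(np.sum(scores > scores[correct_key]))


def average_ge(score_rows, correct_keys) -> float:
    """Mean rank over the coefficients that were not recovered (rank > 0)."""
    ranks = [key_rank(s, k) for s, k in zip(score_rows, correct_keys)]
    failed = [r for r in ranks if r > 0]
    return sum(failed) / len(failed) if failed else 0.0
```

Here each row of score_rows would hold, for one key coefficient, the largest absolute correlation obtained for every key guess.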

Results of the CPA-PoI method
Building on the CPA method, the CPA-PoI method was adopted to locate the PoIs more accurately. When selecting the PoIs, we used the Hamming distance of ĉ and selected 3 points. The calculation scale in this case is N × P × q, where P is 3. Compared to the CPA method, the CPA-PoI method therefore has a reduced calculation scale.
According to the results in Table 3, compared to the CPA method, the CPA-PoI method reduces the number of power traces required to fully recover the target key coefficients by between 0 and 16.67%. Even when the number of power traces is not reduced, the average GE of the CPA-PoI method is generally lower than that of the CPA method, which indicates that the benefit of the CPA-PoI method would become visible with a finer granularity in the number of power traces used in the attack.
In cases where the CPA method has not fully recovered the target key coefficients, the CPA-PoI method significantly increases the number of recovered key coefficients when attacking with the same number of power traces, with an improvement between 10% and 55.17%. This indicates that, with the same number of power traces, the CPA-PoI method recovers more key coefficients and achieves better results than the CPA method.

Results of the CPA-ITR method
Based on the CPA method, we employed the CPA-ITR method to effectively leverage parallelization for information leakage.By comparing the experimental results with the CPA and CPA-PoI methods, we derived Table 3.
From Table 3, it can be seen that compared to the CPA method, the CPA-ITR method reduces the number of power traces required to fully recover the target key coefficients by 0 to 25%, and compared to the CPA-PoI method, it reduces the number by 0 to 14.28%. Additionally, the CPA-ITR method significantly decreases the average GE, which is close to that of the CPA-PoI method. When attacking with the same number of power traces, the CPA-ITR method increases the number of recovered key coefficients by 15% to 93.10% compared to the CPA method, and by 0 to 33.33% compared to the CPA-PoI method. Overall, the CPA-ITR method demonstrates a significant improvement over the CPA method and offers certain advancements over the CPA-PoI method as well.

Table 3 Comparison of attacks using different methods in different keccak states
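The percentages quoted above are relative changes with respect to the baseline method. A minimal sketch of the arithmetic, assuming this is how the ranges are computed, is given below; the function names are hypothetical, and any counts passed to them in examples are illustrative values rather than entries of Table 3.

```python
def trace_reduction(baseline_traces, improved_traces):
    """Relative reduction (in percent) of the traces needed for full recovery."""
    return 100.0 * (baseline_traces - improved_traces) / baseline_traces

def coefficient_gain(baseline_recovered, improved_recovered):
    """Relative increase (in percent) of coefficients recovered with equal traces."""
    return 100.0 * (improved_recovered - baseline_recovered) / baseline_recovered
```

For instance, under these definitions a hypothetical drop from 60,000 to 50,000 traces corresponds to a reduction of one sixth, and a hypothetical rise from 20 to 22 recovered coefficients corresponds to a gain of one tenth.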
It is important to note that in the CPA-ITR attack, if the first attacked key coefficient is incorrectly recovered, the error may interfere with the attacks on the other three key coefficients processed in parallel. However, during our experiments, we observed that the first coefficient was generally easier to recover in our implementation.
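The dependence on the first coefficient can be illustrated with a toy simulation. The sketch below is one possible reading of an iterative attack on parallel leakage and is not the exact CPA-ITR procedure: it assumes four multipliers leak the sum of the Hamming weights of their results at a single sample point, uses a toy modulus of 257 instead of the Dilithium modulus, and folds each recovered coefficient into the leakage hypothesis for the next lane. All names (`cpa_itr`, `demo_itr`, `Q`) are hypothetical.

```python
import numpy as np

Q = 257  # toy modulus, chosen only to keep the guess space small

def hw(x, bits=16):
    """Hamming weight of each non-negative integer in x."""
    x = np.asarray(x, dtype=np.int64)
    return sum((x >> b) & 1 for b in range(bits))

def pearson_cols(H, t):
    """Pearson correlation of each column of H with the vector t."""
    Hc = H - H.mean(axis=0)
    tc = t - t.mean()
    den = np.sqrt((Hc ** 2).sum(axis=0) * (tc ** 2).sum())
    return np.divide(Hc.T @ tc, den, out=np.zeros(H.shape[1]), where=den > 0)

def cpa_itr(leak, known, q=Q):
    """leak: (n,) samples at one PoI; known: (n, lanes) public multiplicands."""
    guesses = np.arange(1, q)
    recovered = []
    explained = np.zeros(len(leak))  # leakage of lanes recovered so far
    for lane in range(known.shape[1]):
        H = hw((known[:, lane, None] * guesses[None, :]) % q).astype(float)
        H += explained[:, None]      # reuse what earlier iterations recovered
        best = guesses[np.argmax(pearson_cols(H, leak))]
        recovered.append(int(best))
        explained += hw((known[:, lane] * best) % q)
    return recovered

def demo_itr(seed=1, n=5000, secrets=(3, 77, 150, 201)):
    """Simulate four parallel lanes leaking jointly, then run the attack."""
    rng = np.random.default_rng(seed)
    known = rng.integers(1, Q, size=(n, len(secrets)))
    leak = rng.normal(0.0, 1.0, size=n)
    for lane, s in enumerate(secrets):
        leak = leak + hw((known[:, lane] * s) % Q)
    return cpa_itr(leak, known), list(secrets)
```

In this model, each correctly recovered lane removes part of the algorithmic noise seen by the next attack, whereas a wrong first guess adds an unrelated term to every later hypothesis, which matches the interference described above.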

Discussion
The proposed methods are not limited to implementations of Dilithium and can also be applied to other algorithms. The CPA method uses no parallel information and, apart from the difference in the power consumption model, is generally applicable to microprocessor implementations as well. CPA-PoI exploits the parallel leakage of operations on known information that are unrelated to the key coefficients; it is not applicable to a microprocessor implementation that has no similar leakage. CPA-ITR exploits a leakage in which multiple key coefficients are processed in parallel, and it can be applied to implementations that use similar parallel structures for computation. In summary, our methods can be applied to attack implementations of other algorithms.

Conclusion
This paper presents a practical and more comprehensive attack on the FPGA implementation of Dilithium using power leakages. By fully utilizing the characteristics of the FPGA implementation, we have improved CPA with two methods, namely CPA-PoI and CPA-ITR, both of which demonstrate better performance than CPA in our experiments. Our work demonstrates the feasibility of side-channel attacks on polynomial multiplication operations on highly parallelised hardware. It suggests that the FPGA implementation of CRYSTALS-Dilithium is more vulnerable to side-channel analysis than previously thought. In future work, we plan to explore a more in-depth analysis of masked implementations using high-order CPA-PoI and CPA-ITR. However, due to the large key guessing space and the impact of algorithmic noise, attacks on both unmasked and masked implementations take a long time. Therefore, we also aim to find better attack strategies that recover the private key in less time.
Finally, we use • to denote pointwise multiplication. HD(a • b)[i] denotes the value of the Hamming distance after calculating the i-th coefficient of the register storing the computation result of a • b.

• Select an appropriate intermediate value as the attack position. The calculation function of this intermediate value takes the key (or a fixed value from which the key can be derived) and known variables as inputs.
• Collect the power traces of the targeted operation. Execute the signing process n times and store the power traces of each collection, with each trace consisting of m data points, in the matrix T_{n×m}.
• Compute the intermediate value matrix V_{n×k} for the guessed key. Based on the range of key coefficient values, the size k of the key guessing space can be determined. Using the assumed key and known variables, the intermediate values of the targeted operation can be calculated.
• Using an appropriate power consumption model, map the values of the intermediate value matrix V_{n×k} to the assumed power consumption matrix H_{n×k} one-to-one.
• Compute the correlation coefficients between the assumed power consumption matrix H_{n×k} and the actual power consumption matrix T_{n×m} for each column, and record them in the correlation matrix R_{k×m}. The correlation coefficients can be calculated using Pearson's correlation coefficient formula, as shown in Eq. 1.
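The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the attack code used in the experiments: the intermediate value is a modular product of a known value and the guessed key, the register is assumed to be cleared beforehand so that the Hamming distance reduces to a Hamming weight, the modulus is a toy value of 257, and the traces are simulated. The names `cpa`, `hamming_weight`, and `demo` are hypothetical.

```python
import numpy as np

def hamming_weight(x, bits=32):
    """Hamming weight of each non-negative integer in x."""
    x = np.asarray(x, dtype=np.int64)
    return sum((x >> b) & 1 for b in range(bits))

def cpa(traces, known, guesses, q):
    """Return R (k x m): correlation of every key guess with every sample.

    traces: T (n x m) measured power; known: (n,) public inputs;
    guesses: (k,) candidate key values; q: modulus of the multiplication.
    """
    # Intermediate value matrix V (n x k) for all guesses.
    V = (known[:, None].astype(np.int64) * guesses[None, :]) % q
    # Hypothetical power matrix H (n x k) under the assumed leakage model.
    H = hamming_weight(V).astype(float)
    Hc = H - H.mean(axis=0)
    Tc = traces - traces.mean(axis=0)
    num = Hc.T @ Tc
    den = (np.sqrt((Hc ** 2).sum(axis=0))[:, None]
           * np.sqrt((Tc ** 2).sum(axis=0))[None, :])
    # Pearson correlation; constant columns get correlation 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def demo(seed=0, q=257, n=2000, m=20, secret=123, leak_at=7):
    """Simulate noisy traces leaking at one sample and recover the key."""
    rng = np.random.default_rng(seed)
    known = rng.integers(1, q, size=n)
    traces = rng.normal(0.0, 1.0, size=(n, m))
    traces[:, leak_at] += hamming_weight((known * secret) % q)
    guesses = np.arange(1, q)
    R = cpa(traces, known, guesses, q)
    k_idx, m_idx = np.unravel_index(np.argmax(np.abs(R)), R.shape)
    return int(guesses[k_idx]), int(m_idx)
```

Calling `demo()` returns the best-scoring key guess together with the sample index at which its correlation peaks. With the full guessing space of the real modulus, the matrix H becomes very large, which is why the calculation scale N × P × q dominates the cost of the attack.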

Fig. 2 The overlapping of keccak and (ĉ • ŝ1) in the time domain

Fig. 3 The Pearson correlation between the hypothetical leakages related to cs_1 and the traces

Fig. 6 Power traces of FPGA implementation of Dilithium in our setting