An efficient permutation approach for SbPN-based symmetric block ciphers

It is challenging to devise lightweight cryptographic primitives efficient in both hardware and software that can provide an optimum level of security to diverse Internet of Things applications running on low-end constrained devices. Therefore, an efficient hardware design approach that requires some specific hardware resource may not be efficient if implemented in software. Substitution bit Permutation Network based ciphers such as PRESENT and GIFT are efficient, lightweight cryptographic hardware design approaches. These ciphers introduce confusion and diffusion by employing a 4 × 4 static substitution box and bit permutations. The bit-wise permutation is realised by simple rerouting, which is most cost-effective to implement in hardware, resulting in negligible power consumption. However, this method is highly resource-consuming in software, particularly for large block-sized ciphers, with each single-bit permutation requiring multiple sub-operations. This paper proposes a novel software-based design approach for permutation operation in Substitution bit Permutation Network based ciphers using a bit-banding feature. The conventional permutation using bit rotation and the proposed approach have been implemented, analysed and compared for GIFT and PRESENT ciphers on ARM Cortex-M3-based LPC1768 development platform with KEIL MDK used as an Integrated Development Environment. The real-time performance comparison between conventional and the proposed approaches in terms of memory (RAM/ROM) footprint, power, energy and execution time has been carried out using ULINKpro and ULINKplus debug adapters for various code and speed optimisation scenarios. The proposed approach substantially reduces execution time, energy and power consumption for both PRESENT and GIFT ciphers, thus demonstrating the efficiency of the proposed method for Substitution bit Permutation Network based symmetric block ciphers.


Introduction
The Internet of Things (Ashton 2009) is an ever-growing network of uniquely identifiable smart connected devices that sense, communicate and share information using heterogeneous networks. Smart IoT applications (Rejeb et al. 2022) are set to bring remarkable benefits to human lives by digitising the physical assets used day to day. However, IoT, a fragmented technology, encompasses heterogeneous devices with several limitations and challenges hindering its widespread adoption (Nazish and Banday 2018). Because of the generic constraints associated with these devices in terms of area, bandwidth, memory, power and battery life, together with financial limitations, security has often been an afterthought, leaving minimal space for the crypto implementation. This restricts the application of conventional and standardised crypto primitives for securing IoT devices. Lightweight cryptography, in contrast, attempts to design efficient primitives that mitigate most existing threats while being less resource-intensive.
Most cryptography revolves around block cipher design because block ciphers are widely used to build pseudorandom number generators, key establishment protocols, MAC, hash and encryption primitives. The target metrics considered in lightweight design are mainly area (slice or flip-flop count), memory footprint (RAM/ROM), latency, power and energy consumption. However, the heterogeneous nature of IoT devices and the trade-offs between these metrics make it quite challenging to design a 'one design fits all' lightweight block cipher. Therefore, lightweight cryptographic primitives with specific features have been designed for diverse smart IoT applications. For example, PRINCE (Borghoff et al. 2012) and Midori (Banik et al. 2015) are low-latency and low-energy block primitives, respectively. On the other hand, PRESENT (Bogdanov et al. 2007), SIMON (Beaulieu et al. 2015) and GIFT (Banik et al. 2017) are hardware-efficient primitives, with ITUBee (Karakoç et al. 2013) and SPECK (Beaulieu et al. 2015) being examples of software-friendly block ciphers.
A sub-class of SPN-based ciphers known as SbPN primitives, such as the PRESENT and GIFT block ciphers, are remarkably efficient in hardware because of their s-boxes and bit-permutations. Both use lightweight 4 × 4 s-boxes to offer confusion, whereas, for diffusion, bit-permutations involving zero gate count are used. However, in software, bit-permutations are the most inefficient operation in terms of instruction count and execution time. As such, to make SbPN-based primitives software-efficient, several methods have been used, such as table-based implementations (Heys 2020), bit-slicing (Kwan 2000) and fix-slicing (Adomnicai et al. 2020), in addition to the use of bit manipulation instructions (Lee 1989). However, even though these implementation techniques provide impressive results, they carry costs in the form of a large memory footprint or high energy and power consumption. Moreover, they are not always cost-effective, limiting their use for securing low-end embedded devices. Thus, rather than addressing the security concerns of lightweight IoT devices from a hardware-only or software-only perspective, smart IoT applications ultimately require ciphers designed from a use-case perspective with optimum efficiency in both. Efforts must therefore be made to use confusion and diffusion implementation techniques that are both hardware- and software-efficient, so as to obtain an optimum and balanced crypto design suitable for several applications.
This paper provides a novel method for implementing the bit-permutation-based diffusion layer in Substitution bit Permutation Network based ciphers using the bit-banding feature of contemporary ARM Cortex-M processors. The proposed approach has been implemented, analysed and compared using the GIFT and PRESENT ciphers.
The paper is structured as follows: the "Background" section summarises the SbPN-based PRESENT and GIFT block primitives and outlines their existing implementation techniques. This section also explains the bit-band feature available with the ARM Cortex-M processors and lists various software-efficient compiler optimisation techniques. The "Related work" section presents the literature survey of the software-efficient block cipher implementation techniques. The "Proposed work" section discusses the proposed software-efficient implementation technique for performing the permutation in the PRESENT and GIFT block primitives. The "Implementation" section explains the implementation methodology and provides a comparative analysis of the results obtained for the direct and proposed methods in terms of various performance metrics. In addition, this section reports the code and performance improvements obtained for the proposed method using seven optimisation techniques. Finally, the "Results and discussions" section provides the summarised results of the proposed technique.

SbPN ciphers
A lightweight block cipher design uses a round function iterated a specific number of times to achieve an optimum security margin. Furthermore, the design must satisfy Shannon's confusion and diffusion paradigm (Shannon 1945). Diffusion means each output bit should be influenced by each plaintext and key input bit. Confusion ensures that this dependency is complex, so that the relationship between the input and output bits is hard to reverse. Non-linear components such as s-boxes, boolean functions and non-linear arithmetic operations offer confusion in addition to a small amount of local diffusion. Linear elements such as Maximum Distance Separable (MDS) matrices, bit-permutations, circular shifts, XOR and swap operations are mainly employed to offer diffusion on a global level. The Substitution Permutation Network (SPN) is one of the most used secure block cipher construction schemes, utilising an s-box, a p-box and XOR to realise a round function. A special class of SPN ciphers is the Substitution bit Permutation Network (SbPN) based primitives. An m/n-SbPN is an n-bit block cipher with each s-box being m bits wide. These ciphers use only bit-permutations to realise the linear layer.
Permutations at the bit level find vast applications in cryptography and digital processing for faster security and multimedia operations. Bit-permutation has been used in several famous ciphers such as DES, Serpent, PRESENT, GIFT and many more. Permutations of two and six types have been employed in the hardware-oriented Serpent (Biham et al. 1998) and DES (Biryukov and Cannière 2006) ciphers, respectively. Bit-permutations can be invertible or non-invertible. Compression and expansion p-boxes (Forouzan et al. 2015) are examples of non-invertible permutations. In compression p-boxes, the bit-wise permutation is performed so that the diffused output bits are fewer in number than the input bits. As a result, several input bits are not mapped to the output. This is useful when the next round needs fewer bits than the previous one. On the other hand, in expansion p-boxes, several input bits are mapped to more than one output bit, which results in a larger number of diffused output bits than input bits. This is used in ciphers where the next round needs more bits than the previous one. The irreversible compression and expansion p-boxes are not utilised in SPN or SbPN ciphers, which instead use straight invertible p-boxes with an equal number of input and output bits. The following section details the hardware-efficient SbPN ciphers, PRESENT and GIFT, and summarises their existing implementation techniques.

PRESENT block cipher
PRESENT is one of the premier hardware-efficient lightweight block ciphers, proposed by Bogdanov et al. in 2007. It is an SbPN-based block cipher with a fixed block size of 64 bits and a key length of either 80 or 128 bits. Figure 1 depicts the encryption process of the PRESENT64/80 cipher, consisting of 31 rounds followed by a final post-key-whitening stage. Each round consists of a keyed XOR (AddRoundKey) sub-stage and keyless substitution (S-BoxLayer) and permutation (P-Layer) sub-stages.
AddRoundKey: The key scheduling algorithm takes an 80-bit shared key as the input and generates 64-bit sub-round keys using a simple round function involving circular shift, s-box and round constant addition operations. Each sub-round key is bit-wise XORed with the 64-bit input state.
S-BoxLayer: The XORed output is applied as input to 16 invertible 4 × 4 static s-boxes. Each s-box takes 4 bits (X) as input and yields the confused 4-bit output (S[X]). Apart from confusion, these s-boxes offer local diffusion. Table 1 lists all the s-box output values corresponding to the 16 inputs in hexadecimal notation.
P-Layer: The 64-bit output from the 16 s-boxes is applied as an input to the p-layer, which performs a bit-wise diffusion. The bit at index location i is moved to location P(i) as per the diffusion Table 2.
Being an SPN cipher, each sub-operation in a round function is invertible. As such, decryption is the reverse of the encryption process. It involves a static 4 × 4 inverse s-box and an inverse p-layer, realised using a 16-byte lookup table and a 64-bit bit-wise reverse diffusion, respectively. The sub-round keys generated by the key scheduling algorithm are applied in reverse order.
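The substitution layer and its inverse described above can be sketched in C as follows. This is a minimal illustration rather than the implementation evaluated in this paper: the s-box values are those of the published PRESENT specification, the function names `present_sbox_layer` and `present_inv_sbox_layer` are hypothetical, and the 64-bit state is assumed to be held in a single `uint64_t` with nibble 0 in the least significant bits.

```c
#include <stdint.h>

/* PRESENT 4x4 s-box from the published specification: S[X] for X = 0..F. */
static const uint8_t PRESENT_SBOX[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

/* Apply the s-box to each of the 16 nibbles of the 64-bit state. */
uint64_t present_sbox_layer(uint64_t state)
{
    uint64_t out = 0;
    for (int n = 0; n < 16; n++) {
        uint8_t nibble = (uint8_t)((state >> (4 * n)) & 0xF);
        out |= (uint64_t)PRESENT_SBOX[nibble] << (4 * n);
    }
    return out;
}

/* Inverse layer: derive the 16-byte inverse lookup table from the
 * forward s-box, then apply it nibble by nibble. */
uint64_t present_inv_sbox_layer(uint64_t state)
{
    uint8_t inv[16];
    for (int x = 0; x < 16; x++)
        inv[PRESENT_SBOX[x]] = (uint8_t)x;

    uint64_t out = 0;
    for (int n = 0; n < 16; n++) {
        uint8_t nibble = (uint8_t)((state >> (4 * n)) & 0xF);
        out |= (uint64_t)inv[nibble] << (4 * n);
    }
    return out;
}
```

Because the s-box is a bijection on 4-bit values, applying the inverse layer after the forward layer returns the original state.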
The following section summarises several implementation methods for performing permutation in the PRESENT block cipher:
• Direct Method: In the direct method, the bit-wise permutation of the s-box layer output is realised using the bit-rotation method. Each bit permutation requires four sub-operations comprising the generation of the mask, masking (AND), shifting and XOR.
All the methods mentioned above are software efficient; however, they have drawbacks in terms of memory footprint, instruction count and timing requirements. The direct method is a memory-efficient technique. However, it incurs a substantial overhead for ciphers with large block sizes because of the many mask, shift and XOR sub-operations required. The table-based methods are known for their high-speed execution. They have comparatively fewer instruction requirements than the narrow-table or direct approach. However, these highly memory-intensive methods require 32 and 16 SP tables for encryption and decryption in the wide and combined-wide table implementation methods, respectively. Furthermore, these methods employ several bit mask, shift and XOR operations to apply a lookup operation on a specified data nibble or byte.
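To make the per-bit cost of the direct method concrete, the following C sketch spells out the four sub-operations for every bit of the state. It is an illustrative version under stated assumptions, not the code measured in this paper: the function names are hypothetical, and the table lookup is replaced by the closed form of the PRESENT permutation from the published specification, P(i) = 16·i mod 63 for i < 63 and P(63) = 63.

```c
#include <stdint.h>

/* PRESENT bit permutation in closed form:
 * P(i) = 16*i mod 63 for i < 63, and P(63) = 63. */
static unsigned present_p(unsigned i)
{
    return (i == 63u) ? 63u : (16u * i) % 63u;
}

/* Direct method: every one of the 64 bits costs a mask generation,
 * an AND, a shift to the new position and an XOR into the result. */
uint64_t present_player_direct(uint64_t state)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < 64; i++) {
        uint64_t mask = 1ULL << i;                /* 1. generate the mask */
        uint64_t bit  = state & mask;             /* 2. masking (AND)     */
        bit = (bit >> i) << present_p(i);         /* 3. shift to P(i)     */
        out ^= bit;                               /* 4. merge (XOR)       */
    }
    return out;
}
```

For a 64-bit block this amounts to 64 iterations of the four sub-operations per round, which is the overhead the paragraph above refers to.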

GIFT block cipher
Although PRESENT is a hardware-oriented cipher, it is not strongly resistant to linear cryptanalytic attacks. Also, it utilises an s-box with a high branch number, which proves costlier in terms of area footprint. Therefore, Banik et al. worked towards designing a comparatively lightweight and more secure cipher and, in 2017, proposed an improved version of PRESENT named the GIFT block cipher. Unlike PRESENT, which uses an s-box with branch number 3, GIFT uses an s-box with a reduced branch number of 2 that proves more area- and cost-efficient and is more resistant to linear cryptanalysis.
GIFT is an SbPN-based symmetric block cipher with two versions, GIFT64/128 and GIFT128/128, having a fixed key length of 128 bits with varying block sizes and rounds. For block sizes of 64 and 128 bits, 28 and 40 rounds are used, respectively. The GIFT64/128 (Fig. 2) encryption process utilises a key-alternating construction with two keyless (SubCells and PermBits) sub-stages and one keyed (AddRoundKey) sub-stage:
SubCells: The 64-bit input state is applied nibble-wise to the 4 × 4 static s-box to offer optimum confusion and a small amount of diffusion. Each nibble 'X' is replaced with 'S[X]' using the pre-defined s-box mapping shown in Table 3.
PermBits: This layer performs a bit-wise 64-bit permutation on the output bits of the 16 parallel s-boxes. A bit at index position i is moved to the P(i) bit position as per the permutation table given in Table 4.
AddRoundKey: The diffused state bits from the PermBits stage are XORed with the sub-round key bits and round constants. Each sub-round key, generated by the key scheduling algorithm using simple extraction and circular shift operations, is 32 bits in size. Therefore, only 32 bits out of the 64-bit state are bit-wise XORed with the sub-round key for greater hardware efficiency. This saves the computational costs associated with the XOR operations, making the cipher efficient in hardware and software.
AddRoundConstants: Six input state bits are bit-wise XORed with six round constants. In addition, bit b63 is XORed with '1'.
GIFT decryption involves using a 4 × 4 static inverse s-box and an inverse p-layer to offer confusion and diffusion in the cipher. The round keys generated by the key generation algorithm are applied in reverse order to obtain the original message.
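The PermBits layer and its inverse can be sketched in the same direct style. The sketch below is illustrative only: the function names are hypothetical, and instead of the lookup table it uses the closed form of the GIFT-64 bit permutation given in the GIFT specification, P(i) = 4⌊i/16⌋ + 16((3⌊(i mod 16)/4⌋ + (i mod 4)) mod 4) + (i mod 4), which should be checked against Table 4 before being relied upon.

```c
#include <stdint.h>

/* GIFT-64 bit permutation in closed form (assumed to match Table 4). */
static unsigned gift64_p(unsigned i)
{
    return 4u * (i / 16u)
         + 16u * ((3u * ((i % 16u) / 4u) + (i % 4u)) % 4u)
         + (i % 4u);
}

/* PermBits: the bit at position i moves to position P(i). */
uint64_t gift64_permbits(uint64_t state)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < 64; i++)
        out |= ((state >> i) & 1ULL) << gift64_p(i);
    return out;
}

/* Inverse PermBits: output bit i is taken from input bit P(i). */
uint64_t gift64_inv_permbits(uint64_t state)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < 64; i++)
        out |= ((state >> gift64_p(i)) & 1ULL) << i;
    return out;
}
```

Since the permutation is a bijection on the 64 bit positions, the inverse layer needs no separate table: it reads the same mapping in the opposite direction.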
The following methods exist for performing permutation in the GIFT block cipher:
• Direct Method: In the direct implementation, the permutation layer takes the output state from the s-box as input. It performs diffusion using the bit-rotation method involving several mask generation, masking, shifting and XOR sub-operations.
• Bit-Slicing Method: In the bit-slicing technique, diffusion is performed using masking, shift and XOR steps simultaneously on the bits in a given slice. Permuting the bits in multiple slices therefore requires multiple such operations, resulting in a higher cycle count and delayed execution. Bit-sliced permutation can also be performed by transposing and then subjecting each slice to different row-swapping operations determined by the slice number.
• Fix-Slicing Method: In the fix-slicing method, the first slice is not subjected to any diffusion operation, whereas the remaining three slices undergo row-wise and column-wise rotations.
The direct implementation method proves to be a memory-efficient technique. However, it incurs a substantial overhead for ciphers with large block sizes because of the many mask, shift and XOR sub-operations required.
The bit-sliced computational process is more straightforward and faster because the plaintext block is divided into multiple slices. It also permits the processing of multiple blocks in parallel. However, it can prove inappropriate for low-end IoT devices that usually work with much smaller payloads. Also, substantial overhead is associated with the diffusion layer, as bits must be transposed in the slice individually rather than in large chunks, making it computationally intensive. Even though bit-slicing can improve speed, the overheads associated with packing and unpacking data at the start and the end of the encryption and decryption processes make the process quite resource-consuming for ultra-lightweight devices. Furthermore, this method uses several general-purpose registers to store the transposed bits of a given message. Unfortunately, low-end IoT processors often have a minimal number of such registers, thereby increasing the number of load and store instructions that degrade the overall performance. Moreover, the bit-sliced permutation in the GIFT cipher involves multiple mask, shift and XOR operations, thereby incurring a large computational overhead in terms of the number of cycles.
The fix-slicing technique saves multiple operations by replacing the transposition and row-switching operations with row and column rotations, thereby increasing the speed of the cipher. It also takes advantage of the barrel shifter available in the ARM Cortex architecture for performing multi-bit rotations in a single clock cycle, thereby making the implementation of the linear layer less costly. However, the round keys and round constants need to be modified as per the new bit positions, which incurs additional computational overheads.

Bit-band memory
ARM Cortex-M (Banday 2018; Rouf et al. 2022) are 32-bit processors primarily designed for deeply embedded microcontrollers and IoT market spaces. These low-powered processors feature several energy modes, barrel shifters and pipelined architectures. This makes them suitable for diverse low-power and low-latency IoT applications. Furthermore, these processors are based on the ARMv7 instruction set architecture and support the Thumb-2 instruction set, which includes a mix of 16- and 32-bit instructions, making them highly suitable for high-performance and memory-deficient IoT applications (Schwabe and Stoffelen 2017; Kim et al. 2022). In addition, ARM Cortex-M processors have optional support for bit manipulation using the bit-banding feature. Unlike other processors, which include separate bit-manipulation processors or use specific instructions to perform bit-level manipulations that increase the overall design cost, these processors incorporate a unique bit-banding feature that uses two memory regions, bit-band and bit-band alias, to support bit-wise operations. Regular access to the bit-band region results in a word read or write operation. On the other hand, a normal read or write to the bit-band alias region results in single-bit access in the corresponding bit-band region. This is because each bit in the bit-band region is mapped to a word (more specifically, to the least significant bit of the 32-bit word) in the corresponding bit-band alias area.
In ARM Cortex-M3 processors, two bit-band regions are set aside for performing the bit-band operations. These are located in the first 1 MB of the SRAM region and the first 1 MB of the peripheral region, with base addresses 0x20000000 and 0x40000000, respectively. The corresponding two bit-band alias regions are in the SRAM and peripheral regions, with base addresses 0x22000000 and 0x42000000, respectively. Each bit-band alias region is 32 MB in size because each bit in the bit-band region requires a word (32 bits) in the bit-band alias region.
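The relationship between a bit in a bit-band region and its alias word follows the standard Cortex-M3 rule: alias address = alias base + (byte offset × 32) + (bit number × 4). The helper below is a minimal sketch of that address arithmetic only; the function and macro names are hypothetical, and it computes addresses without accessing memory, so it can be checked on any host.

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_BB_BASE     0x20000000u  /* SRAM bit-band region base        */
#define SRAM_ALIAS_BASE  0x22000000u  /* SRAM bit-band alias base         */
#define PERI_BB_BASE     0x40000000u  /* peripheral bit-band region base  */
#define PERI_ALIAS_BASE  0x42000000u  /* peripheral bit-band alias base   */
#define BB_REGION_SIZE   0x00100000u  /* each bit-band region is 1 MB     */

/* Return the alias word address for bit `bit` (0..7) of the byte at
 * `byte_addr`, which must lie inside one of the two bit-band regions. */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t bb_base, alias_base;

    assert(bit < 8u);
    if (byte_addr - SRAM_BB_BASE < BB_REGION_SIZE) {
        bb_base = SRAM_BB_BASE;
        alias_base = SRAM_ALIAS_BASE;
    } else {
        assert(byte_addr - PERI_BB_BASE < BB_REGION_SIZE);
        bb_base = PERI_BB_BASE;
        alias_base = PERI_ALIAS_BASE;
    }
    /* 8 bits per byte x 4 bytes per alias word = 32 alias bytes per byte. */
    return alias_base + (byte_addr - bb_base) * 32u + bit * 4u;
}
```

For example, bit 0 of the byte at 0x2007C000 maps to the alias word at 0x22F80000, and the factor of 32 explains why a 1 MB bit-band region needs a 32 MB alias region.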
Bit-banding offers several advantages. First, it simplifies the bit write and read operations by working directly on the appropriate bit-band alias location corresponding to a specific bit in the bit-band region. It performs bit manipulation in a single cycle. Unlike conventional bit modification involving read, modify and write sub-tasks, bit-banding permits atomic, uninterrupted and error-free bit operations. This also prevents conflicts in the case of multiple tasks using shared memory (Yiu 2014). Also, a single-bit manipulation operation is realised using a single load or store instruction, which results in faster bit manipulations (Bai 2015). Further, it simplifies the execution of several conditional branching operations by reading a specific bit-band alias location instead of reading and masking 32 bits in the bit-band region, thereby speeding up the branching decisions (Tahir and Javed 2017).

Compiler optimisation
One of the design approaches to achieve software efficiency is to employ the optimisation techniques available with compilers. This method either makes the design code-efficient, with reduced RAM and ROM utilisation, or helps enhance the execution speed. In addition, using optimisation techniques can help run programs faster without changing the code. The compiler applies techniques such as precomputation of values, function inlining, loop unrolling and reordering of code statements to produce a faster binary. However, the downside of compiler optimisation is that it can make the program hard to debug. With lower optimisation levels, detailed information about the program can be viewed, which can then be used to track down bugs in the code. With higher optimisation levels, this information becomes more restricted, which hinders debugging to a greater extent. However, these levels permit high-speed or low code footprint optimisations. Thus, it is recommended to use lower or no optimisations while developing the algorithm and switch to higher optimisations once the code is ready for release.
Several compiler optimisation options are available with the KEIL MDK Integrated Development Environment (Table 5). They optimise the program either for code size or for performance, and opting for one metric degrades the other. Depending on the type of application and the constraints involved, one can select the appropriate optimisation level(s).

Related work
Ruby Lee (Lee 1989) used the EXTRACT and DEPOSIT bit manipulation instructions available with the PA-RISC Precision Architecture processors to perform bit permutations. The results reported the requirement of only two instructions for performing a one-bit permutation, thus resulting in a 50% reduction in instruction count compared to the bit-rotation method that requires four instructions to perform a single-bit permutation.
Eli Biham (1997) presented a high-speed software-friendly bit-sliced implementation of the DES block cipher that resulted in a two-fold increase in its execution speed. Furthermore, an average requirement of 100 gates has been reported for the hardware implementation of one s-box.
Matthew Kwan et al. (2000) coined the term bit-slicing and used this method to improve on Biham's work, with 56 gates required for a single s-box implementation. Matsui et al. (2007) provided improved results for the AES block cipher using a bit-sliced implementation on the Intel Core2 processor architecture. The results report reduced execution time requirements for the proposed implementation compared to the table-based AES implementation. Bogdanov et al. (2007) proposed PRESENT, a hardware-oriented block cipher for highly constrained devices. It has a fixed block size of 64 bits and a key size of either 80 or 128 bits. Both versions use 31 SPN rounds with a post-key-whitening step at the end.
In his thesis, Poschmann (2009) provides code- and speed-optimised implementations of the PRESENT block cipher for diverse platforms with 8-, 16- and 32-bit processors. It also uses the narrow table approach for the s-box implementation, which, despite being efficient in software, is prone to cache timing attacks. Benadjila et al. (2014) performed bit-sliced implementations of several block ciphers using the SIMD instructions and vectorisation features available with the Intel x86 platforms. The results report increased speed for the analysed ciphers, including PRESENT. Papapagiannopoulos et al. (2014) implemented various block primitives in a bit-sliced manner on the ATtiny family of AVR platforms. Improved results have been reported for the PRESENT cipher by utilising the [...] (2017) presented a timing attack-resistant masked implementation of the PRESENT block cipher. Furthermore, the implementation involves decomposing the linear layer and realising the s-box in a bit-sliced manner using optimised boolean functions. On 32-bit ARM Cortex processors, an 8% improvement in execution speed is reported for the cipher, requiring 2100 cycles compared to that provided by FELICS.
Dinu et al. (2019) evaluated crypto ciphers in terms of a figure of merit calculated from various metrics such as time, RAM and ROM footprint. Nineteen block primitives have been comparatively analysed on AVR, MSP430 and ARM, which are 8-, 16- and 32-bit platforms, respectively. In the case of the PRESENT block cipher, a time-efficient implementation has been carried out utilising the combined substitution and permutation tables. Adomnicai et al. (2020) proposed a software-friendly implementation technique for the GIFT block cipher named fix-slicing. The method uses a few rotations realised using the barrel shifter available with the ARM Cortex-M3 processors. The results report faster execution, requiring 800 and 1300 cycles for GIFT-64 and GIFT-128, respectively, compared to the AES and PRESENT ciphers requiring 1617 and 2116 cycles, respectively. Adomnicai et al. (2020) also applied the fix-slicing technique to the AES block cipher. Compared to the bit-sliced AES, the results report a 52% reduction in diffusion operations using the fix-sliced AES implementation technique, requiring only 81 cycles for a single-byte encryption on 32-bit processors.
Further, many software-efficient ciphers have been proposed, such as RECTANGLE (Zhang et al. 2015), a 4/64 SbPN cipher with a structure similar to the GIFT and PRESENT ciphers. RECTANGLE uses shift rows to realise the diffusion layer, which is more software-friendly than the bit-rotation method used in direct implementations of the PRESENT and GIFT block ciphers. However, as far as its security is concerned, not much analysis has been reported regarding how the linear and differential trails propagate in the RECTANGLE cipher. Also, its key scheduling algorithm is more complex than those of the PRESENT and GIFT primitives. Furthermore, four rounds are required to attain full diffusion in the RECTANGLE cipher, whereas the same is attained in only three rounds in the case of the PRESENT and GIFT block ciphers.
Although the works mentioned above have made cipher implementations more efficient in software to a certain extent, their associated overheads make them a financially and resource-wise non-viable option for securing constrained smart IoT applications. These overheads include a larger instruction count, higher memory, power and time requirements, and the reliance on specific bit-manipulation instructions, which significantly increases the cost of the development platforms. This necessitates designing a novel, software-friendly, cost-effective implementation technique for securing diverse low-energy, high-performance and low-latency IoT applications. Further, the digital world around us is mostly embedded in nature, and as such, using only software-efficient or only hardware-efficient crypto primitives cannot be considered a favourable design approach for securing low-end devices. There is a need to address the security concerns of smart embedded applications holistically, considering both the hardware and software aspects. This paper proposes a novel software-efficient implementation method for hardware-efficient SbPN ciphers to make these primitives more accessible for use in a wide range of embedded devices, particularly those with limited resource availability.

Proposed work
This paper proposes a novel software-friendly implementation technique for performing the bit-wise permutation in SbPN ciphers by employing the bit-banding feature of ARM Cortex-M processors. A simple, high-speed mapping between the bit-band alias regions is performed to achieve bit-level diffusion.
All the steps involved in the encryption round function of the PRESENT64/80 cipher (Fig. 3), except the diffusion, are performed in a manner similar to the direct implementation method. First, the 64-bit XORed output is provided to 16 (4 × 4) s-boxes that provide the confused 64-bit output state. This forms the input to the diffusion layer (P). Then, the diffusion layer in the proposed bit-banding approach is implemented as per the pseudocode (Algorithm 1) using the following steps:
Step 1 Initialise the permutation table, P_t (as shown in Table 2), and store it in memory.
Step 2 Declare two bit-band memory areas, P and Q, each 8 bytes wide.
Step 3 Store the 64-bit output state from the sixteen s-boxes in the 'P' bit-band memory of SRAM.
In the ARM Cortex-M3-based LPC1768 IoT hardware platform, the SRAM bit-band region starts from 0x20000000. However, the locations from 0x20000000 to 0x2007BFFF are reserved. The input state bits to the linear layer are therefore stored starting from location 0x2007C000. Sixteen nibbles require eight byte locations for storage. As such, bits b0 and b63 occupy the LSB of 0x2007C000 and the MSB of 0x2007C007, respectively, in the bit-band region.
Step 4 Use the following mapping formula to fill in the Q_a bit-band alias memory locations with the permuted state bits.
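Consistent with the definitions that follow and with the rule that the bit at index i moves to position P(i), the mapping can be written in the form below. This is a reconstruction from the surrounding text, with P_a and Q_a indexed in units of 32-bit alias words:

```latex
Q_a\bigl[P_t[i]\bigr] \leftarrow P_a[i], \qquad i = 0, 1, \dots, 63
```

In byte addresses, this means the alias word at Q_a + 4·P_t[i] receives the value read from the alias word at P_a + 4·i.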
where P_a and Q_a are the base addresses of the bit-band alias regions storing the input and output of the permutation layer, respectively, P_t is the array of permutation values (as given in Table 2), and i represents the bit number varying from 0 to 63. In the bit-band alias region, each bit is represented by a 32-bit word, so storing a 64-bit state requires 64 × 4 = 256 bytes. In the 'P_a' bit-band alias region, b0 is stored at 0x22F80000 through 0x22F80003, and b63 occupies 0x22F800FC to 0x22F800FF (that is, the four bytes starting at 0x22F80000 + 63 × 4). Similarly, in the 'Q_a' bit-band alias area, memory locations 0x22F80100 to 0x22F80103 store the permuted bit b0, and the permuted bit b63 occupies the four locations starting from 0x22F801FC. In this mapping process, the Q bit-band region from 0x2007C008 to 0x2007C00F, corresponding to the Q_a bit-band alias area, is automatically filled with the permuted output state.
In addition to the mapping formula, a scatter file is exclusively used to direct the linker to set aside the particular SRAM regions for the permutation function to avoid memory conflicts during program execution.
Step 5 Return the 64-bit permuted output state from the Q bit-band area for further processing by the following rounds.
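The five steps above can be simulated on a host machine with the sketch below. It is not the code evaluated in this paper: the function names are hypothetical, the permutation table is generated from the closed form P(i) = 16·i mod 63 (with P(63) = 63) rather than typed in from Table 2, and the two alias regions are modelled as ordinary arrays of 32-bit words. On the LPC1768, `pa` and `qa` would instead be pointers to the alias addresses of the P and Q areas, and the expand and collapse helpers would be unnecessary because the hardware maintains the alias view.

```c
#include <assert.h>
#include <stdint.h>

#define STATE_BITS 64

/* Step 1: build the PRESENT permutation table P_t. */
void present_table(uint8_t pt[STATE_BITS])
{
    for (unsigned i = 0; i < STATE_BITS; i++)
        pt[i] = (uint8_t)((i == 63u) ? 63u : (16u * i) % 63u);
}

/* Host-side stand-in for the hardware: place each state bit in the
 * least significant bit of its own 32-bit alias word. */
static void alias_expand(uint64_t state, uint32_t alias[STATE_BITS])
{
    for (unsigned i = 0; i < STATE_BITS; i++)
        alias[i] = (uint32_t)((state >> i) & 1u);
}

/* Host-side stand-in for the hardware: gather bit 0 of every alias
 * word back into a packed 64-bit state. */
static uint64_t alias_collapse(const uint32_t alias[STATE_BITS])
{
    uint64_t state = 0;
    for (unsigned i = 0; i < STATE_BITS; i++)
        state |= (uint64_t)(alias[i] & 1u) << i;
    return state;
}

/* Steps 2 to 5: the copy loop moves one alias word per bit and
 * contains no mask, shift or XOR of its own. */
uint64_t bitband_permute(uint64_t state, const uint8_t pt[STATE_BITS])
{
    uint32_t pa[STATE_BITS];        /* alias view of area P (input)  */
    uint32_t qa[STATE_BITS] = {0};  /* alias view of area Q (output) */

    alias_expand(state, pa);                  /* Step 3 */
    for (unsigned i = 0; i < STATE_BITS; i++) {
        assert(pt[i] < STATE_BITS);
        qa[pt[i]] = pa[i];                    /* Step 4 */
    }
    return alias_collapse(qa);                /* Step 5 */
}
```

The design point the simulation highlights is that the per-bit work reduces to one load and one store, which is where the saving over the four sub-operations of the direct method comes from.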
Similarly, the decryption phase (Fig. 4) involves the following mapping formula between the inverse permutation layer's input and output state bits.
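Assuming the same table P_t is reused and that P_a and Q_a now denote the alias base addresses of the inverse layer's input and output (again indexed in 32-bit alias words), one form consistent with the encryption mapping is:

```latex
Q_a[i] \leftarrow P_a\bigl[P_t[i]\bigr], \qquad i = 0, 1, \dots, 63
```

Here output bit i is read from input position P_t[i], which undoes the forward move of bit i to position P_t[i]. An equivalent formulation would keep the forward-style assignment and use a separate inverse permutation table instead.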
The proposed method for performing the bit-banding-based permutation has been illustrated for the PRESENT block primitive. The GIFT block cipher (Figs. 5, 6) follows a bit-banding approach similar to that used for the PRESENT block cipher.

Methodology
The PRESENT and GIFT lightweight block primitives have been chosen to evaluate the proposed permutation because both ciphers offer a good balance between efficiency, security and hardware simplicity. They are used to secure RFID tags, wireless sensor networks and other low-end embedded smart IoT applications for which resource-intensive ciphers like AES are not usually preferable (Bogdanov et al. 2007). The PRESENT and GIFT block ciphers have been implemented on a 32-bit ARM Cortex-M3-based LPC1768 development board, which has 64 kB of RAM and 512 kB of ROM and operates at a core clock frequency of 100 MHz. The onboard 20-pin JTAG, 10-pin and 20-pin Cortex connectors permit real-time debugging and tracing of the programs. KEIL MDK has been used on the host side as an integrated development environment to observe, analyse, verify and optimise the algorithms.

Fig. 4 Bit-band method of performing inverse permutation for the PRESENT block cipher

Fig. 5 Bit-band method of performing permutation for the GIFT block cipher

Fig. 6 Bit-band method of performing inverse permutation for the GIFT block cipher

The flashing, debugging and tracing of the algorithms have been carried out using advanced debug adapters from ARM, namely ULINKpro and ULINKplus. The RAM and ROM usage of the primitives can be obtained using either debug adapter. Power measurement has been performed specifically with the ULINKplus debug adapter. The streaming trace capability of ULINKpro permits complete module-level and function-level instruction tracing over longer runs, thus providing detailed execution timing information. The energy consumption of the ciphers has been calculated as Energy (in µJ) = Power (in mW) × Time (in ms). Several compiler optimisation levels have been used to increase the code and speed efficiency of the direct and proposed implementation methods. In addition, a highly optimised set of libraries known as micro-lib has been used, which helps reduce the overall flash footprint of the block cipher primitives to a marginal extent.
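The energy relation used in this section, and the kind of percentage comparison reported in the results, can be expressed as two small helpers. This is only an illustrative sketch: the function names are hypothetical, the percentage formula is an assumption about how the reported differences are computed, and the numbers are placeholders rather than measurements from the paper.

```python
def energy_uj(power_mw, time_ms):
    """Energy in microjoules from power in milliwatts and time in milliseconds."""
    return power_mw * time_ms  # mW x ms = uJ


def percent_reduction(baseline, proposed):
    """Reduction of `proposed` relative to `baseline`, in percent (assumed definition)."""
    return (baseline - proposed) / baseline * 100.0


# Placeholder values for illustration only, not results from the paper.
direct_energy = energy_uj(power_mw=100.0, time_ms=2.0)
bitband_energy = energy_uj(power_mw=100.0, time_ms=0.5)
print(direct_energy, bitband_energy)
print(percent_reduction(direct_energy, bitband_energy))
```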

Results and discussions
The simple mapping between the bit-band and bit-band alias regions, which precludes multiple mask, shift and XOR operations, makes the bit-band permutation method the more time-efficient of the two approaches.
Table 6 tabulates the results for various performance metrics such as power, energy, execution time and memory (RAM and ROM) utilisation for the PRESENT block cipher implemented using the proposed bit-banding technique and the direct method.
The percentage difference in various lightweight metrics for the bit-band and direct implementation methods has been reported for a better comparative analysis. In addition, separate computations for the encryption and decryption phases of the PRESENT block primitive have been listed. For the encryption part, the maximum improvement has been reported for the execution time, with the bit-band technique requiring 68.96% less time than the direct method. This is followed by 42.43% and 17.82% reductions in energy and power consumption, respectively. The only downside of the bit-band technique is a comparatively higher memory requirement in terms of RAM and ROM footprints. Similar trends have been observed for the decryption results, with the bit-band method outperforming the direct method by 27.31%, 45.15% and 82.53% in power, energy and time requirements, respectively. However, the bit-band method entails a slightly larger memory size than the direct approach.
Figure 7 presents the encryption results of various evaluation metrics for the direct and bit-band implementations of the PRESENT block cipher, executed with the different optimisation levels (as listed in Table 5) available in the KEIL IDE. This comparison assesses the performance of the proposed technique under different compiler optimisation levels.
For the bit-band method, significant improvements have been obtained, with 68.58% and 86.63% reductions in power and energy consumption at the -O2 level compared to the -O0 level. Furthermore, a speed improvement of 56.45% has been attained using the time-optimised -O3 level. Moreover, the memory overhead of the bit-band-based permutation technique is reduced by 46.02% with the -Oz image size optimisation level. For the direct method, the -O2 level improves power and energy consumption by 53.163% and 78.917%, respectively, compared to the -O0 level. Also, with the balanced -Os level, a 69.14% reduction in execution time has been observed. Finally, a decrease of more than 50% in memory footprint is obtained using the -Oz image size optimisation level.
Figure 8 presents the decryption results for the PRESENT cipher run with different compiler optimisation levels (Table 5). In the case of the bit-band method, 84.83% and 93.97% improvements in power and energy consumption have been observed with the -O2 level compared to the -O0 level. A 78.49% reduction in decryption time is reported with the -O1 level. The -Oz image size level reduces the memory requirements by half. For the direct method, as compared to the -O0 level, 70.01% and 89.94% decreases in power and energy consumption have been obtained using the -O2 level. A 51.29% decrease in memory requirements has been possible with the -Oz image size level. A reduction of more than 70% in decryption time is achieved with the -O2, -O3 and -Ofast optimisation levels.

Table 7 presents the performance evaluation results in terms of various metrics for the GIFT block cipher, implemented using the direct and the proposed bit-band methods on the LPC1768 development board. It also enumerates the percentage difference in various lightweight metrics for the bit-band and direct implementation techniques.
For encryption, the execution time shows the maximum improvement, with the bit-band method requiring 56.42% less time than the direct method. This is followed by 14.76% and 4.25% reductions in energy and power requirements, respectively. Again, all these improvements in the bit-band method come at the cost of a relatively higher memory footprint than the direct method. For decryption, a similar trend is followed, with 1.11%, 11.28% and 10.28% improvements in the power, energy and time metrics, respectively. However, the memory size is comparatively larger for the bit-band method than for the direct method.
Figure 9 depicts the comparative encryption results for various lightweight metrics of the direct and bit-band implementations for the GIFT block cipher run with different optimisation levels (Table 5).
Remarkable improvements have been attained for all metrics of the bit-band method, with 90.38% and 98.57% reductions in power and energy consumption reported with the -O2 level compared to the -O0 level. In addition, the execution time has been reduced by 70.90% using the high-speed -O3 and -Ofast levels. Moreover, a decrease of more than 50% in the memory footprint has been achieved using the most code-efficient -Oz image size optimisation level. For the direct method, with the -O3 level, 84.39%, 97.73% and 89.89% reductions in power, energy and time utilisation have been reported in comparison with the -O0 results. Also, a 32.752% reduction in memory size has been achieved using the -Oz image size level.

Figure 10 shows the direct and bit-band decryption results for the GIFT cipher using various optimisation levels (Table 5).
For the proposed method, the -O3 level reports 60.62% and 93.1% decreases in power and energy requirements relative to the -O0 level. Also, a 24.85% reduction in memory size is possible with the -Oz image size level. The -O0, -O3, -Ofast and -Oz image size levels report almost the same decryption times. For the direct method, in comparison to the -O0 level, an 85.77% reduction in decryption time is attained using the high-speed -O3 and -Ofast optimisation levels. A 34.19% reduction in memory footprint has been reported for the -Oz image size level. Decreases of 67.18% and 95.33% in power and energy consumption have been possible with the -Ofast optimisation level.

From the results obtained, it can be inferred that the proposed bit-banding method for performing permutations in the PRESENT and GIFT SbPN ciphers is highly efficient in energy, power and execution time. In the direct method, each sub-operation involved in the bit-rotation method adds to the instruction count, increasing the number of instruction fetch, decode, execute and write-back operations. This is more apparent in lightweight SbPN ciphers with large block sizes and a larger number of rounds. For the PRESENT and GIFT primitives, with a block size of 64 bits, the input to the diffusion layer is large. Each bit transposition requires at least four sub-operations, viz., mask generation, masking (AND), shifting by a specified number of bits and XORing the diffused bit state with the original input state. This amounts to 4 × 64 = 256 such operations for realising a single-round permutation. For the PRESENT cipher, with 31 rounds involving the permutation operation on the 64-bit state, 64 × 4 × 31 = 7936 such operations must be carried out by a low-end IoT device. Similarly, for the 28-round GIFT 64/128 cipher, 64 × 4 × 28 = 7168 sub-operations are required.
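The sub-operation count described above can be reproduced with a short host-side model of the direct method. It is a sketch, not the authors' implementation: the function names are hypothetical, the four steps per bit follow the description in the text, and the PRESENT table is generated from the published closed form P(i) = 16 · i mod 63 (with P(63) = 63), which should agree with Table 2. GIFT uses a different table, but the per-round count is the same for any 64-bit permutation.

```python
PRESENT_ROUNDS = 31
GIFT_ROUNDS = 28


def present_table():
    """PRESENT bit permutation: bit i moves to position 16*i mod 63 (63 stays)."""
    return [(16 * i) % 63 if i < 63 else 63 for i in range(64)]


def direct_permute(state, table):
    """Bit-by-bit permutation; returns (permuted state, sub-operation count)."""
    out = 0
    ops = 0
    for i, target in enumerate(table):
        mask = 1 << i                       # 1. mask generation
        bit = state & mask                  # 2. masking (AND)
        if target >= i:                     # 3. shift to the target position
            moved = bit << (target - i)
        else:
            moved = bit >> (i - target)
        out ^= moved                        # 4. XOR into the output state
        ops += 4
    return out, ops


permuted, ops_per_round = direct_permute(0x0123456789ABCDEF, present_table())
print(ops_per_round)                   # 256
print(ops_per_round * PRESENT_ROUNDS)  # 7936
print(ops_per_round * GIFT_ROUNDS)     # 7168
```

The count is a lower bound on the work done by real firmware, since loop control and table look-ups add further instructions on top of the four sub-operations per bit.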
Contrary to this high-instruction-count and resource-exhaustive bit-rotation method, bit-banding is a software-efficient linear layer implementation technique. Moreover, this method does not involve using bit-manipulation instructions to perform the diffusion, offering a cost-saving option for low-end IoT processors. Instead, a simple mapping between the bit-band region and its corresponding bit-band alias region is all that is necessary to perform the bit-wise permutation, thus not only saving chip space on processors but also leading to faster execution and reduced power and energy consumption. Implementing the diffusion layer using the proposed technique not only reduces the instruction count but also results in a significant decrease in the lightweight design metrics of power, energy and timing, making SbPN ciphers well suited to low-cost, low-power and low-latency applications. Above all, the proposed method is only an implementation strategy and does not modify the structure of the primitives.

The drawback of the bit-banding method is its comparatively large memory requirement, because each bit in the bit-band region corresponds to 32 bits in the bit-band alias region of SRAM. As such, the 64-bit input and output states of the permutation layer together occupy 2 × 64 × 4 = 512 bytes in the bit-band alias memory. This makes the bit-band permutation less memory-efficient than the direct method; however, this memory requirement is much smaller than what is available on most IoT devices.

Conclusion
This paper presents a highly software-efficient method for performing bit-permutation-based diffusion using the bit-banding bit-manipulation technique available on leading-edge ARM Cortex-M processors. A simple mapping between the bit-band region and its corresponding bit-band alias region is all that is necessary to perform the bit-wise permutation, thus saving chip space on processors and leading to faster execution with reduced power and energy consumption. Compared with the direct implementation methods for the PRESENT and GIFT ciphers, the bit-banding technique reports substantial reductions in power, energy and time requirements. All these improvements result from a decreased instruction count and a fast mapping between the bit-band and bit-band alias regions. The only drawback of this method is an increase in the memory footprint, which is not much of a concern for ARM Cortex-M-based smart IoT devices. Furthermore, the proposed technique has been subjected to various compiler optimisation levels available with the KEIL MDK IDE. The results have shown that, with the -O2 level, the GIFT and PRESENT block ciphers significantly improve in energy and power efficiency, whereas -O3 and -Ofast speed up the cipher designs by a considerable margin. Moreover, high code efficiency is attained with the -Oz image size optimisation level, but at the cost of an increase in execution time. Although the proposed technique has been implemented to improve the

Fig. 2 Encryption process of the GIFT block cipher

Fig. 7 Performance comparison between direct and proposed implementation techniques for the PRESENT block cipher (encryption) utilising different compiler optimisation levels

Fig. 8 Performance comparison between direct and proposed implementation techniques for the PRESENT block cipher (decryption) utilising different compiler optimisation levels

Fig. 9 Performance comparison between direct and proposed implementation techniques for the GIFT block cipher (encryption) utilising different compiler optimisation levels

Fig. 10 Performance comparison between direct and proposed implementation techniques for the GIFT block cipher (decryption) utilising different compiler optimisation levels

Table 1 S-box of the PRESENT block cipher

Table 2 Permutation table of the PRESENT block cipher

Table 3 S-box of the GIFT block cipher

Table 4 Permutation table of the GIFT block cipher

Table 5 Advantages and drawbacks of various compiler optimisation levels

Table 6 Performance comparison of the proposed implementation technique for the PRESENT block cipher

Table 7 Performance comparison of the proposed implementation technique with the direct method for the GIFT