Revisiting frequency-smoothing encryption: new security definitions and efficient construction

Deterministic encryption (DET) allows for fast retrieval of encrypted information, but it leaks the frequency information of the underlying data, which enables an array of inference attacks. Simply replacing DET with fully randomized encryption is often undesirable for an encrypted database since it incurs a large query and storage overhead. Frequency Smoothing Encryption (FSE) is a practical encryption scheme that protects frequency information. Current FSE constructions still fall short in both efficiency and a reasonable security definition. We revisit FSE and propose two security definitions, one from a theoretical and one from a practical perspective. Furthermore, we adopt a novel partitioning strategy to construct a new FSE scheme with improved performance. Experimental results show that, compared with existing schemes, our scheme achieves excellent query performance while attaining security against inference attacks.


Introduction
In the era of big data, encrypted databases (Poddar et al. 2019; Popa et al. 2011; Tu et al. 2013; Arasu et al. 2013; Antonopoulos et al. 2020; Zhu et al. 2021) serve as a vital component of the cloud computing infrastructure to protect users' data privacy. In the construction of encrypted database systems, Property-Preserving Encryption (PPE) and Deterministic Encryption (DET) are often adopted to allow for efficient retrieval directly on encrypted data. These algorithms leak frequency information about the underlying datasets, which severely undermines the security of the encrypted databases. Inference attacks (Naveed et al. 2015; Grubbs et al. 2017; Cash et al. 2015; Bindschaedler et al. 2018) can recover the plaintext with the frequency information. Fully randomized encryption can completely eliminate such security risks. However, when applying it in the settings of database systems, the cost of security might outweigh its benefits as the query complexity is quite formidable. Frequency Smoothing Encryption (FSE) (Lacharité and Paterson 2018) aims to balance the tradeoff between security and efficiency. In contrast to conventional fully randomized encryption, FSE allows for repetition of ciphertexts for a specific plaintext, while subject to rigorous control on the number of repetitions by the scheme's security parameter. Therefore, we highlight the significance of FSE in the development of secure and efficient database systems.
There are no proper security definitions for FSE. The inability of FSE schemes to achieve indistinguishability under chosen-plaintext attack (IND-CPA) also poses a significant challenge. The problem is how to quantify or constrain the leakage, which has not been satisfactorily addressed in prior FSE schemes. Lacharité et al. (Lacharité and Paterson 2018) proposed the homophonic encoding approach and modeled the adversary using the Kolmogorov-Smirnov test (Massey Jr 1951). This model does not provide a clear understanding of the information leakage, and the authors only empirically evaluate it through the robustness against inference attacks (e.g., the MLE attack (Lacharité and Paterson 2018)). The Poisson salt allocation scheme described in (Pouliot et al. 2019) does not offer any security metric that measures the leakage. This scheme is vulnerable to a form of frequency analysis based on the knapsack problem. Grubbs et al. (Grubbs et al. 2020) use fake queries to protect the access pattern of the key-value store. Their work does not provide a proper security definition either.
The lack of proper security definitions for FSE contributes to the poor performance of previous schemes. The time complexity for querying a plaintext using the homophonic encoding strategy depends on a security parameter. As the security parameter decreases, the number of unique ciphertexts for a given plaintext increases. Although the security parameter is adjustable, as previously mentioned, this parameter provides very limited information about the scheme's ability to restrict information leakage. In addition, the salt sampling stage as well as the query in the Poisson salt allocation scheme can be relatively slow if the exponential distribution parameter is large. The data owner may choose an inappropriate value of the security or distribution parameter that causes a dramatic increase in query complexity.

Our work
We propose two proper security definitions that help to quantify the leakage and guide data owners to choose appropriate parameters. We adopt a novel partitioning approach to construct an efficient FSE scheme (named PFSE) for deterministically encrypted databases.
Challenge 1: Defining security for an FSE scheme in the absence of a clear metric. To overcome this, we choose the inference attack as a suitable measure. To model the adversary's behavior in such attacks, we compare the adversary's advantage on a baseline dataset encrypted by deterministic encryption with its advantage on a dataset processed by the FSE scheme. This comparison helps us understand the adversary's advantage and may serve as a paradigm in the design of FSE schemes for encrypted databases.
Challenge 2: Improving the performance of FSE schemes. Previous schemes either incur large overhead or introduce false positives. To address this, our scheme adopts a partitioning strategy that divides the message set into several partitions and smooths each partition independently. The partitioning strategy improves performance by avoiding the overhead incurred when the dataset is considered as a whole. Additionally, our scheme allows for flexible partitioning using a user-defined function, which further optimizes the performance.
Challenge 3: Defining the proper security requirements for partitioning. We adopt the partitioning strategy to construct our PFSE scheme. It is crucial to prevent any information leakage to the adversary during the partitioning phase. We consider the adversary's limitation of not knowing the messages' order. Following the security definition proposed in (Pouliot et al. 2019), we extend it to PFSE and call it IND-PCUDA, which stands for Indistinguishability Under Partition-based Chosen Unordered Database Attack.

Our contributions
We summarize our contributions as follows.
• We propose new security definitions for FSE schemes, both theoretical and practical, facilitating the design of FSE schemes.
• We present a novel FSE framework that leverages the partitioning strategy. This scheme improves the database performance and is more efficient than previous ones.
• We evaluate the security and overhead of our scheme rigorously. Micro- and macro-benchmarks are conducted and compared with existing schemes (Lacharité and Paterson 2018), which are state-of-the-art FSE schemes for deterministically encrypted databases. Experimental results demonstrate that our scheme incurs some additional server-side storage but significantly improves the query performance while attaining security.

Related work
Deterministic encryption (Fuller et al. 2015) and order-preserving encryption (Popa et al. 2013; Kerschbaum 2015; Li et al. 2021) cannot hide the frequency information about the underlying dataset in the encrypted database. The leakage of frequency makes schemes vulnerable to inference attacks (Naveed et al. 2015; Bindschaedler et al. 2018; Grubbs et al. 2017; Durak et al. 2016). Protecting the frequency information with fully randomized encryption or structural encryption (Kamara and Moataz 2018) faces a huge overhead of storage and computation. The notion of frequency smoothing encryption (FSE) (Lacharité and Paterson 2018; Grubbs et al. 2020) can protect the frequency information with a small overhead. Lacharité et al. (Lacharité and Paterson 2018) propose an FSE scheme for deterministically encrypted databases based on homophonic encoding strategies. Based on the Poisson salt allocation, Pouliot et al. (Pouliot et al. 2019) propose weakly randomized encryption, which can be seen as a variant of FSE that allows efficient search over ciphertext. Lacharité et al. (Lacharité and Paterson 2018) point out that this scheme is vulnerable to a form of frequency analysis based on the knapsack problem, and that its bucketed countermeasure increases the rate of false positives. The salt sampling algorithm of that scheme is also not efficient enough. In addition, Grubbs et al. (Grubbs et al. 2020) propose an FSE scheme for the key-value store, which uses fake queries to protect the access pattern.

Notation
We start by defining some important symbols. Let $\mathcal{M}$ be the set of plaintext messages and $\mathcal{C}$ be the set of corresponding ciphertexts. The plaintext dataset is represented as the multiset $M = \{m_1, m_2, \ldots\}$ over $\mathcal{M}$ and the ciphertext dataset is represented as a multiset over $\mathcal{C}$. We use $f_M(m) = n_M(m)/|M|$ to represent the probability mass function of $M$ and $n_M(\cdot)$ to denote the count of each element in $M$. The histogram of a dataset $M$ is represented as an array $H_M$ in which the $i$-th position has the value $n_M(m_i)$ such that $H_M(i) \ge H_M(j)$ if $i < j$. The $i$-th plaintext in $H_M$ is represented by $m_i$. Lastly, we use $\Pr[e] \in [0, 1]$ to denote the probability of event $e$ in a given probability space. A deterministic symmetric encryption scheme is represented by $\mathrm{DET} = (\mathrm{Gen}, \mathrm{Enc}, \mathrm{Dec})$ with a secret key $sk$, and a keyed pseudorandom function is denoted by $F_{sk}$. A summary of these symbols can be found in Table 1.

Frequency-smoothing encryption
A partition-based FSE scheme, namely, PFSE, consists of the following algorithms.

Threat model
Our model assumes that a passive adversary somehow obtains a snapshot of the encrypted data, but it does not interfere with the normal functionality of the DBMS. The adversary also has an auxiliary dataset that may come from the Internet, accidental leakage of previous database records, or other illegal sources, and we assume that the adversary knows the exact distribution of the data.
Adversaries that exploit the query pattern to recover the database records are out of scope. The adversary's task is to recover, based on the auxiliary dataset, which plaintext corresponds to each ciphertext in the snapshot. The same threat model is also captured in the work of (Ceselli et al. 2005), who call the adversary $\mathrm{Freq} + \mathrm{DB}^k$, where $\mathrm{Freq}$ denotes the auxiliary information on the distribution of the plaintext data and $\mathrm{DB}^k$ denotes the encrypted database.

A new perspective about security of FSE
Before we introduce our frequency-smoothing encryption (FSE) scheme, it's important to establish a standard for security. Given the potential for inference attacks based on snapshots of the encrypted database and the limitation that the adversary cannot know the order in which the database was created, we will outline both a theoretical and a practical security definition. To ensure that there is no information leakage from the partitioning technique, the theoretical definition adheres to traditional security principles in cryptography (i.e., security games and indistinguishability), while the practical definition takes into account an attacker utilizing frequency information from the dataset via an inference attack.

Practical security definition
To properly assess the security of FSE, we must first understand the power and tactics of potential attackers, as well as what it means for the scheme to be considered "broken". FSE was designed to resist inference attacks. Therefore, it is critical to evaluate the security of FSE from the perspective of statistical inference. FSE schemes aim to minimize or limit the success rate of the attacker. However, we have observed that simply bounding the absolute rate of recovered plaintexts can be misleading. This is because the distribution from which plaintexts are drawn can significantly impact the success rate. For example, consider a dataset with only two possible values, where 99% of the records are labeled ALIVE and only 1% are labeled DIED. Even a fully randomized encryption scheme would not prevent an attacker from guessing most plaintexts correctly, since the attacker could simply assign all ciphertexts to ALIVE and achieve an accuracy of at least 99%. Thus, it would be unfair to attribute the failure of FSE robustness against inference attacks solely to the scheme.
Therefore, we propose a relative approach to defining security, as we are more interested in the extent to which FSE reduces an attacker's success rate on a sanitized dataset. This is in line with the core purpose of FSE, which is to defend against inference attacks. To formalize this approach, we introduce the inference experiment $E_{\mathrm{Inf},\mathcal{A},S}(M)$, which generalizes the inference attack and captures the adversary's behavior. In the following section, we further discuss this approach and its limitations.
Inference Experiment $E_{\mathrm{Inf},\mathcal{A},S}(M)$: For a given dataset $M$, a (possibly) randomized encryption scheme $S : \mathcal{M} \to \mathcal{C}^*$, and a P.P.T. adversary $\mathcal{A}$, the inference experiment proceeds as follows.
1. The adversary fixes some distribution $D$ and randomly samples $n$ records from $D$. Denote this dataset by $M$, with a support $\mathrm{Supp}(M)$ of size $k$, where the frequency of each message $m$ is given by $f_M(m)$.
2. $\mathcal{A}$ sends the message dataset $M$ to the challenger.
Following the inference experiment, we model the algorithm Inf that $\mathcal{A}$ utilizes. We call the adversary optimal if it aims to solve the corresponding maximization problem or, equivalently by the fundamental bridge (the expectation of an indicator variable equals the probability of its event), a problem in which the expectation is rewritten in terms of probabilities. Here, without loss of generality, we assume the attacker accurately divides the ciphertext set (i.e., all possible ciphertexts) $\mathcal{C}$ into $k$ non-interleaving parts. That is, $\mathcal{C} = \bigcup_{i=1}^{k} C_i$, where $C_i$ is the $i$-th partition and $C_i \cap C_j = \emptyset$, $\forall i \neq j$. Denote the weight of each set $C_i$ as $g_C(C_i)$. The adversary's guessing strategy follows from the rearrangement inequality: for the DET scheme, the expected number of correct guesses is maximized when ciphertexts and plaintexts are matched in the same order of frequency.
Definition 1 (Security Against Inference Attacks) For a given P.P.T. adversary $\mathcal{A}$, let DET denote the deterministic encryption scheme, $S$ be a randomized encryption scheme, and $\kappa$ be the security parameter. Define the advantage of $\mathcal{A}$ as in Eq. 1. We say $S$ is $\delta$-secure if, for the frequency-smoothing budget $\delta \in (0, 1)$, there exists a negligible function $\mathrm{negl}(\kappa)$ such that the advantage of $\mathcal{A}$ is bounded in terms of $\delta$ and $\mathrm{negl}(\kappa)$.
Intuitively speaking, a smaller $\delta$ results in a larger overhead for the FSE scheme while a larger $\delta$ results in a smaller overhead. The data owner must carefully choose $\delta$ to achieve a balance between usability and security. Although the trivial guessing attack that assigns all ciphertexts to the most probable plaintext sometimes performs better than a statistically optimal adversary, we believe that such cases are outliers and do not apply to all datasets. We believe an adversary will always try to be statistically optimal.
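The relative view of security can be illustrated with a small sketch. It is not the inference experiment or Definition 1 itself: the rank-matching attacker, the helper name `recovery_rate`, the toy datasets, and the use of a plain ratio of recovery rates are all assumptions made for illustration only.

```python
from collections import Counter

def recovery_rate(records, aux):
    """Fraction of records recovered by a rank-matching attacker.

    records: list of (plaintext, ciphertext) pairs; the attacker sees only
             the ciphertexts and their frequencies.
    aux:     auxiliary plaintext samples known to the attacker.
    """
    # Rank ciphertexts and auxiliary plaintexts by descending frequency.
    cipher_rank = [c for c, _ in Counter(c for _, c in records).most_common()]
    plain_rank = [m for m, _ in Counter(aux).most_common()]
    fallback = plain_rank[0]  # surplus ciphertexts are guessed as the top plaintext
    guess = {
        c: plain_rank[r] if r < len(plain_rank) else fallback
        for r, c in enumerate(cipher_rank)
    }
    return sum(guess[c] == m for m, c in records) / len(records)

# Toy data (hypothetical): three plaintexts with counts 6, 4 and 2.
plain = ["a"] * 6 + ["b"] * 4 + ["c"] * 2
# Deterministic baseline: one ciphertext per plaintext.
det = [(m, "E(" + m + ")") for m in plain]
# Smoothed variant: every ciphertext occurs exactly twice.
smoothed = (
    [("a", "a0"), ("a", "a1"), ("a", "a2")] * 2
    + [("b", "b0"), ("b", "b1")] * 2
    + [("c", "c0")] * 2
)

det_rate = recovery_rate(det, plain)
fse_rate = recovery_rate(smoothed, plain)
relative = fse_rate / det_rate  # illustrative relative measure, not Eq. 1
```

On the deterministic baseline the attacker recovers every record, while on the flat ciphertext histogram its ranking carries no signal, so the ratio drops well below one. This is the kind of comparison the relative definition is meant to capture.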

Extending to partition-based FSE schemes IND-PCUDA security
For partition-based FSE schemes, there is another security constraint: the partitioning phase must not leak extra information to the adversary. We aim to capture this constraint through traditional cryptographic indistinguishability. Unfortunately, to allow equality queries to be executed efficiently, our scheme is by no means resistant to chosen-plaintext attack (CPA) in the conventional cryptographic sense. We therefore carefully devise a weaker definition modeled after the standard Indistinguishability under Chosen-Plaintext Attack (IND-CPA). Pouliot et al. proposed a security definition called Indistinguishability under Chosen Unordered Database Attack (IND-CUDA) (Pouliot et al. 2019), where the adversary that utilizes the snapshot cannot know in which order the records were added to the database. We aim to extend this security definition to partition-based FSE schemes. Let $\theta$ be the set of public parameters (i.e., the adversary $\mathcal{A}$ can access it) of the partition function $f_\theta$ used in the PFSE scheme. We define the security game between a challenger and $\mathcal{A}$ with a security parameter $\kappa$ as follows.
Game 1. Initially, $\mathcal{A}$ fixes some plaintext space $\mathcal{M}$ and generates a pair of message datasets $M_0, M_1$ defined over $\mathcal{M}$. Let $\pi(A)$ denote the permutation set for a given set $A$. The generation proceeds as follows.
(a) Fix the total number of messages as $n$ and the partition number as $k$.
PFSE is said to be IND-PCUDA secure in the presence of a P.P.T. adversary $\mathcal{A}$ if the following holds.
where the probability is taken over the random coin tosses of $\mathcal{A}$, and $\kappa$ is the security parameter.
The above security game and definition are pessimistic: we assume the P.P.T. adversary $\mathcal{A}$ has full knowledge of the original dataset $M$. This is also rather reasonable because, in the real world, such adversaries are the strongest. Another point one should note is that the theoretical security game requires that the number of messages and that of categories in each partition are fixed, but does not pose any limitation on how the multinomial distribution is crafted. The scheme is secure as long as the adversary modeled in the security game has a negligible probability of winning the game.
For the part of practical security, an appealing property of Definition 1 is that it allows for easy extension to PFSEs. In a similar fashion, by linearity of expectation, we define the partition-based adversary's advantage, assuming the number of partitions is $k$.

Partitioning
In this section, we describe the key technique that is used in our FSE scheme, the partitioning strategy. An important insight into why partitioning is a suitable candidate for frequency smoothing is that it groups together messages with similar frequency information. In the real world, there exist numerous datasets that are unevenly distributed, and some may have distorted frequency information. If we regard the dataset as a whole and want to hide the frequency information of each message, we may need to do some alignment: messages with low frequencies need to be padded to match those with high frequencies. The cost of smoothing, therefore, will grow larger.
In our scheme, we aim to smooth the message dataset by the partitioning strategy. In particular, the data owner may apply a partition function $f_\theta$ with a parameter set $\theta$ to the whole dataset and divide the histogram of the original dataset into several partitions according to $f_\theta$. $\theta$ can be chosen so as to best fit the distribution of the dataset. Specifically, we partition the dataset $M$ into partitions $G_i$, where $i$ indicates the index of each partition.
Splitting the message. One should note that partitioning is not always feasible. Consider some partition $G_i$ and a message $m'$ next to $G_i$. If $m'$ is too frequent to fit into $G_i$ in full, then $m'$ cannot be placed into $G_i$, and partitioning falls short of the desired property, as shown in Eq. 11. To overcome this challenge, we split the message $m'$ if it resides at the intersecting point of two partitions. Figure 1 illustrates the partitioning algorithm.
The algorithm PFSE.partition in Fig. 2 shows the detailed steps of partitioning, which is quite straightforward:
• Step 1: Compute the histogram (sorted by frequency in descending order) of the given dataset $M$.
• Step 2: Compute the value of $f_\theta(i)$ for each index $i$ and try to group together the next message. If the last message $m'$ cannot fit into the current partition $G_i$, then split $m'$ into two parts according to a proportion $\delta$. The first part is $\delta \cdot n_M(m')$, and the second part, $(1 - \delta) \cdot n_M(m')$, is added back to the histogram in a sorted manner according to the frequency, which is done by the utility function addSorted.
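The two steps above can be sketched as a greedy pass over the histogram. This is a minimal sketch rather than PFSE.partition itself: the rule that links $f_\theta(i)$ to the amount a partition may hold is not reproduced here, so the `capacity` callable is a hypothetical stand-in that returns how many records partition $i$ may take, and the heap plays the role of addSorted.

```python
import heapq
from collections import Counter

def partition(dataset, capacity):
    """Greedy partitioning of the frequency histogram.

    capacity(i) is a hypothetical stand-in for the partition constraint:
    the number of records that partition i (1-based) may hold.
    Returns a list of partitions, each a list of (message, count) pairs.
    """
    # Step 1: histogram as a max-heap on count (most frequent message first).
    heap = [(-count, m) for m, count in Counter(dataset).items()]
    heapq.heapify(heap)

    partitions = []
    # Step 2: fill partitions one by one, splitting a message at a boundary.
    while heap:
        room = max(1, capacity(len(partitions) + 1))
        group = []
        while heap and room > 0:
            neg_count, m = heapq.heappop(heap)
            count = -neg_count
            if count > room:
                # Split m: keep the share that fits and push the remainder
                # back so that it keeps its sorted position (addSorted).
                heapq.heappush(heap, (room - count, m))
                count = room
            group.append((m, count))
            room -= count
        partitions.append(group)
    return partitions

# Toy histogram (hypothetical): counts 9, 6, 3, 2 and room for 10 records each.
data = ["a"] * 9 + ["b"] * 6 + ["c"] * 3 + ["d"] * 2
parts = partition(data, lambda i: 10)
```

In this toy run, message `b` sits on the boundary of the first partition, so one of its six occurrences stays there and the remaining five re-enter the histogram. In the terms used above, the split proportion is the share that fits divided by the message's count.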

Transforming and smoothing the dataset
Transformation of the original dataset in partitions can be thought of as "duplicating" the messages according to the parameter $k_i$ and padding dummy records to each partition. An example in which the partition number $k = 3$ is depicted in Fig. 3. Formally, let $n_i$ denote the number of distinct messages (including dummy records), $r_i$ denote the number of distinct real messages and $d_i$ denote the number of distinct dummy records. We are interested in the following equation.
Note that the size of the ciphertext set for each message $m$ is given by $k_i \cdot n_M(m)$, so the above equation can be rewritten accordingly. In our scheme, for each partition $G_i$, $k_i = f_\theta(i)/k$, where $k$ is the total number of partitions. In addition, the constraint $d_i \ge 0$ should be satisfied so that the scheme is always valid for every partition $G_i$. To satisfy this constraint, we set $n_i$ as follows.
where $t = \sum_{m \in G_i} f_M^2(m)$ and $a = \sum_{m \in G_i} n_M^2(m)$.
Dummy records. To fulfill the security requirement of achieving $\delta$-security, merely replicating messages from the original dataset is insufficient, as an adversary could still derive significant information about the underlying plaintexts. In order to enhance the obfuscation of the dataset distribution, we introduce additional dummy records into each partition, after determining the number of replicas required for each message. Our proposed scheme achieves this by adding random bit strings that are uniformly drawn from the entire message space and encrypted as dummy records, thereby further obscuring the true dataset distribution.

Fig. 1 The process of partitioning the histogram obtained from the original dataset $M$. The message $m_6$ cannot perfectly fit into the second partition, so we split $m_6$ into two parts according to the portion $\delta$ and insert the second part back into the histogram

Fig. 2 The building blocks for the partition-based frequency-smoothing encryption scheme

Local table. On the client side, our privacy-preserving scheme requires the creation of an additional table, denoted as $T$, which maps each message $m$ to a set of triplets $\langle i, j, k \rangle$. The triplet notation consists of the index $i$ of the partition that message $m$ belongs to, $j$ which represents the size of the set of ciphertexts for message $m$, and $k$ which indicates the count of each ciphertext in the set. To illustrate, consider the example shown in Fig. 3, where $\langle i, 3, 9 \rangle \in T[m_1]$, $\langle i, 2, 6 \rangle \in T[m_2]$, and $\langle i, 1, 3 \rangle \in T[m_3]$. It is worth noting that each value in the hashmap $T$ is a set because messages may be split into multiple portions, so that a message can reside in several partitions. Additionally, the number of records in the local table $T$ is precisely the number of distinct messages in the original dataset $M$, such that $|T| = N$. To ensure the privacy requirement of our scheme, it is necessary to store this table on the client side.
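The numbers of the Fig. 3 example can be reproduced with a short sketch. It is only an illustration under stated assumptions: the helper name `smooth_partition` is hypothetical, the scaling factor is taken as $k_i = 1/3$ (three partitions with $f_\theta(i) = 1$), the third triplet entry follows the values quoted for Fig. 3, and dummy-record padding is left out.

```python
import math
from fractions import Fraction

def smooth_partition(index, counts, k_i):
    """Build local-table entries and the ciphertext histogram of one partition.

    index:  partition index i.
    counts: dict mapping each message in the partition to n_M(m).
    k_i:    scaling factor of the partition (a Fraction to avoid rounding noise).
    """
    table = {}              # message -> list of (i, j, k) triplets
    ciphertext_counts = {}  # (message, replica number) -> occurrences
    for m, n in counts.items():
        size = max(1, math.ceil(k_i * n))  # distinct ciphertexts for m
        table[m] = [(index, size, n)]
        # Spread the n occurrences of m as evenly as possible over the replicas.
        base, extra = divmod(n, size)
        for j in range(size):
            ciphertext_counts[(m, j)] = base + (1 if j < extra else 0)
    return table, ciphertext_counts

table, flat = smooth_partition(1, {"m1": 9, "m2": 6, "m3": 3}, Fraction(1, 3))
```

With these inputs, `m1`, `m2` and `m3` receive three, two and one distinct ciphertexts respectively, and every ciphertext occurs three times, so the histogram the server sees for this partition is flat.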
The main procedure is presented in PFSE.transform (see Fig. 2), which consists of the following steps.

Searching
When searching for a specific keyword in a database using the PFSE scheme, the first step is to generate all search tokens by encrypting the keyword $w$, followed by sending an OR predicate to the database system. The searching phase is detailed in Algorithm PFSE.search (refer to Fig. 2). It's worth noting that the algorithm needs to cross-reference the local table $T$ to verify the validity of the given keyword $w$. Once the encrypted record set $R$ is obtained, the PFSE.decrypt algorithm (see Fig. 4) filters out dummy records. In the case of a relational database system, the PFSE scheme sends a query whose predicate is the disjunction (OR) of all search tokens.
Remark It is also possible for the PFSE scheme to be made applicable to searchable encryption. For example, for each message $m$, the search tokens can be obtained by $t \leftarrow F_k(m \| i \| j)$, $\forall \langle i, j, k \rangle \in T[m]$, where $F_k$ is a keyed pseudorandom function (e.g., HMAC). The ciphertext is obtained by any IND-CPA symmetric encryption scheme such as AES-GCM with a secure random nonce.
Adopting PFSE to construct a searchable encryption scheme can make the server-side index more efficient.

Fig. 3 The transforming example. We consider three messages $m_1$, $m_2$, $m_3$, and their respective occurrences in the original dataset: $n_M(m_1) = 9$, $n_M(m_2) = 6$, and $n_M(m_3) = 3$. To determine the number of replicas for each message, the value $k_i$ is calculated. Thus, the total number of replicas $r_i$ is 9. The number of dummy records, $d_1$, $d_2$, and $d_3$, is then determined as $n_i - r_i$

Fig. 4 Decryption algorithm
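The token derivation in the Remark can be sketched with HMAC-SHA256 from the Python standard library as the keyed pseudorandom function. The function name `search_tokens`, the `|` separator used to make the concatenation unambiguous, and the sample key are assumptions for illustration; the sketch follows the formula literally and emits one token per triplet in $T[m]$.

```python
import hashlib
import hmac

def search_tokens(key, message, triplets):
    """Derive one search token F_key(m || i || j) per triplet (i, j, k) in T[m].

    key:      secret key of the pseudorandom function (bytes).
    message:  the keyword or message m (bytes).
    triplets: the entries stored for m in the local table T.
    """
    tokens = []
    for i, j, _count in triplets:
        # Hypothetical encoding: fields joined with '|' so that, for example,
        # (1, 23) and (12, 3) do not produce the same PRF input.
        data = message + b"|" + str(i).encode() + b"|" + str(j).encode()
        tokens.append(hmac.new(key, data, hashlib.sha256).digest())
    return tokens

# A message that was split across two partitions yields two tokens.
key = b"\x01" * 32  # placeholder key, for illustration only
tokens = search_tokens(key, b"m1", [(1, 3, 9), (2, 1, 3)])
```

The client would send these tokens as the OR predicate of a single query. Since HMAC is deterministic, repeating the derivation for the same keyword yields the same tokens, which is what lets the server-side index be reused.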

Security analysis
We prove in this subsection that the aforementioned FSE scheme is both theoretically and practically secure.
Informally speaking, for the theoretical part, the security of FSE derives from the use of secure cryptographic algorithms, i.e., a symmetric encryption algorithm that is strong enough to resist cryptanalysis, and from the fact that any chosen distribution for each partition is uniform after smoothing. For the practical part, we carefully choose the parameters $k_i$ and $n_i$ so that the advantage of the inference attack gained by the adversary can always be kept below a given threshold $\delta\beta$.
We now present the main theorem that establishes the theoretical security of the PFSE scheme as follows.
Theorem 1 The above PFSE scheme that uses a real function $f_\theta : \mathbb{N} \to (0, 1]$ parameterized by $\theta$ is IND-PCUDA secure as long as the underlying deterministic encryption $\mathrm{DET} = (\mathrm{Enc}, \mathrm{Dec}, \mathrm{Gen})$ is secure against known-ciphertext attack (KCA) with regard to a security parameter $\kappa$.

Proof
The idea of the proof follows the standard reduction method used in cryptography. We first reduce the proof of security over the entire message dataset $M$ to a proof of security over each partition $G_i$, which can be further reduced to the security of the underlying symmetric encryption scheme. To begin with, let $k$ denote the number of partitions and $\mathcal{A}$ denote a P.P.T. adversary in the IND-PCUDA game. If $\mathcal{A}$ succeeds in the security game, there exists at least one index $i \in [k]$ such that $\mathcal{A}$ is able to distinguish between $G_i^0$ and $G_i^1$ with different orders. Denote this event by $G_i^{\pi,\mathcal{A}}$. Thus, with the union bound, one obtains Eq. 15.
Consider an arbitrary partition $G_i$. Note that $k_i = f_\theta(i)/k$, so the count of each replica for every message within $G_i$ is always $k/f_\theta(i)$; thus $\mathcal{A}$ is unable to distinguish between $G_i^0$ and $G_i^1$ by replicas. Now consider the count of each dummy record. Since our PFSE scheme enforces that it should be the same as the count of each replica, any two permutations of the current partition $G_i$ will be smoothed into the same dataset, in which the occurrence of each element is exactly the same. Thus, we obtain Eq. 16, assuming that DET is secure against known-ciphertext attack, where $\mathrm{DET}_{\mathcal{A},\kappa}$ denotes the event that DET is broken by $\mathcal{A}$.
Substituting Eq. 16 into Eq. 15, we derive the following.
which completes the proof.
Next, we turn to the main theorem of the practical security part.Before proving the main theorem, we introduce another lemma that facilitates the proof.
Lemma 1 Our scheme is $\delta$-secure if the following holds, where $n_i$ is the number of messages in $G_i$ and $n$ is the number of messages.
Proof For each partition $G$, let $N$ denote the number of messages in $G$, $k_i$ denote the scaling factor, $n_M(m)$ denote the count of each message $m$, and $C_m$ denote the ciphertext set for $m$. With the union bound and the fact that the eventual distribution of ciphertexts is uniform, for every message

we immediately obtain
Hence, the claim follows. This completes the proof.
Theorem 2 Our FSE scheme is $\delta$-secure against the inference attack as per Definition 1.
Proof We know that, $\forall i \in [k]$, the scheme enforces the choices of $k_i$ and $n_i$ described above. The theorem follows from applying Lemma 1.

Performance analysis
During the partitioning phase, the algorithm groups messages from the original dataset $M$ into partitions, iterating over the histogram $H_M$. The time complexity of this phase is given by $O(N + n + c)$, where the small constant $c$ arises from the need to split messages on the intersecting point of two partitions. This constant is negligible and thus ignored in the subsequent analysis.
During the transforming and smoothing phase, the algorithm duplicates messages and pads them with dummy records, iterating over the partition set obtained in the previous phase. The time cost of duplicating messages is $O(N)$, while that of padding dummy records is given by $O(\frac{nk}{\delta\beta})$. Smoothing incurs a time cost of $O(\frac{nN}{k})$, leading to an overall time complexity of $O(N + \frac{nk}{\delta\beta} + \frac{nN}{k})$ for this phase.
Finally, the search phase's time complexity depends only on the size of its replicas, with the cost of searching for a single message $m$ given by $O(\frac{n_M(m)}{k})$. Regarding storage, the client's storage cost is $O(N)$, with a local table $T$ stored on its side. On the server side, storage primarily comprises dummy records and ciphertexts of messages, with size $n_i$ for all $i \in [k]$. The server's storage (excluding the index) can be calculated as $k \sum_{i \in [k]} n_i$, which yields $O(nk)$.

Evaluation
We evaluate our scheme by comparing its performance with LPFSE schemes (including LPFSE-IBHE and LPFSE-BHE) which are state-of-the-art frequency-smoothing encryption schemes.We evaluate the running time of initialization and query operation, and the storage on the client and server.Furthermore, we measure the success rate of recovered messages under MLE attack (Lacharité and Paterson 2018) which is an efficient inference attack.We set deterministic encryption (DET) and completely random encryption (RND) as two baselines for fairness.

Implementation details
We implement the Partition-based FSE (PFSE) and LPFSE schemes (including LPFSE-IBHE and LPFSE-BHE schemes) (Lacharité and Paterson 2018) in Rust for efficiency and memory safety. The code is open-sourced on GitHub (see https://github.com/hiroki-chen/PFSE-Prototype). Our experiments utilize MongoDB 6.0.3 running on a PC with dual Intel(R) Xeon(R) Gold 6248R CPUs @ 3.00GHz (48 cores, 96 threads in total), 128 GB RAM, a 1 Gbps network, and Ubuntu 20.04 LTS as the operating system. The AES-GCM (256 bits) algorithm with a fixed nonce (12 zero bytes) is used for deterministic encryption, and a secure random nonce is used for fully randomized encryption. The auxiliary dataset is the same as the plaintext, which is the strongest scenario for the attacker.

Experiment setting
We create an index on the encrypted attribute for each experimental suite and deploy the client and server on the same machine to reduce the impacts of network latency.To measure the query benchmark, we randomly draw 100 keys from the dataset independently and send queries to the database to simulate a real-world scenario.All experiments run 10 times in single-thread mode, and results are averaged.
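The query measurement described above can be sketched as a small timing harness. This is not the benchmark code of the prototype: `average_query_time` and the `run_query` callable are hypothetical names, drawing keys with replacement is one reading of "independently", and the real experiments issue the queries against MongoDB.

```python
import random
import time

def average_query_time(keys, run_query, samples=100, runs=10, seed=0):
    """Average wall-clock time (seconds) of one batch of `samples` queries.

    keys:      candidate query keys taken from the dataset.
    run_query: hypothetical callable that issues one query for a given key.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        # Independent draws with replacement from the dataset's keys.
        batch = rng.choices(keys, k=samples)
        start = time.perf_counter()
        for key in batch:
            run_query(key)
        totals.append(time.perf_counter() - start)
    # Results are averaged over all runs.
    return sum(totals) / len(totals)

# Usage with a stand-in query function that only records the requested keys.
issued = []
avg = average_query_time(["k1", "k2", "k3"], issued.append)
```

Fixing the seed keeps the sampled keys identical across schemes, which makes the comparison between DET, RND, LPFSE and PFSE depend on the scheme rather than on the sampled workload.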
Datasets. Our scheme works on a variety of data types. We adopt two datasets in our experiments. One is the Supermarket Dataset for Predictive Marketing (SDPM) (Hunter 2023), which comprises a total of 2,019,501 records that capture E-Commerce customer behavior. The other is the American Community Survey (ACS) dataset (Bureau 2015), which consists of 1,618,489 valid records. To investigate the impact of different attribute domains and distributions on the performance of our FSE schemes, we select four columns from each dataset. For a comprehensive overview of the metadata associated with these columns, refer to Table 2.
Settings of the LPFSE schemes (Lacharité and Paterson 2018). The optimal choice of the security parameter $\varepsilon$ for homophonic encoding strategies is a topic that has not been satisfactorily addressed in (Lacharité and Paterson 2018). To ensure a fair comparison, our experiment takes two steps. Firstly, we adopt the typical choices of $r_{\min}$ as outlined in (Lacharité and Paterson 2018), though we use a different dataset as their dataset is currently unavailable. Secondly, we apply two variants (Section 4.2 in (Lacharité and Paterson 2018)) to optimize performance. Here, $r_{\min}$ refers to the minimum encoding length for each attribute in the LPFSE-IBHE scheme (Lacharité and Paterson 2018), and we set a similar $r$ for attributes similar to those evaluated in their work. It should be noted that our dataset size is greater than that of the original study, and to address this difference, we increase the encoding length $r$ by 2. The choices of $r$ for each scheme are also listed in Table 2. For the LPFSE-BHE scheme (Lacharité and Paterson 2018), we calculate $\varepsilon$ by fixing $r$ and setting the same value for all attributes.

Performance evaluation

Microbenchmarks
In this section, we analyze the performance of the PFSE scheme under various settings. We evaluate four key metrics: server storage, client storage, initialization time, and query time. The data column we use is order_number (refer to Table 2); for each dataset size n, we randomly shuffle the original dataset and select n records from it.
Performance with different $f_\theta$. Without loss of generality, we fix the partitioning function $f_\theta(x)$ as $f(x) = \lambda e^{-\lambda x}$ and evaluate the effect of different choices of λ to simulate various partitioning functions. We test how $\lambda \in \{0.25, 0.50, 0.75, 1.0\}$ affects the performance for dataset sizes n from $10^3$ to $10^6$. We set the security parameter δ to 0.1 in this experiment.
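To give a feel for how the parameter changes the shape of the partitioning function, the sketch below evaluates $f(x) = \lambda e^{-\lambda x}$ (writing λ for the rate parameter) on the first few positive integers for the four tested values. It only tabulates the function; how PFSE turns these values into partitions is not reproduced here. The helper names `partition_weight` and `decay_ratio` are illustrative and do not come from the paper.

```python
import math


def partition_weight(x: int, lam: float) -> float:
    """Exponential partitioning function f(x) = lam * exp(-lam * x)."""
    return lam * math.exp(-lam * x)


def decay_ratio(lam: float) -> float:
    """Ratio f(x) / f(x + 1); it equals exp(lam) for every x."""
    return partition_weight(1, lam) / partition_weight(2, lam)


if __name__ == "__main__":
    for lam in (0.25, 0.50, 0.75, 1.0):
        values = [round(partition_weight(i, lam), 4) for i in range(1, 6)]
        print(f"lambda={lam}: f(1..5)={values}, ratio={decay_ratio(lam):.3f}")
```

Because the ratio between consecutive values is $e^{\lambda}$, a smaller λ yields a flatter curve, while a larger λ concentrates the weight on the first few indices.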
We present a detailed analysis of the experimental results obtained from our PFSE scheme, as depicted in Fig. 5. We highlight the scheme's initialization performance, comparing it with the insecure baseline (DET) and RND. The experiment shows that the initialization time [...] (Fig. 5b). This outcome is attributed to the decreased number of partitions k as λ, the parameter of the partitioning function, increases, which necessitates more replicas for each message to maintain privacy (recall that the number of replicas for each message is given by $f_\theta(i)\, n_M(m) / k$). In terms of storage overhead, our scheme has a minimal impact on the client side: the client's storage requirement is only O(N), which results in a negligible increase in storage. The server side incurs the larger storage overhead. Compared with RND, the blowup is 100–1000× when the number of records is $10^6$, but the effect of λ on server-side storage is relatively small.
Consistent with the mathematical analysis, reducing δ increases the time cost of database initialization and query processing (refer to Fig. 6a, b). A smaller δ leads to increased query time because of the larger server storage and the longer query time required by the database backend. In the worst case (i.e., δ = 0.10), the query is only ∼1.03× slower than with DET and far faster than with RND. Furthermore, unlike the effect of $f_\theta$ on storage overhead, a smaller δ entails greater server-side storage.

Comparison with LPFSE schemes
We compare the query performance of PFSE, LPFSE-IBHE, and LPFSE-BHE across various attributes. For the PFSE scheme, we use a fixed partitioning function $f(x) = 0.5e^{-0.5x}$, and the choices of δ are provided in Table 2. As shown in Fig. 7, the PFSE scheme outperforms both LPFSE-IBHE and LPFSE-BHE on most attributes in the dataset, except for those with small domains such as reordered and SEX. The PFSE scheme incurs only a small query overhead, and its performance approximates that of the DET scheme. An interesting observation from our experiments is that the LPFSE-IBHE scheme experiences significant performance degradation when the dataset has a skewed distribution, with only a few messages appearing frequently while most appear infrequently. This is due to the security policy of the scheme, which enforces a lower bound on the encoding length of the form $r \geq \log_2(\cdot)$. In fact, on most attributes, the LPFSE-IBHE scheme is even slower than RND due to the large search space incurred by the encoding strategy. Thus, on attributes with a skewed distribution such as add_to_cart_order, our PFSE scheme achieves a significant speedup of ∼510× over the LPFSE-IBHE scheme.

Security evaluation against inference attack
We concentrate on assessing the security of FSE schemes against a state-of-the-art inference attack (Lacharité and Paterson 2018). The security is evaluated by the weighted average rate of messages that the attacker is able to recover. Formally, the recovery rate α is computed from $\#R_m$, the number of correct guesses for m, and $C_m$, the ciphertext set for m.
First, we evaluate how the partitioning function $f_\theta$ and the security parameter δ affect the robustness of PFSE against inference attacks. Then, we compare the security of the PFSE scheme with that of LPFSE-IBHE and LPFSE-BHE.
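The recovery rate can be sketched in code. Since the displayed formula is not reproduced here, the sketch assumes one natural reading of "weighted average": each per-message rate $\#R_m / |C_m|$ is weighted by $|C_m|$, which gives $\alpha = \sum_m \#R_m / \sum_m |C_m|$. This weighting, and the names `recovery_rate`, `correct_guesses`, and `ciphertext_sets`, are assumptions for illustration and should not be read as the paper's exact definition.

```python
from typing import Dict, Hashable, Set


def recovery_rate(
    correct_guesses: Dict[Hashable, int],         # m -> #R_m
    ciphertext_sets: Dict[Hashable, Set[bytes]],  # m -> C_m
) -> float:
    """Weighted average recovery rate under the assumed |C_m| weighting."""
    total_ciphertexts = sum(len(c_m) for c_m in ciphertext_sets.values())
    if total_ciphertexts == 0:
        return 0.0
    total_correct = 0
    for m, c_m in ciphertext_sets.items():
        r_m = correct_guesses.get(m, 0)
        # The attacker cannot be right about more ciphertexts than exist for m.
        if r_m > len(c_m):
            raise ValueError(f"#R_m exceeds |C_m| for message {m!r}")
        total_correct += r_m
    return total_correct / total_ciphertexts
```

Under a different weighting, for example by the plaintext frequency of m, only the final aggregation would change; the per-message counts stay the same.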
Security with different $f_\theta$. As in the performance evaluation, we set the partitioning function to $f(x) = \lambda e^{-\lambda x}$ with $\lambda \in \{0.25, 0.50, 0.75, 1.00\}$ and use δ = 0.5 to better demonstrate the effect of λ, but the dataset size is n = 2,019,501 (i.e., the whole dataset). We choose the order_hour_of_day and reordered columns for this evaluation.
In Fig. 8, we present the results of the adapted MLE attack (Lacharité and Paterson 2018) applied to the PFSE scheme with various partitioning function selections. We find that the PFSE scheme outperforms the RND scheme due to its strategy of padding dummy records into the encrypted dataset. This approach enlarges the search space of the ciphertexts and is a valuable strategy for thwarting snapshot adversaries. Furthermore, our experimental results suggest that larger values of the partitioning parameter λ (i.e., fewer partitions) lead to decreased security levels. This is because smaller values of λ create "smoother" partitions, which conceal the frequency information.
Security with different δ. We investigate the influence of δ by testing five values, δ ∈ {0.10, 0.15, 0.20, 0.25, 0.5}. Our goal is to understand how varying δ affects the attack results. As shown in Fig. 9, we find that, for δ ≤ 0.25, the adapted MLE attack (Lacharité and Paterson 2018) fails to acquire any significant information, due to the injection of dummy records. Even with a relatively high value (δ = 0.5), the scheme's resistance to the MLE attack is no worse than that of RND. These observations demonstrate that the PFSE scheme is capable of achieving high levels of security while incurring only a minor server storage overhead (recall the result in Fig. 6d).

Comparison with LPFSE schemes
For the comparison of attack results, we adopt the same settings for both PFSE and LPFSE schemes as above. As shown in Fig. 10, we find that, for all attributes, although both the LPFSE-IBHE and LPFSE-BHE schemes can attain the same level of security as the RND scheme, our PFSE scheme outperforms them by a significant margin in the presence of a statistically optimal attacker. Our scheme demonstrates remarkable effectiveness in preventing the MLE attack (Lacharité and Paterson 2018) from acquiring meaningful information about the underlying dataset, even for attributes with small domains, such as SEX or reordered, that may be either "smoothed" or skewed.

Conclusion
In this paper, we revisit the notion of Frequency-Smoothing Encryption (FSE). We find that FSE schemes lack rigorous security definitions, especially in the presence of inference attacks, and that current approaches are not efficient enough. We propose a novel FSE scheme based on a partitioning strategy, together with two security definitions from both theoretical and practical perspectives. We conduct thorough evaluations of the performance and security of the PFSE scheme and compare it with previous FSE schemes. Experimental results show that our scheme has a significant advantage over previous methods without weakening the robustness against inference attacks such as MLE (Lacharité and Paterson 2018). Our PFSE scheme is practical and secure for equality queries over encrypted databases.

3. The challenger chooses $b \xleftarrow{\$} \{1, \ldots, k\}$ according to $f_G(m_b)$, i.e., $\Pr[b = i] = f_M(m_i)$. The challenger then chooses the message $m_b \in \mathrm{Supp}(M)$. Then the challenger invokes S to process $m_b$ and obtains $C \leftarrow S(m_b)$. Finally, the challenger sends C to the adversary.
4. A runs the inference attack algorithm Inf against C, M, and outputs a guess bit $b'$. The adversary A is said to win the experiment if b [...]

2. The challenger generates a random bit $b \xleftarrow{\$} \{0, 1\}$ and a secret key $sk \leftarrow \mathrm{PFSE.keygen}(1^{\kappa})$. Then she partitions $M_b$ into $G_b$, which is then transformed into $G'_b$.
3. Finally, the challenger smooths $G'_b$ using the FSE scheme and obtains C. Afterward, the challenger sends C back to the adversary A.
4. Upon receiving the messages from the challenger, A outputs a guess $b'$ for b.

The adversary A wins the security game if $b' = b$, and we denote this event by $\mathrm{Game}^{\mathrm{IND\text{-}PCUDA}}_{A, f_\theta}(\kappa) \Rightarrow 1$. Then, we are able to define the security based on the game. Formally, we call the security game Indistinguishability against Partition-based Chosen Unordered Database Attack (IND-PCUDA).

Definition 2 (IND-PCUDA secure) Let PFSE denote a partition-based frequency smoothing scheme parameterized by θ and a real function $f_\theta : \mathbb{N}^* \to (0, 1]$.

Fig. 8 MLE attack result under different λ

Table 1
Symbols and their meanings

Table 2
Metadata for each column