
Revisiting frequency-smoothing encryption: new security definitions and efficient construction

Abstract

Deterministic encryption (DET) allows for fast retrieval of encrypted information, but it would cause significant leakage of frequency information of the underlying data, which results in an array of inference attacks. Simply replacing DET with fully randomized encryption is often undesirable in the scenario of an encrypted database since it incurs a large overhead in query and storage. Frequency Smoothing Encryption (FSE) is a practical encryption scheme to protect frequency information. Current FSE constructions still fall short of efficiency and a reasonable security definition. We revisit FSE and propose two security definitions from both theoretical and practical perspectives. Furthermore, we adopt a novel partitioning strategy to construct a new FSE scheme to improve performance. Experimental results show that compared with others, our scheme achieves excellent query performance while attaining security against inference attacks.

Introduction

In the era of big data, encrypted databases (Poddar et al. 2019; Popa et al. 2011; Tu et al. 2013; Arasu et al. 2013; Antonopoulos et al. 2020; Zhu et al. 2021) serve as a vital component of the cloud computing infrastructure to protect users’ data privacy. In the construction of encrypted database systems, Property-Preserving Encryption (PPE) and Deterministic Encryption (DET) are often adopted to allow for efficient retrieval directly on encrypted data. These algorithms leak frequency information about the underlying datasets, which severely undermines the security of the encrypted databases. Inference attacks (Naveed et al. 2015; Grubbs et al. 2017; Cash et al. 2015; Bindschaedler et al. 2018) can recover the plaintext with the frequency information. Fully randomized encryption can completely eliminate such security risks. However, when applying it in the settings of database systems, the cost of security might outweigh its benefits as the query complexity is quite formidable. Frequency Smoothing Encryption (FSE) (Lacharité and Paterson 2018) aims to balance the tradeoff between security and efficiency. In contrast to conventional fully randomized encryption, FSE allows for repetition of ciphertexts for a specific plaintext, while subject to rigorous control on the number of repetitions by the scheme’s security parameter. Therefore, we highlight the significance of FSE in the development of secure and efficient database systems.

There are no proper security definitions for FSE. The inability of FSE schemes to achieve indistinguishability under chosen plaintext attack (IND-CPA) security also poses a significant challenge. The problem is how to quantify or constrain the leakage, which has not been satisfactorily addressed in prior FSE schemes. Lacharité et al. (Lacharité and Paterson 2018) proposed the homophonic encoding approach and modeled the adversary using the Kolmogorov-Smirnov test (Massey Jr 1951). This does not provide a clear understanding of the information leakage, and the authors only empirically evaluate it through the robustness against inference attacks (e.g., MLE attack (Lacharité and Paterson 2018)). The Poisson salt allocation scheme described in (Pouliot et al. 2019) does not offer any security metrics that measure the leakage, and it is vulnerable to a form of frequency analysis based on the knapsack problem. Grubbs et al. (Grubbs et al. 2020) use fake queries to protect the access pattern of the key-value store, but do not provide a proper security definition either.

The lack of proper security definitions for FSE also contributes to the poor performance of previous schemes. The time complexity for querying a plaintext using the homophonic encoding strategy depends on a security parameter. As the security parameter decreases, the number of unique ciphertexts for a given plaintext increases. Although the security parameter is adjustable, as previously mentioned, it provides very limited information about the scheme’s ability to restrict information leakage. In addition, the salt sampling stage and the queries in the Poisson salt allocation scheme can be relatively slow if the exponential distribution parameter is large. The data owner may therefore choose an inappropriate security or distribution parameter that causes a dramatic increase in query complexity.

Our work

We propose two proper security definitions that help to quantify the leakage and guide data owners to choose appropriate parameters. We adopt a novel partitioning approach to construct an efficient FSE scheme (named PFSE) for deterministically encrypted databases.

Challenge 1: Defining security for an FSE scheme with the absence of a clear metric. To overcome this, we choose inference attack as a suitable measure. To model the adversary’s behavior in such attacks, we can compare their advantage when using a baseline dataset encrypted by deterministic encryption to that of a dataset processed by the FSE scheme. This comparison helps us understand the adversary’s advantage and may serve as a paradigm in the design of FSE schemes for encrypted databases.

Challenge 2: Improve the performance of FSE schemes. Previous schemes either incur large overhead or introduce false positives. To address this, our scheme adopts a partitioning strategy that divides the message set into several partitions and smooths each partition independently. The partitioning strategy improves performance by avoiding the overhead when considering the dataset as a whole. Additionally, our scheme allows for flexible partitioning using a user-defined function, which further optimizes the performance.

Challenge 3: Define the proper security requirements for partitioning. We adopt the partitioning strategy to construct our PFSE scheme. It is crucial to prevent any information leakage to the adversary during the partitioning phase. We consider the adversary’s limitation of not knowing the messages’ order. Following the security definition proposed in (Pouliot et al. 2019), we extend it to PFSE and call it IND-PCUDA, which stands for Indistinguishability Under Partition-based Chosen Unordered Database Attack.

Our contributions

We summarize our contributions as follows.

  • We propose new security definitions for FSE schemes, both theoretically and practically, facilitating the design of FSE schemes.

  • We present a novel FSE framework that leverages the partitioning strategy. This scheme improves the database performance and is more efficient than previous ones.

  • We evaluate the security and overhead of our scheme rigorously. Micro- and macro-benchmarks are conducted and compared with existing schemes (Lacharité and Paterson 2018), which are state-of-the-art FSE schemes for deterministically encrypted databases. Experimental results demonstrate that our scheme incurs some additional server-side storage but significantly improves the query performance while attaining security.

Related work

Deterministic encryption (Fuller et al. 2015) and order-preserving encryption (Popa et al. 2013; Kerschbaum 2015; Li et al. 2021) cannot hide the frequency information about the underlying dataset in the encrypted database. The leakage of frequency makes schemes vulnerable to inference attacks (Naveed et al. 2015; Bindschaedler et al. 2018; Grubbs et al. 2017; Durak et al. 2016). Protecting the frequency information with fully randomized encryption or structural encryption (Kamara and Moataz 2018) faces a huge overhead of storage and computation.

The notion of frequency smoothing encryption (FSE) (Lacharité and Paterson 2018; Grubbs et al. 2020) can protect the frequency information with a small overhead. Lacharité et al. (Lacharité and Paterson 2018) propose an FSE scheme for deterministically encrypted databases based on homophonic encoding strategies. Based on the Poisson salt allocation, Pouliot et al. (Pouliot et al. 2019) propose weakly randomized encryption, which can be seen as a variant of FSE that allows efficient search over ciphertexts. Lacharité et al. (Lacharité and Paterson 2018) point out that this scheme is vulnerable to a form of frequency analysis based on the knapsack problem, and that its bucketed countermeasure increases the rate of false positives. The salt sampling algorithm of the scheme of Pouliot et al. is also not efficient enough. In addition, Grubbs et al. (Grubbs et al. 2020) propose an FSE scheme for the key-value store, which uses fake queries to protect the access pattern.

Preliminaries

Notation

We start by defining some important symbols. Let \({\mathcal {M}}\) be the set of plaintext messages and \({\mathcal {C}}\) be the set of corresponding ciphertexts. The plaintext dataset is represented as the multiset \(M = \{m_1, m_2, \ldots \}\) over \({\mathcal {M}}\) and the ciphertext dataset is represented as the multiset \(C = \{c_1, c_2, \ldots \}\) over \({\mathcal {C}}\). We use \(n_{M}(\cdot )\) to denote the count of each element in M. The support of a multiset M is given by \(\textsf{Supp}(M) = \{m \in {\mathcal {M}}: n_{M}(m) > 0\}\), and \(f_{M}(m) = \frac{n_{M}(m)}{|M|}\) represents the probability mass function of M. The histogram of a dataset M is represented as an array \(H_{M}\) in which the i-th position has the value \(n_{M}(m_{i})\) such that \(H_{M}(i) \ge H_{M}(j)\) if \(i < j\). The i-th plaintext in \(H_{M}\) is represented by \(m_{i}\). Lastly, we use \(\Pr [e] \in [0, 1]\) to denote the probability of event e in a given probability space \(\Omega\). A deterministic symmetric encryption scheme is represented by \(\textsf{DET} = (\textsf{Gen}, \textsf{Enc}, \textsf{Dec})\) with a secret key sk, and a keyed pseudorandom function is denoted by \({\mathcal {F}}_{sk}\). A summary of these symbols can be found in Table 1.
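The multiset notation above maps directly onto a counter. The following is a minimal Python sketch of \(n_{M}\), \(\textsf{Supp}(M)\), \(f_{M}\), and \(H_{M}\); the function names and the toy dataset are illustrative assumptions and do not come from the scheme itself.

```python
from collections import Counter

def support(dataset):
    """Supp(M): the set of messages that occur at least once in M."""
    return set(dataset)

def pmf(dataset):
    """f_M(m) = n_M(m) / |M| for every m in Supp(M)."""
    counts = Counter(dataset)  # n_M(.)
    total = len(dataset)       # |M|
    return {m: c / total for m, c in counts.items()}

def histogram(dataset):
    """H_M: (message, count) pairs sorted by count in descending order."""
    counts = Counter(dataset)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Toy dataset (hypothetical): 'a' occurs 3 times, 'b' twice, 'c' once.
M = ["a", "b", "a", "c", "a", "b"]
print(support(M))
print(pmf(M))
print(histogram(M))
```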

Table 1 Symbols and their meanings

Frequency-smoothing encryption

A partition-based FSE scheme, namely, PFSE, consists of the following algorithms.

  • \({\textsf{PFSE}.\textrm{keygen}(1^{\kappa }) \rightarrow sk:}\) The client generates a secret key sk given a security parameter \(\kappa\).

  • \({\textsf{PFSE}.\textrm{partition}(f_{\theta }, M) \rightarrow G:}\) Given a partition function \(f_{\theta }\) parameterized by \(\theta\), and plaintext message dataset M, this algorithm outputs the partition set G.

  • \({\textsf{PFSE}.\textrm{transform}(sk, f_{\theta }, \delta , n, G) \rightarrow {\mathcal {T}}, G':}\) Given the secret key sk, the partition function \(f_{\theta }\) parameterized by \(\theta\), the scaling factor \(\delta\) that lowers the advantage of the inference attack, the number of messages n, and the partition set G, this algorithm outputs the transformed partition set \(G'\) and a local state \({{\mathcal {T}}}\), which is referred to as a table.

  • \({\textsf{PFSE}.\textrm{encrypt}(sk, {\mathcal {T}}, m) \rightarrow C:}\) Given the secret key sk, the local state \({{\mathcal {T}}}\), the plaintext m, and a counter j, this algorithm outputs the ciphertext set C of m.

  • \({\textsf{PFSE}.\textrm{decrypt}(sk, {\mathcal {T}}, c) \rightarrow m~\textbf{OR} \perp :}\) Given secret key sk, local state \({{\mathcal {T}}}\), and the ciphertext c, this algorithm decrypts c using the DET decryption scheme and checks if the decrypted message m is found in \({{\mathcal {T}}}\). It outputs m if m is found in \({{\mathcal {T}}}\) or outputs \(\perp\) if not.

  • \({\textsf{PFSE}.\textrm{smooth}(sk, {\mathcal {T}}, G') \rightarrow C:}\) Given secret key sk, local state \({{\mathcal {T}}}\) and transformed partition set \(G'\), this algorithm smooths \(G'\) and encrypts all the messages using DET encryption scheme. Finally, it outputs the corresponding ciphertexts C of \(G'\).

  • \({\textsf{PFSE}.\textrm{search}(sk, {\mathcal {T}}, m) \rightarrow R:}\) Given secret key sk, the local state \({{\mathcal {T}}}\), and the message m to be searched, this algorithm generates all possible ciphertexts of m and sends them to the server. It then filters out messages that cannot be found in \({{\mathcal {T}}}\) and finally returns the record set R.

Threat model

Our model assumes that a passive adversary somehow obtains a snapshot of the encrypted data but does not interfere with the normal functionality of the DBMS. The adversary also has an auxiliary dataset that may come from the Internet, accidental leakage of previous database records, or other illegal sources, and we assume that the adversary knows the exact distribution of the data. Adversaries that exploit the query pattern to recover the database records are out of scope. The adversary’s task is to recover the plaintext corresponding to each ciphertext in the snapshot based on the auxiliary dataset. The same threat model is also captured in the work of (Ceselli et al. 2005), where the adversary is called \(\mathsf{Freq} + \mathsf{DB}^{k}\): \(\mathsf{Freq}\) denotes the auxiliary information on the distribution of the plaintext data and \(\mathsf{DB}^{k}\) denotes the encrypted database.

A new perspective about security of FSE

Before we introduce our frequency-smoothing encryption (FSE) scheme, it is important to establish a standard for security. Given the potential for inference attacks based on snapshots of the encrypted database and the limitation that the adversary cannot know the order in which the database was created, we outline both a theoretical and a practical security definition. To ensure that the partitioning technique leaks no information, the theoretical definition adheres to traditional security principles in cryptography (i.e., security games and indistinguishability), while the practical definition takes into account an attacker utilizing frequency information from the dataset via an inference attack.

Practical security definition

To properly assess the security of FSE, we must first understand the power and tactics of potential attackers, as well as what it means for the scheme to be considered “broken”. FSE was designed to resist inference attacks. Therefore, it is critical to evaluate the security of FSE from the perspective of statistical inference. FSE schemes aim to minimize or limit the success rate of the attacker.

However, we have observed that simply bounding the absolute rate of recovered plaintexts can be misleading. This is because the distribution from which plaintexts are drawn can significantly impact the success rate. For example, consider a dataset with only two possible values, where 99% of the records are labeled ALIVE and only 1% are labeled DIED. Even a fully randomized encryption scheme would not prevent an attacker from guessing the correct plaintexts, since the attacker could simply assign all ciphertexts to ALIVE and achieve an accuracy of at least 99%. Thus, it would be unfair to attribute the failure of FSE robustness against inference attacks solely to the scheme.
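The ALIVE/DIED example can be checked with a few lines of Python. This is only an illustration of the trivial guessing baseline; the function name is an assumption, and the dataset is the hypothetical 99%/1% split from the paragraph above.

```python
from collections import Counter

def trivial_guess_accuracy(dataset):
    """Accuracy of assigning every ciphertext to the most frequent plaintext."""
    counts = Counter(dataset)
    return max(counts.values()) / len(dataset)

# Hypothetical dataset with 99 ALIVE records and 1 DIED record.
records = ["ALIVE"] * 99 + ["DIED"] * 1
print(trivial_guess_accuracy(records))  # 0.99
```

The result does not depend on the encryption scheme at all, which is why an absolute bound on the recovery rate says little about the scheme itself.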

Therefore, we propose a relative approach to defining security, as we are more interested in the extent to which FSE reduces an attacker’s success rate on a sanitized dataset. This is in line with the core purpose of FSE, which is to defend against inference attacks. To formalize this approach, we introduce the inference experiment \({\mathcal {E}}^{\textsf{Inf},{\mathcal {A}}, {\mathcal {S}}}(M)\), which generalizes the inference attack and captures the adversary’s behavior. In the following section, we further discuss this approach and its limitations.

Inference Experiment \({\mathcal {E}}^{\textsf{Inf},{\mathcal {A}}, {\mathcal {S}}}(M)\): For a given dataset M, a (possibly) randomized encryption scheme \({\mathcal {S}}: {\mathcal {M}} \rightarrow {\mathcal {C}}^{*}\), and a P.P.T. adversary \({{\mathcal {A}}}\), it performs the inference experiment as follows.

  1. The adversary fixes some distribution D and randomly samples n records from D. Denote this dataset by M, with a support \(\textsf{Supp}(M)\) of size k, where the frequency of each message m is given by \(f_M(m)\).

  2. \({{\mathcal {A}}}\) sends the message dataset M to the challenger.

  3. The challenger chooses \(b \overset{\$}{\leftarrow }\ \{1,\ldots ,k\}\) according to \(f_M(m_b)\), i.e., \(\Pr [b = i] = f_{M}(m_i)\). The challenger then chooses the message \(m_b \in \textsf{Supp}(M)\), invokes \({{\mathcal {S}}}\) to process \(m_b\), and obtains \(C \leftarrow {\mathcal {S}}(m_{b})\). Finally, the challenger sends C to the adversary.

  4. \({{\mathcal {A}}}\) runs the inference attack algorithm \(\textsf{Inf}\) against C and M, and outputs a guess \(b'\).

The adversary \({{\mathcal {A}}}\) is said to win the experiment if \(b = b'\), denoted by \({\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, {\mathcal {S}}}(M) \Rightarrow 1\); otherwise, \({\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, {\mathcal {S}}}(M) \Rightarrow 0\).
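One round of the experiment can be simulated as follows. This is a rough sketch under stated assumptions: `opaque_scheme` is a toy stand-in for \({{\mathcal {S}}}\) that reveals nothing, `guess_most_frequent` is a baseline stand-in for \(\textsf{Inf}\), and indices are 0-based rather than ranging over \(\{1,\ldots ,k\}\). None of these names appear in the scheme.

```python
import random
from collections import Counter

def inference_experiment(dataset, scheme, infer, rng):
    """Run one round of the experiment; return 1 if the adversary wins, else 0."""
    counts = Counter(dataset)
    support = [m for m, _ in counts.most_common()]  # m_1..m_k by descending count
    weights = [counts[m] for m in support]          # proportional to f_M(m_i)
    # Step 3: the challenger samples b with Pr[b = i] = f_M(m_i) and processes m_b.
    b = rng.choices(range(len(support)), weights=weights, k=1)[0]
    ciphertext = scheme(support[b])
    # Step 4: the adversary outputs a guess b' from the ciphertext and M.
    b_guess = infer(ciphertext, support, weights)
    return int(b_guess == b)

def opaque_scheme(message):
    """Toy stand-in for S (assumption): the output is independent of the message."""
    return b"\x00"

def guess_most_frequent(ciphertext, support, weights):
    """Baseline Inf (assumption): always guess the most frequent plaintext."""
    return 0

rng = random.Random(0)
M = ["ALIVE"] * 99 + ["DIED"]
trials = 2000
wins = sum(inference_experiment(M, opaque_scheme, guess_most_frequent, rng)
           for _ in range(trials))
print(wins / trials)  # empirical estimate of Pr[A wins]
```

With this baseline the empirical win rate should be close to the largest frequency in M, even though the ciphertext carries no information.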

Following the inference experiment, we model the algorithm \(\textsf{Inf}\) that \({{\mathcal {A}}}\) utilizes. We call the adversary optimal if it aims to find the solution to the maximization

$$\begin{aligned} \arg \max \left\{ \Pr [{\mathcal {A}} \text { wins}] \right\} , \end{aligned}$$
(1)

or equivalently, by the fundamental bridge,

$$\begin{aligned} \arg \max _{{{\mathcal {S}}}} \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, {\mathcal {S}}}(M) \right] \right\} , \end{aligned}$$
(2)

in which the expectation can be rewritten as

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, {\mathcal {S}}}(M) \right]&= \sum _{i=1}^{k}f_{M}(m_i) \cdot \Pr \left[ \textsf{Inf} \Rightarrow b \mid b = i\right] \end{aligned}$$
(3)
$$\begin{aligned}&= \sum _{i=1}^{k}f_{M}(m_i) \cdot g_{{\mathcal {C}}}({\mathcal {C}}_i), \end{aligned}$$
(4)

where, without loss of generality, we assume the attacker accurately divides the ciphertext set (i.e., all possible ciphertexts) \({{\mathcal {C}}}\) into k non-overlapping parts. That is, \({\mathcal {C}} = \bigcup _{i=1}^{k}{\mathcal {C}}_i\) where \({\mathcal {C}}_i\) is the i-th partition and \({\mathcal {C}}_i \cap {\mathcal {C}}_j = \emptyset , \forall i \ne j\). Denote the weight of each set \({\mathcal {C}}_i\) as \(g_{{\mathcal {C}}}({\mathcal {C}}_i)\). The adversary’s guessing strategy is

$$\begin{aligned} \textsf{Inf}(C) \Rightarrow i, \quad \text {if } {\mathcal {S}}(m) \Rightarrow C \in {\mathcal {C}}_{i}. \end{aligned}$$
(5)

By the rearrangement inequality, for the DET scheme, \({\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, \textsf{DET}}(M) \right]\) attains its maximum when \(g_{{\mathcal {C}}}({\mathcal {C}}_i) \ge g_{{\mathcal {C}}}({\mathcal {C}}_j), \forall i \le j\), i.e., when the weights are sorted in the same descending order as the frequencies in \(H_{M}\). We have

$$\begin{aligned} \max _{m \in \textsf{Supp}(M)} f_{M}(m) = \max _{{{\mathcal {S}}}} \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, {\mathcal {S}}}(M) \right] \right\} . \end{aligned}$$
(6)
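The rearrangement-inequality step can be checked numerically on a small example. The sketch below evaluates the sum in Eq. 4 for every ordering of a fixed set of weights; the frequency and weight values are hypothetical and chosen only for illustration.

```python
from itertools import permutations
import math

def expected_win(f, g):
    """Eq. (4): sum_i f_M(m_i) * g(C_i)."""
    return sum(fi * gi for fi, gi in zip(f, g))

f = [0.5, 0.3, 0.2]        # f_M(m_i), sorted in descending order
weights = [0.1, 0.6, 0.3]  # hypothetical weights g(C_i), in arbitrary order

# Best value over all assignments of the weights to the k parts.
best = max(expected_win(f, perm) for perm in permutations(weights))
# Value when the weights are sorted in the same (descending) order as f.
aligned = expected_win(f, sorted(weights, reverse=True))
# Putting all the weight on the most frequent message yields max_m f_M(m).
point_mass = expected_win(f, [1.0, 0.0, 0.0])

print(math.isclose(best, aligned), point_mass)
```

In this example the best ordering coincides with the similarly sorted one, and the point-mass assignment recovers the left-hand side of Eq. 6.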

Definition 1

(Security Against Inference Attacks) For a given P.P.T. adversary \({{\mathcal {A}}}\), let \({\textsf{DET}}\) denote the deterministic encryption scheme, \({{\mathcal {S}}}\) be a randomized encryption scheme, and \(\kappa\) be the security parameter. Define the advantage of \({{\mathcal {A}}}\) as follows.

$$\begin{aligned} \textsf{Adv} :=\frac{\max \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, {\mathcal {S}}}(M) \right] \right\} }{\max \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, \textsf{DET}}(M) \right] \right\} }, \end{aligned}$$
(7)

We say \({{\mathcal {S}}}\) is \(\delta\)-secure, where \(\delta \in (0, 1)\) is the budget for frequency smoothing, if there exists a negligible function \(\textsf{negl}(\kappa )\) such that

$$\begin{aligned} \textsf{Adv} \le \delta + \textsf{negl}(\kappa ). \end{aligned}$$
(8)

Intuitively speaking, a smaller \(\delta\) results in a larger overhead for the FSE scheme while a larger \(\delta\) results in a smaller overhead. The data owner must carefully choose \(\delta\) to achieve a balance between usability and security. Although the trivial guessing attack that assigns all ciphertexts to the most probable plaintext sometimes performs better than a statistically optimal adversary, we believe that such cases are outliers and do not apply to all datasets. We believe an adversary will always try to be statistically optimal.
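The ratio in Definition 1 is straightforward to evaluate once the two success rates are known. The sketch below assumes that both rates have already been measured or computed elsewhere, and it replaces the negligible term with an explicit `slack` argument; the function names and the sample numbers are hypothetical.

```python
def advantage(success_fse, success_det):
    """Eq. (7): best inference success rate under S divided by that under DET."""
    if not 0 < success_det <= 1:
        raise ValueError("success_det must lie in (0, 1]")
    return success_fse / success_det

def is_delta_secure(success_fse, success_det, delta, slack=0.0):
    """Eq. (8), with negl(kappa) replaced by an explicit slack (assumption)."""
    return advantage(success_fse, success_det) <= delta + slack

# Hypothetical measurements: the attack succeeds with rate 0.8 under DET
# and with rate 0.2 under the smoothing scheme.
print(advantage(0.2, 0.8))             # 0.25
print(is_delta_secure(0.2, 0.8, 0.3))  # True
```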

Extending to partition-based FSE schemes

IND-PCUDA security

For partition-based FSE schemes, there is another security constraint: the partitioning phase must not leak extra information to the adversary. We aim to capture this through traditional cryptographic indistinguishability. Unfortunately, to allow equality queries to be executed efficiently, our scheme cannot resist chosen-plaintext attacks (CPA) in the conventional cryptographic sense. We therefore carefully devise a weaker definition modeled after the standard Indistinguishability under Chosen-Plaintext Attack (IND-CPA). Pouliot et al. proposed a security definition called Indistinguishability under Chosen Unordered Database Attack (IND-CUDA) (Pouliot et al. 2019), in which the adversary that utilizes the snapshot cannot know the order in which records were added to the database. We aim to extend this security definition to partition-based FSE schemes. Let \(\theta\) be a set of public parameters (i.e., the adversary \({{\mathcal {A}}}\) can access it) and \(f_{\theta }\) be the partition function used in the PFSE scheme. We define the security game between a challenger and \({{\mathcal {A}}}\) with a security parameter \(\kappa\) as follows.

\(\mathrm {\textbf{Game}}^{{\mathcal {A}}, f_{\theta }}_{\mathrm {\mathbf {IND-PCUDA}}}(\kappa )\):

  1. Initially, \({{\mathcal {A}}}\) fixes some plaintext space \({{\mathcal {M}}}\) and generates a pair of message datasets \(M_{0}, M_{1}\) defined over \({{\mathcal {M}}}\). Let \(\pi (A)\) denote the set of all permutations of a given set A. The generation proceeds as follows.

    (a) Fix the total number of messages as n and the number of partitions as k. For each index \(i \in [k]\), \({{\mathcal {A}}}\) calculates \(m_{i} = f_{\theta }(i) \cdot n\), where \(f_{\theta }: {\mathbb {N}}^{*} \rightarrow (0, 1]\) is a real function parameterized by \(\theta\) (public).

    (b) Then, for each index \(i \in [k]\), \({{\mathcal {A}}}\) chooses an arbitrary number \(n_i \in {\mathbb {N}}\) and draws \(G_i\) from a multinomial distribution \(\textsf{Multi}(m_i, n_i)\). Afterward, \({{\mathcal {A}}}\) calculates \(\pi (G_i)\) and samples two permutations \(G^0_i \overset{\$}{\leftarrow }\ \pi (G_i), G^1_i \overset{\$}{\leftarrow } \pi (G_i)\) uniformly at random.

    (c) Finally, \({{\mathcal {A}}}\) collects \(M_{0}, M_{1}\) by \(M_{0} = \bigcup _{i \in [k]} G^0_i\) and \(M_{1} = \bigcup _{i \in [k]} G^1_i\), respectively.

  2. The challenger generates a random bit \(b \overset{\$}{\leftarrow } \{0,1\}\) and a secret key \(sk \leftarrow \textsf{PFSE}.\textrm{keygen}(1^{\kappa })\). Then she partitions \(M_{b}\) into \(G_{b}\), which is then transformed into \(G'_{b}\).

  3. Finally, the challenger smooths \(G'_{b}\) using the FSE scheme and obtains C. Afterward, the challenger sends C back to the adversary \({{\mathcal {A}}}\).

  4. Upon receiving the messages from the challenger, \({{\mathcal {A}}}\) outputs a guess \(b'\) for b.

The adversary \({{\mathcal {A}}}\) wins the security game if \(b' = b\), and we denote this event by \(\textrm{Game}^{{\mathcal {A}}, f_{\theta }}_{\mathrm {IND-PCUDA}}(\kappa ) \Rightarrow 1\). We call this security game Indistinguishability under Partition-based Chosen Unordered Database Attack (IND-PCUDA) and define security based on it.
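The dataset generation in step 1 of the game can be sketched as below. Several points are assumptions made only for illustration: the quotas \(m_i = f_{\theta }(i) \cdot n\) are passed in as integers, the multinomial draw uses uniform category probabilities (the game leaves these to the adversary), and messages are represented as (partition, category) pairs.

```python
import random

def generate_challenge_pair(n, quotas, categories, rng):
    """Step 1 of the game: return (M_0, M_1) as lists of (partition, category) labels.

    quotas[i]     -- m_i = f_theta(i) * n, assumed here to be an integer
    categories[i] -- n_i, the number of categories chosen for partition i
    """
    assert sum(quotas) == n
    m0, m1 = [], []
    for i, (m_i, n_i) in enumerate(zip(quotas, categories)):
        # (b) Draw G_i: m_i samples over n_i categories (uniform probabilities assumed).
        g_i = [(i, rng.randrange(n_i)) for _ in range(m_i)]
        # Two independent, uniformly random orderings of the same multiset G_i.
        g0 = rng.sample(g_i, len(g_i))
        g1 = rng.sample(g_i, len(g_i))
        # (c) Collect the per-partition sequences into M_0 and M_1.
        m0.extend(g0)
        m1.extend(g1)
    return m0, m1

rng = random.Random(7)
M0, M1 = generate_challenge_pair(n=10, quotas=[6, 4], categories=[3, 2], rng=rng)
print(sorted(M0) == sorted(M1))  # True: same multiset, possibly different order
```

The two datasets always contain the same records, so any distinguishing advantage has to come from the order of the records or from the partitioning itself.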

Definition 2

(IND-PCUDA secure) Let \(\textsf{PFSE}\) denote a partition-based frequency smoothing scheme parameterized by \(\theta\) and a real function \(f_{\theta }: {\mathbb {N}}^{*} \rightarrow (0, 1]\).

\(\textsf{PFSE}\) is said to be IND-PCUDA secure in the presence of a P.P.T adversary \({{\mathcal {A}}}\) if the following holds.

$$\begin{aligned} \Pr [\textrm{Game}^{{\mathcal {A}}, f_{\theta }}_{\mathrm {IND-PCUDA}}(\kappa ) \Rightarrow 1] \le \frac{1}{2} + \textsf{negl}(\kappa ), \end{aligned}$$
(9)

where the probability is taken over random coin tosses of \({{\mathcal {A}}}\), and \(\kappa\) is the security parameter.

The above security game and definition are pessimistic: we assume the P.P.T. adversary \({{\mathcal {A}}}\) has full knowledge of the original dataset M. This is also rather reasonable because, in the real world, such adversaries are the strongest. Another point one should note is that the theoretical security game requires that the number of messages and the number of categories in each partition are fixed, but does not impose any limitation on how the multinomial distribution is crafted. The scheme is secure as long as the adversary modeled in the security game has only a negligible advantage in winning the game.

For the part of practical security, an appealing property of the definition mentioned in Definition 1 is that it allows for easy extension to PFSEs. In a similar fashion, by linearity of expectation, we define the partition-based adversary’s advantage:

$$\begin{aligned} \textsf{Adv} :=\frac{\sum _{i \in [k]} \max \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, {\mathcal {S}}}(G_i) \right] \right\} }{\sum _{i \in [k]} \max \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, \textsf{DET}}(G_i) \right] \right\} }, \end{aligned}$$
(10)

assuming the number of partitions is k.

The partition-based FSE scheme

Partitioning

In this section, we describe the key technique used in our FSE scheme, the partitioning strategy. An important insight into why partitioning is a suitable candidate for frequency smoothing is that it groups together messages with similar frequencies. In the real world, numerous datasets are unevenly distributed, and some have highly skewed frequency information. If we regard the dataset as a whole and want to hide the frequency information of each message, we need to do some alignment: messages with low frequencies need to be padded up to those with high frequencies. The cost of smoothing, therefore, grows larger.

In our scheme, we aim to smooth the message dataset by the partitioning strategy. In particular, the data owner may apply a partition function \(f_{\theta }\) with a parameter set \(\theta\) to the whole dataset and divide the histogram of the original dataset into several partitions according to \(f_{\theta }\). \(\theta\) can be chosen so as to best fit the distribution of the dataset. Specifically, we partition the dataset M so that

$$\begin{aligned} \sum _{m \in G_{i}} f_M(m) = f_{\theta }(i), \end{aligned}$$
(11)

where i indicates the index of each partition.

Fig. 1 The process of partitioning the histogram obtained from the original dataset M. The message \(m_6\) cannot perfectly fit into the second partition, so we split \(m_6\) into two parts according to the portion \(\delta\) and insert the second part back into the histogram

Splitting the message. One should note that an exact partitioning is not always feasible. Consider some partition \(G_i\) and a message \(m'\) next to \(G_i\). If \((\sum _{m \in G_{i}} f_M(m)) + f_M(m') > f_{\theta }(i)\) but \(\sum _{m \in G_{i}} f_M(m) < f_{\theta }(i)\), then \(m'\) cannot be placed into \(G_i\) as a whole, and the partitioning falls short of the property required by Eq. 11. To overcome this challenge, we split the message \(m'\) if it resides at the boundary between two partitions. Figure 1 illustrates the partitioning algorithm.

Fig. 2 The building blocks for the partition-based frequency-smoothing encryption scheme

The algorithm PFSE.partition in Fig. 2 shows the detailed steps of partitioning, which is quite straightforward:

  • Step 1: Compute the histogram (sorted by frequency in descending order) of the given dataset M.

  • Step 2: Compute the value of \(f_{\theta }(i)\) for each index i and try to add the next message to the current partition. If the last message \(m'\) does not fit into the current partition \(G_i\), then split \(m'\) into two parts according to the proportion \(\delta\), which is given by \((\sum _{m \in G_{i}} f_M(m)) + f_M(m') - f_{\theta }(i)\). The first part is \(\delta \cdot n_M(m')\), and the second part \((1 - \delta ) \cdot n_M(m')\) is added back to the histogram in sorted order according to its frequency, which is done by the utility function \(\textsf{addSorted}\).
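The two steps above can be sketched in Python as follows. This is not the algorithm of Fig. 2 verbatim: it assumes each quota \(f_{\theta }(i) \cdot n\) is an integer and that the quotas sum to \(|M|\), and it splits a boundary message by the exact number of records that still fit rather than by the proportion \(\delta\). The names `partition` and `add_sorted` are illustrative.

```python
from collections import Counter

def add_sorted(hist, item):
    """Insert (message, count) into hist, keeping counts in descending order."""
    idx = 0
    while idx < len(hist) and hist[idx][1] >= item[1]:
        idx += 1
    hist.insert(idx, item)

def partition(dataset, quotas):
    """Split the histogram of dataset into partitions holding quotas[i] records each.

    quotas[i] plays the role of f_theta(i) * n and is assumed to be an integer;
    the quotas are assumed to sum to len(dataset).
    """
    assert sum(quotas) == len(dataset)
    # Step 1: histogram sorted by frequency in descending order.
    hist = sorted(Counter(dataset).items(), key=lambda kv: kv[1], reverse=True)
    partitions = []
    for quota in quotas:
        part, filled = Counter(), 0
        # Step 2: keep taking the next message until the quota is met.
        while filled < quota and hist:
            message, count = hist.pop(0)
            room = quota - filled
            if count > room:
                # The message straddles the boundary: keep `room` records here
                # and put the remainder back into the sorted histogram.
                add_sorted(hist, (message, count - room))
                count = room
            part[message] += count
            filled += count
        partitions.append(dict(part))
    return partitions

# Toy dataset (hypothetical): counts 9, 6, 3, 2 split into two partitions of 10.
M = ["a"] * 9 + ["b"] * 6 + ["c"] * 3 + ["d"] * 2
print(partition(M, [10, 10]))
```

In this toy run the message `b` sits on the boundary, so one of its records lands in the first partition and the remaining five in the second.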

Transforming and smoothing the dataset

Fig. 3 The transforming example. We consider three messages \(m_1\), \(m_2\), \(m_3\), and their respective occurrences in the original dataset: \(n_M(m_1) = 9\), \(n_M(m_2) = 6\), and \(n_M(m_3) = 3\). To determine the number of replicas for each message, the value \(k_i\) is calculated as \(k_i = \frac{1}{k} \cdot n_M(m)\). Thus the total number of replicas \(r_i\) is 9. The number of dummy records, \(d_1\), \(d_2\), and \(d_3\), is then determined as \(n_i - r_i\)

Transformation of the original dataset in partitions can be thought of as “duplicating” the messages according to the parameter \(k_i\) and padding dummy records to each partition. An example in which the partition number \(k = 3\) is depicted in Fig. 3. Formally, let \(n_i\) denote the number of distinct messages (including dummy records), \(r_i\) denote the number of distinct real messages and \(d_i\) denote the number of distinct dummy records. We are interested in the following equation.

$$\begin{aligned} n_i = r_i + d_i. \end{aligned}$$
(12)

Since the size of the ciphertext set for each message m is given by \(k_i \cdot n_M(m)\), the above equation can be rewritten as

$$\begin{aligned} n_i = k_i \cdot f_{\theta }(i)n + d_i. \end{aligned}$$
(13)

In our scheme, for each partition \(G_i, i \in [k]\), we set \(k_i = \frac{f_{\theta }(i)}{k}\), where k is the total number of partitions. In addition, the constraint that \(d_i \ge 0\) should be satisfied so that the scheme is always valid for every partition \(G_i\). To satisfy this constraint, we set \(n_i\) as follows.

$$\begin{aligned} n_i = {\left\{ \begin{array}{ll} k_i \cdot f_{\theta }(i)n, &{}\text {if }a \le \frac{\delta \beta \cdot f_{\theta }^2(i)n^2}{k} \\ \frac{nt}{\delta \beta }, &{}\text {otherwise} \end{array}\right. }, \end{aligned}$$
(14)

where \(t = \sum _{m \in G_{i}}f_M^2(m)\) and \(a = \sum _{m \in G_i}n_M^2(m)\).
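The case split in Eq. 14 can be read as a maximum. The following is a short sketch of why, assuming \(k_i = \frac{f_{\theta }(i)}{k}\) as set above, \(f_M(m) = \frac{n_M(m)}{n}\) so that \(a = n^2 t\), and \(\delta \beta > 0\) (the symbol \(\beta\) is taken as given from Eq. 14):

$$\begin{aligned} a \le \frac{\delta \beta \cdot f_{\theta }^2(i)n^2}{k} \iff n^2 t \le \delta \beta n \cdot k_i \cdot f_{\theta }(i)n \iff \frac{nt}{\delta \beta } \le k_i \cdot f_{\theta }(i)n. \end{aligned}$$

Under these assumptions, \(n_i = \max \left\{ k_i \cdot f_{\theta }(i)n, \frac{nt}{\delta \beta } \right\} \ge k_i \cdot f_{\theta }(i)n = r_i\), so \(d_i = n_i - r_i \ge 0\) holds in both cases.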

Dummy records. To fulfill the security requirement of achieving \(\delta\)-security, merely replicating messages from the original dataset is insufficient, as an adversary could still derive significant information about the underlying plaintexts. In order to enhance the obfuscation of the dataset distribution, we introduce additional dummy records into each partition, after determining the number of replicas required for each message. Our proposed scheme achieves this by adding random bit strings that are uniformly drawn from the entire message space and encrypted as dummy records, thereby further obscuring the true dataset distribution.

Local table. On the client side, our privacy-preserving scheme requires the creation of an additional table, denoted as \({{\mathcal {T}}}\), which maps each message m to a set of triplets \(\langle i, j, k \rangle\). The triplet notation consists of the index i of the partition that message m belongs to, j which represents the size of the set of ciphertexts for message m, and k which indicates the count of each ciphertext in the set. To illustrate, consider the example shown in Fig. 3, where \(\langle i, 3, 9 \rangle \in {\mathcal {T}}[m_1]\), \(\langle i, 2, 6 \rangle \in {\mathcal {T}}[m_2]\), and \(\langle i, 1, 3 \rangle \in {\mathcal {T}}[m_3]\). It is worth noting that the reason why each value in the hashmap \({{\mathcal {T}}}\) is a set is due to the possibility of messages being split into multiple portions, resulting in each message residing in several partitions. Additionally, it is important to highlight that the number of records in the local table \({{\mathcal {T}}}\) is precisely the number of distinct messages in the original dataset M, such that \(|\mathcal T| = N\). To ensure the privacy requirement of our scheme, it is necessary to store this table on the client side.

The main procedure is presented in PFSE.transform (see Fig. 2), which consists of the following steps.

  • Step 1: Duplicate each message m in each partition by \(k_i \cdot n_M(m)\) (this is also the number of ciphertexts of m for this partition) and add a new piece of metadata \(\langle i, size, k \rangle\) into the local table \({{\mathcal {T}}}\). In the meantime, increase the counter \(r_i\) that records the number of real message replicas.

  • Step 2: Determine the number of dummy records \(d_i\) by \(n_i - r_i\). The generation of dummy records is to uniformly sample a random bit string from the space \({{\mathcal {M}}}\) over which the message is defined. Afterward, the dummy record d is added to the current partition with a count \(\frac{1}{k_i} = k\).
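The two steps above can be sketched for a single partition as follows. This is only an illustration under stated assumptions, not the PFSE.transform algorithm of Fig. 2: the function name is hypothetical, fractional replica numbers are rounded up, the dummy count is clamped at zero, and dummies are modeled as fixed-length random byte strings.

```python
import math
import random
from fractions import Fraction


def transform_partition(i, counts, k_i, n_i, table, rng, msg_len=8):
    """Sketch of Steps 1 and 2 for partition i.

    counts : dict mapping each message m in the partition to n_M(m)
    k_i    : scaling factor of the partition (a Fraction avoids float rounding)
    n_i    : target size of the partition
    table  : local table T, a dict from message to a set of (i, size, count)
    rng    : random.Random instance (use a CSPRNG such as `secrets` in practice)
    Returns (real, dummies): real replicas as (message, replica index) pairs
    and the dummy byte strings.
    """
    count = 1 / k_i                              # count of every replica and dummy
    real = []
    for m, n_m in counts.items():
        # Step 1: k_i * n_M(m) ciphertexts for m (rounded up: an assumption).
        size = max(1, math.ceil(k_i * n_m))
        table.setdefault(m, set()).add((i, size, count))
        real.extend((m, j) for j in range(size))
    r_i = len(real)                              # counter of real replicas
    # Step 2: d_i = n_i - r_i dummies, clamped at zero (an assumption).
    d_i = max(0, math.ceil(n_i) - r_i)
    dummies = [bytes(rng.getrandbits(8) for _ in range(msg_len))
               for _ in range(d_i)]
    return real, dummies
```

Dummies are deliberately not entered into the local table, so that a later lookup in \({{\mathcal {T}}}\) can tell them apart from real messages.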

Finally, after each partition is properly transformed according to the parameters \(k_i, n_i\), we apply the smooth algorithm to the transformed dataset. As shown in Algorithm PFSE.smooth in Fig. 2, it iterates over each partition and invokes PFSE.encrypt algorithm (see Fig. 2) to generate the ciphertexts for each message in the partition. The PFSE.encrypt algorithm first looks up the local table \({{\mathcal {T}}}\) to determine if the message m is a dummy. It then appends i denoting the index of the partition m belongs to and j denoting the index of the replica of m to the raw message m such that the ciphertext is obtained by encrypting m||i||j, where “||” means concatenation of strings.

Searching

Fig. 4 The decryption algorithm

When searching for a specific keyword in a database using the PFSE scheme, the first step is to generate all search tokens by encrypting the keyword w, followed by sending an OR predicate to the database system. The searching phase is detailed in Algorithm PFSE.search (refer to Fig. 2). It’s worth noting that the algorithm needs to cross-reference the local table \({{\mathcal {T}}}\) to verify the validity of the given keyword w. Once the encrypted record set R is obtained, the PFSE.decrypt algorithm (see Fig. 4) filters out dummy records. In the case of a relational database system, the PFSE scheme sends a query of the form “SELECT * FROM TABLE WHERE ATTR = \(C_1\) OR ATTR = \(C_2\) OR ...”, where \(C_i \in {\mathcal {C}}, \forall i \in [|\mathcal C|]\).
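For the relational case, the OR predicate described above might be assembled as in the following sketch. The helper name, the table and column names, and the use of SQLite are illustrative assumptions (the evaluation in this paper uses MongoDB); the ciphertexts are passed as bound parameters rather than spliced into the query string.

```python
import sqlite3


def build_or_query(table, attr, ciphertexts):
    """Return (sql, params) for ATTR = C_1 OR ATTR = C_2 OR ...

    `table` and `attr` are interpolated directly and must be trusted
    identifiers; the ciphertexts are supplied as bound parameters.
    """
    if not ciphertexts:
        raise ValueError("at least one ciphertext is required")
    predicate = " OR ".join(f"{attr} = ?" for _ in ciphertexts)
    return f"SELECT * FROM {table} WHERE {predicate}", list(ciphertexts)


def demo():
    """Run the query against a small in-memory table of fake ciphertexts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (attr BLOB)")
    rows = [(b"c1",), (b"c2",), (b"c3",), (b"c1",)]
    conn.executemany("INSERT INTO records VALUES (?)", rows)
    sql, params = build_or_query("records", "attr", [b"c1", b"c3"])
    return sorted(r[0] for r in conn.execute(sql, params))
```

The records returned by such a query would then be decrypted on the client, where dummy records are filtered out.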

Remark

It is also possible for the PFSE scheme to be made applicable to searchable encryption. For example, for each message m, the search tokens can be obtained by \(t \leftarrow {\mathcal {F}}_k(m || i || j), \forall \langle i, j, k \rangle \in {\mathcal {T}}[m]\), where \({\mathcal {F}}_k\) is a keyed pseudorandom function (e.g., HMAC). The ciphertext is obtained by any IND-CPA symmetric encryption scheme such as AES-GCM with a secure random nonce. Adopting PFSE to construct a searchable encryption scheme can make the server-side index more efficient.
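A minimal sketch of the token derivation in this remark is given below, assuming HMAC-SHA256 as the pseudorandom function. The function name is hypothetical, and two further choices are assumptions of the sketch: j ranges over the replica indices of each triplet, and i and j are appended as fixed-width integers so that the encoding of m||i||j is unambiguous.

```python
import hashlib
import hmac


def search_tokens(key, message, table):
    """Derive one token per replica of `message` from the local table.

    key     : secret key of the PRF (bytes)
    message : the keyword m (bytes)
    table   : local table T, a dict from message to a set of (i, size, count)
    """
    tokens = []
    for i, size, _count in sorted(table.get(message, ())):
        for j in range(size):
            # Fixed-width suffix keeps m || i || j unambiguous (an assumption).
            label = message + i.to_bytes(4, "big") + j.to_bytes(4, "big")
            tokens.append(hmac.new(key, label, hashlib.sha256).digest())
    return tokens
```

A keyword that is absent from the local table yields no tokens, which mirrors the validity check against \({{\mathcal {T}}}\) performed during search.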

Security analysis

We prove in this subsection that the aforementioned FSE scheme is both theoretically and practically secure. Informally speaking, for the theoretical part, the security of FSE derives from the use of secure cryptographic algorithms, i.e., a symmetric encryption algorithm that is strong enough to resist cryptanalysis, and from the fact that any chosen distribution for each partition is uniform after smoothing. For the practical part, we carefully choose the parameters \(k_i\) and \(n_i\) so that the advantage gained by the adversary in an inference attack is always bounded by a given threshold \(\delta \beta\).

We now present the main theorem that establishes the theoretical security of the PFSE scheme as follows.

Theorem 1

The above PFSE scheme that uses a real function \(f_{\theta }: {\mathbb {N}} \rightarrow (0, 1]\) parameterized by \(\theta\) is IND-PCUDA secure as long as the underlying deterministic encryption \(\textsf{DET} = (\textsf{Enc}, \textsf{Dec}, \textsf{Gen})\) is secure against known-ciphertext attack (KCA) with regard to a security parameter \(\kappa\).

Proof

The idea of the proof follows the standard reduction method used in cryptography. We first reduce proof of security over the entire message dataset M into proof of security over each partition \(G_i\) that can be further reduced into the security of the underlying symmetric encryption scheme.

To begin with, let k denote the number of partitions and \({{\mathcal {A}}}\) denote a P.P.T. adversary in the IND-PCUDA game. If \({{\mathcal {A}}}\) succeeds in the security game, then there exists at least one index \(i \in [k]\) such that \({{\mathcal {A}}}\) is able to distinguish between \(G^{0}_{i}\) and \(G^{1}_{i}\) with different orders. Denote this event by \(\widehat{{\textsf{G}}^{\pi , {\mathcal {A}}}_{i}}\). Thus, by the union bound, one obtains

$$\begin{aligned} \Pr \left[ \textrm{Game}^{{\mathcal {A}}, f_{\theta }}_{\mathrm {IND-PCUDA}}(\kappa ) \Rightarrow 1\right] \le \sum _{i \in [k]} \Pr [\widehat{{\textsf{G}}^{\pi , {\mathcal {A}}}_{i}}]. \end{aligned}$$
(15)

Consider arbitrary partition \(G_i\). Note that \(k_i = \frac{f_{\theta }(i)}{k}\), so the count of each replica for every message within \(G_i\) is always \(\frac{k}{f_{\theta }(i)}\); thus \({{\mathcal {A}}}\) is unable to distinguish between \(G^0_i\) and \(G^1_i\) by replicas. Now consider the count of each dummy record. Since our PFSE scheme enforces that it should be the same as the number of each replica, any two permutations of the current partition \(G_i\) will be smoothed into the same dataset where the occurrence of each element is exactly the same. Thus, we know

$$\begin{aligned} \Pr [\widehat{{\textsf{G}}^{\pi , {\mathcal {A}}}_{i}}] \le \Pr [\widehat{\textsf{DET}^{{{\mathcal {A}}}, \kappa }}] \le \textsf{negl}(\kappa ), \end{aligned}$$
(16)

assuming that \(\textsf{DET}\) is secure against known-ciphertext attack, where \(\widehat{\textsf{DET}^{{{\mathcal {A}}}, \kappa }}\) denotes the event that \(\textsf{DET}\) is broken by \({{\mathcal {A}}}\).

Substituting Eq. 16 into Eq. 15, we derive the following.

$$\begin{aligned} \Pr \left[ \textrm{Game}^{{\mathcal {A}}, f_{\theta }}_{\mathrm {IND-PCUDA}}(\kappa ) \Rightarrow 1\right] \le k \cdot \textsf{negl}(\kappa ) = \textsf{negl}(\kappa ), \end{aligned}$$
(17)

which completes the proof. \(\square\)

Next, we turn to the main theorem of the practical security part. Before proving the main theorem, we introduce another lemma that facilitates the proof.

Lemma 1

Our scheme is \(\delta\)-secure if the following holds.

$$\begin{aligned} \frac{k_i}{n_i} \le \frac{\delta \beta f_{\theta }(i)n}{ak}, \end{aligned}$$
(18)

where \(\beta = \sum _{G} \max \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, \textsf{DET}}(G) \right] \right\}\) is the baseline advantage, k is the number of partitions, \(a = \sum _{m \in G_{i}} n_M^2(m)\), \(n_i\) is the number of messages in \(G_i\), and n is the number of messages.

Proof

For each partition G, let N denote the number of messages in G, \(k_i\) denote the scaling factor, \(n_M(m)\) denote the count of each message m, and \(C_m\) denote the ciphertext sets for m. With union bound and the fact that the eventual distribution of ciphertexts is uniform, for every message \(m_i \in G\), one has

$$\begin{aligned} \Pr \left[ {\mathcal {E}}^{\textsf{Inf},{\mathcal {A}}, \textsf{PFSE}}(G) = 1 | b = i\right] \le \sum _{j \in [|C_{m_i}|]} \frac{1}{N} = \frac{|C_{m_i}|}{N}. \end{aligned}$$
(19)

Since \(|C_{m}| = k_i \cdot n_M(m)\) for every message \(m \in G_i\), we have

$$\begin{aligned}&\sum _{i \in [k]} \max {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, \textsf{PFSE}}(G_i) \right] \end{aligned}$$
(20)
$$\begin{aligned}&\qquad \le \sum _{i \in [k]} \left( \sum _{m \in G_i} \frac{f_G(m)|C_{m}|}{N} \right) \end{aligned}$$
(21)
$$\begin{aligned}&\qquad = \sum _{i \in [k]} \left( k_i \cdot \sum _{m \in G_i} \frac{n^2_M(m)}{f_{\theta }(i)nN} \right) \end{aligned}$$
(22)
$$\begin{aligned}&\qquad = \sum _{i \in [k]} \frac{k_i a}{f_{\theta }(i)nN}, \end{aligned}$$
(23)

where \(a = \sum _{m \in G_{i}} n^2_M(m)\). If \(\frac{k_i}{N} \le \frac{\delta \beta nf_{\theta }(i)}{ak}\), we immediately obtain

$$\begin{aligned} \sum _{i \in [k]} \frac{k_i a}{f_{\theta }(i)nN}&\le \sum _{i \in [k]} \frac{a}{f_{\theta }(i)n} \cdot \frac{\delta \beta nf_{\theta }(i)}{ak} \end{aligned}$$
(24)
$$\begin{aligned}&= \sum _{i \in [k]} \frac{\delta \beta }{k} \end{aligned}$$
(25)
$$\begin{aligned}&= \delta \beta . \end{aligned}$$
(26)

Hence, it is straightforward to see that

$$\begin{aligned} \frac{\sum _{i \in [k]} \max \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, \textsf{PFSE}}(G_i) \right] \right\} }{\sum _{i \in [k]} \max \left\{ {\mathbb {E}}\left[ {\mathcal {E}}^{\textsf{Inf}, {\mathcal {A}}, \textsf{DET}}(G_i) \right] \right\} } \le \delta . \end{aligned}$$
(27)

This completes the proof. \(\square\)

Theorem 2

Our FSE scheme is \(\delta\)-secure against the inference attack as per Definition 1.

Proof

We know that \(\forall i \in [k]\), the scheme enforces \(k_i = \frac{f_{\theta }(i)}{k}\) and

$$\begin{aligned} n_i = {\left\{ \begin{array}{ll} k_i \cdot f_{\theta }(i)n &{}\text {if }a \le \frac{\delta \beta \cdot f_{\theta }^2(i)n^2}{k} \\ \frac{nt}{\delta \beta } &{}\text {otherwise} \end{array}\right. }, \end{aligned}$$
(28)

where \(t = \sum _{m \in G_i} f_M^2(m), a = \sum _{m \in G_i} n_M^2(m)\). A straightforward computation yields

$$\begin{aligned} \frac{k_i}{n_i} \le \frac{\delta \beta f_{\theta }(i)n}{ak}. \end{aligned}$$
(29)

The theorem follows from applying Lemma 1. \(\square\)
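The “straightforward computation” behind Eq. 29 can be spelled out as follows, under the assumption (not restated in the proof) that \(f_M(m) = n_M(m)/n\), so that \(t = a/n^2\). In the first case of Eq. 28, the condition on a gives

$$\begin{aligned} \frac{k_i}{n_i} = \frac{1}{f_{\theta }(i)n} = f_{\theta }(i)n \cdot \frac{1}{f_{\theta }^2(i)n^2} \le f_{\theta }(i)n \cdot \frac{\delta \beta }{ak} = \frac{\delta \beta f_{\theta }(i)n}{ak}. \end{aligned}$$

In the second case,

$$\begin{aligned} n_i = \frac{nt}{\delta \beta } = \frac{a}{\delta \beta n}, \qquad \frac{k_i}{n_i} = \frac{f_{\theta }(i)}{k} \cdot \frac{\delta \beta n}{a} = \frac{\delta \beta f_{\theta }(i)n}{ak}, \end{aligned}$$

so the bound of Eq. 29 holds with equality.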

Performance analysis

During the partitioning phase, the algorithm groups messages from the original dataset M into partitions, iterating over the histogram H(M). The time complexity of this phase is given by \(O(N + n + c)\), where the small constant c arises from the need to split messages on the intersecting point of two partitions. This constant is negligible and thus ignored in the subsequent analysis.

During the transforming and smoothing phase, the algorithm duplicates messages and pads them with dummy records, iterating over the partition set obtained in the previous phase. The time cost of duplicating messages is O(N), while that of padding dummy records is given by \(O(\frac{nk}{\delta \beta })\). Smoothing incurs a time cost of \(O(\frac{nN}{k})\), leading to an overall time complexity of \(O(N + \frac{nk}{\delta \beta } + \frac{nN}{k})\) for this phase.

Finally, the time complexity of the search phase depends only on the number of replicas of the queried message: searching for a single message m costs \(O(\frac{n_M(m)}{k})\).

Regarding storage, the client’s storage cost is O(N), with a local table \({{\mathcal {T}}}\) stored on its side. On the server side, storage primarily comprises dummy records and ciphertexts of messages, with size \(n_i\) for all \(i \in [k]\). The server’s storage (excluding the index) can be calculated as \(k \sum _{i \in [k]} n_i\), which yields O(nk).

Evaluation

We evaluate our scheme by comparing its performance with LPFSE schemes (including LPFSE-IBHE and LPFSE-BHE) which are state-of-the-art frequency-smoothing encryption schemes. We evaluate the running time of initialization and query operation, and the storage on the client and server. Furthermore, we measure the success rate of recovered messages under MLE attack (Lacharité and Paterson 2018) which is an efficient inference attack. We set deterministic encryption (DET) and completely random encryption (RND) as two baselines for fairness.

Implementation details

We implement the Partition-based FSE (PFSE) and LPFSE schemes (including LPFSE-IBHE and LPFSE-BHE schemes) (Lacharité and Paterson 2018) in Rust for efficiency and memory safety. The code is open-sourced on GitHub (see https://github.com/hiroki-chen/PFSE-Prototype). Our experiments use MongoDB 6.0.3 running on a PC with dual Intel(R) Xeon(R) Gold 6248R CPUs @ 3.00 GHz (48 cores, 96 threads in total), 128 GB RAM, and a 1 Gbps network; the operating system is Ubuntu 20.04 LTS. The AES-GCM (256 bits) algorithm with a fixed nonce (12 zero bytes) is used for deterministic encryption, and a secure random nonce is used for fully randomized encryption. The auxiliary dataset is the same as the plaintext, which is the strongest scenario for the attacker.

Experiment setting

We create an index on the encrypted attribute for each experimental suite and deploy the client and server on the same machine to reduce the impacts of network latency. To measure the query benchmark, we randomly draw 100 keys from the dataset independently and send queries to the database to simulate a real-world scenario. All experiments run 10 times in single-thread mode, and results are averaged.

Datasets. Our scheme works on a variety of data types. We adopt two datasets in our experiments. One is the Supermarket Dataset for Predictive Marketing (SDPM) (Hunter 2023), which comprises a total of 2,019,501 records that capture E-Commerce customer behavior. The other is the American Community Survey (ACS) dataset (Bureau 2015), which consists of 1,618,489 valid records. To investigate the impact of different attribute domains and distributions on the performance of our FSE schemes, we select four columns from each dataset. For a comprehensive overview of the metadata associated with these columns, refer to Table 2.

Settings of the LPFSE schemes (Lacharité and Paterson 2018). The optimal choice of the security parameter \(\varepsilon\) for homophonic encoding strategies is a topic that has not been satisfactorily addressed in (Lacharité and Paterson 2018). To ensure a fair comparison, our experiment takes two steps. Firstly, we adopt the typical choices of \(r_{\min }\) as outlined in (Lacharité and Paterson 2018), though we use a different dataset as their dataset is currently unavailable. Secondly, we apply two variants (Section 4.2 in (Lacharité and Paterson 2018)) to optimize performance. Here, \(r_{\min }\) refers to the minimum encoding length for each attribute in the LPFSE-IBHE scheme (Lacharité and Paterson 2018), and we set a similar r for attributes similar to those evaluated in their work. It should be noted that our dataset size is greater than that of the original study, and to address this difference, we increase the encoding length r by 2. The choices of r for each scheme are also listed in Table 2. For the LPFSE-BHE scheme (Lacharité and Paterson 2018), we calculate \(\varepsilon\) by fixing r and setting the same value for all attributes.

Table 2 Metadata for each column

Performance evaluation

Microbenchmarks

In this section, we analyze the performance of the PFSE scheme under various settings. We evaluate four key metrics: server storage, client storage, initialization time, and query time. The data column we used is order_number (refer to Table 2), and for various dataset sizes n, we shuffle the original dataset randomly and select n records from it.

Performance with different \(f_{\theta }\). Without loss of generality, we fix the partitioning function \(f_{\theta }(x)\) as \(f(x) = \lambda e^{-\lambda x}\) and evaluate the effect of different choices of \(\lambda\) to simulate various partitioning functions. We test how \(\lambda \in \{ 0.25, 0.50, 0.75, 1.0 \}\) affects the performance with different dataset sizes n from \(10^3\) to \(10^6\). We set the security parameter \(\delta\) to 0.1 in this experiment.

We present a detailed analysis of the experimental results obtained from our PFSE scheme, as depicted in Fig. 5. We highlight the scheme’s performance in initialization, comparing it with the insecure baseline (DET) and RND. The experiment shows that the initialization time blowup is \(\sim 90 \times\), which is acceptable for real-world applications given the infrequency of database initialization. The choice of \(\lambda\) does not significantly impact the initialization time. Smaller values of \(\lambda\) may lead to a slight increase in initialization time, as they create more “even” partitions, resulting in higher overheads during PFSE smoothing. In contrast, larger values of \(\lambda\) significantly increase the query time (refer to Fig. 5b). This outcome is attributed to the decreased number of partitions k as \(\lambda\) increases, which necessitates more replicas for each message to maintain privacy (recall that the number of replicas for each message is given by \(\frac{f_{\theta }(i) n_M(m)}{k}\)).

In terms of storage overhead, our scheme has a minimal impact on the client side, whose storage requirement is only O(N). The server incurs a greater storage overhead because it stores the replicas and dummy records. Compared with RND, the blowup ranges from \(100 \times\) to \(1000 \times\) when the number of records is \(10^6\), but the effect of \(\lambda\) on the server side is relatively small.

Fig. 5 Microbenchmarks under different \(\lambda\)

Performance with different \(\delta\). We keep the partitioning function fixed as \(f(x) = 0.5 e^{-0.5x}\). We examine five different values of \(\delta \in \{ 0.10, 0.15, 0.20, 0.25, 0.5 \}\).

Fig. 6 Microbenchmarks under different \(\delta\)

Consistent with the mathematical analysis, reducing \(\delta\) increases the time cost of database initialization and query processing (refer to Fig. 6a, b). A smaller \(\delta\) leads to increased query time due to larger server storage and the longer query time required by the database backend. In the worst case (i.e., \(\delta = 0.10\)), the query time is only \(\sim 1.03\times\) slower than that of DET and far better than that of RND. Furthermore, unlike the effect of \(f_{\theta }\) on storage overhead, a smaller \(\delta\) entails greater server-side storage.

Comparison with LPFSE schemes

Fig. 7 Time of query

We compare the query performance of PFSE, LPFSE-IBHE, and LPFSE-BHE across various attributes. For the PFSE scheme, we use a fixed partitioning function of \(f(x) = 0.5 e^{-0.5x}\), and the choices of \(\delta\) are provided in Table 2. As shown in Fig. 7, the PFSE scheme outperforms both LPFSE-IBHE and LPFSE-BHE on most attributes in the dataset, except for those with small domains such as reordered and SEX. The PFSE scheme incurs only a small query overhead, and its performance approximates that of the DET scheme. An interesting observation from our experiments is that the LPFSE-IBHE scheme experiences significant performance degradation when the dataset has a skewed distribution, with only a few messages appearing frequently while most appear infrequently. This is due to the security policy of the scheme, which enforces a minimum encoding length of \(r \ge \log _{2}\left( \frac{\sqrt{n}}{2\sqrt{2\pi }\varepsilon \cdot f_M(m_1)} \right)\). In fact, on most attributes, the LPFSE-IBHE scheme is even slower than RND due to the large search space incurred by the encoding strategy. Thus, on attributes with skewed distribution such as add_to_cart_order, our PFSE scheme achieves a significant speedup of \(\sim 510 \times\) compared to the LPFSE-IBHE scheme.
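To make the effect of the LPFSE-IBHE bound concrete, the following sketch evaluates the minimum encoding length quoted above. The function name and the sample values are hypothetical, and rounding the bound up to an integer number of bits is an assumption of the sketch.

```python
import math


def min_encoding_length(n, epsilon, f_m1):
    """Smallest integer r with r >= log2(sqrt(n) / (2*sqrt(2*pi)*epsilon*f_m1)).

    n       : number of records in the dataset
    epsilon : security parameter of the homophonic encoding
    f_m1    : the frequency f_M(m_1) appearing in the bound
    """
    bound = math.log2(
        math.sqrt(n) / (2 * math.sqrt(2 * math.pi) * epsilon * f_m1)
    )
    return max(0, math.ceil(bound))
```

Because the frequency appears in the denominator, halving \(f_M(m_1)\) adds one bit to the bound, which doubles the number of possible encodings and is in line with the large search space observed on skewed attributes.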

Security evaluation against inference attack

We concentrate on assessing the security of FSE schemes against a state-of-the-art inference attack (Lacharité and Paterson 2018). The security is evaluated by the weighted average rate of messages that the attacker is able to recover. Formally, the recovery rate \(\alpha\) is calculated as follows.

$$\begin{aligned} \alpha :=\sum _{m \in {{\mathcal {M}}}} f_M(m) \cdot \frac{\# R_m}{|C_m|}, \end{aligned}$$
(30)

where \(\# R_m\) denotes the number of the correct guesses for m, and \(C_m\) denotes the ciphertext set for m.
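The recovery rate of Eq. 30 can be computed as in the sketch below; the function name and the dictionary-based inputs are illustrative choices rather than part of the evaluation code.

```python
def recovery_rate(frequencies, correct, ciphertext_sets):
    """Weighted recovery rate alpha of Eq. 30.

    frequencies     : dict m -> f_M(m), the relative frequency of m
    correct         : dict m -> #R_m, the number of correct guesses for m
    ciphertext_sets : dict m -> collection C_m of ciphertexts of m
    """
    return sum(
        f * correct.get(m, 0) / len(ciphertext_sets[m])
        for m, f in frequencies.items()
    )
```

For instance, with two messages of frequencies 0.75 and 0.25, ciphertext sets of sizes 4 and 2, and 1 and 2 correct guesses respectively, the rate is \(0.75 \cdot \frac{1}{4} + 0.25 \cdot \frac{2}{2} = 0.4375\).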

First, we evaluate how the partitioning function \(f_{\theta }\) and the security parameter \(\delta\) affect the robustness of PFSE against inference attacks. Then, we compare the security of the PFSE scheme with that of LPFSE-IBHE and LPFSE-BHE.

Security with different \(f_{\theta }\). Similar to the settings in the performance evaluation, we also set the partitioning function to \(f(x) = \lambda e^{-\lambda x}\) with \(\lambda \in \{ 0.25, 0.50, 0.75, 1.00 \}\) and \(\delta = 0.5\) to better demonstrate the effect of \(\lambda\), but the dataset size is \(n = 2,019,501\) (i.e., the whole dataset). We choose order_hour_of_day and reordered columns to evaluate the performance.

Fig. 8 MLE attack result under different \(\lambda\)

In Fig. 8, we present the results of the adapted MLE attack (Lacharité and Paterson 2018) applied to the PFSE scheme with various partitioning function selections. We find that the PFSE scheme outperforms the RND scheme due to its strategy of padding dummy records into the encrypted dataset. This approach enlarges the searching space of the ciphertexts and is deemed to be a valuable strategy for thwarting snapshot adversaries. Furthermore, our experimental results suggest that larger \(\lambda\) values (i.e., fewer partitions) lead to decreased security levels. This is due to the fact that smaller \(\lambda\) values create “smoother” partitions, which conceal the frequency information.

Fig. 9 MLE attack result of different \(\delta\)

Security with different \(\delta\). We investigate the influence of altering \(\delta\) values by analyzing a range of five distinct values belonging to the set \(\{0.10, 0.15, 0.20, 0.25, 0.5\}\). Our goal is to gain a comprehensive understanding of how the fluctuation of \(\delta\) values affects the attack results. As shown in Fig. 9, we find that, for \(\delta \le 0.25\), the adapted MLE attack (Lacharité and Paterson 2018) fails to acquire any significant information, due to the injection of dummy records. However, even with a relatively high \(\delta\) value (\(\delta = 0.5\)), the MLE attack’s accuracy is no worse than that of RND. These observations demonstrate that the PFSE scheme is capable of achieving high levels of security while incurring only a minor overhead of server storage (recall the result in Fig. 6d).

Comparison with LPFSE schemes

Fig. 10 The overall MLE attack result

For the comparison of attack results, we adopt the same settings for both PFSE and LPFSE schemes as above. As shown in Fig. 10, we find that, for all attributes, although both LPFSE-IBHE and LPFSE-BHE schemes can attain the same level of security as the RND scheme, our PFSE scheme outperforms them by a significant margin in the presence of a statistically optimal attacker. Our scheme demonstrates remarkable effectiveness in preventing the MLE attack (Lacharité and Paterson 2018) from acquiring meaningful information about the underlying dataset, even for attributes with small domains, such as SEX or reordered, that may be either “smoothed” or skewed.

Conclusion

In this paper, we revisit the notion of Frequency-Smoothing Encryption (FSE). We find that rigorous security definitions for FSE schemes are lacking, especially in the presence of inference attacks, and that current approaches are not efficient enough. We propose a novel FSE scheme based on a partitioning strategy, together with two security definitions from theoretical and practical perspectives. We conduct thorough evaluations of the performance and security of the PFSE scheme and compare it with previous FSE schemes. Experimental results show that our scheme has a significant performance advantage over previous methods without weakening robustness against inference attacks such as MLE (Lacharité and Paterson 2018). Our PFSE scheme is practical and secure for equality queries over encrypted databases.

Data availability

Not applicable.

References

  • Antonopoulos P, Arasu A, Singh KD, Eguro K, Gupta N, Jain R, Kaushik R, Kodavalla H, Kossmann D, Ogg N, et al (2020) Azure sql database always encrypted. In: Proceedings of the 2020 ACM SIGMOD international conference on management of data, pp 1511–1525

  • Arasu A, Blanas S, Eguro K, Joglekar M, Kaushik R, Kossmann D, Ramamurthy R, Upadhyaya P, Venkatesan R (2013) Secure database-as-a-service with cipherbase. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 1033–1036. ACM Press, New York, New York, USA

  • Bindschaedler V, Grubbs P, Tech C, Cash D, Ristenpart T, Shmatikov V (2018) The tao of inference in privacy-protected databases. Proc VLDB Endow 11(5):1

  • Bureau UC (2015) American community survey (ACS) 2015. http://www.census.gov/programs-surveys/acs/

  • Cash D, Grubbs P, Perry J, Ristenpart T (2015) Leakage-abuse attacks against searchable encryption. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, vol. 2015-Octob, pp 668–679. ACM, New York, NY, USA

  • Ceselli A, Damiani E, Vimercati SDCD, Jajodia S, Paraboschi S, Samarati P (2005) Modeling and assessing inference exposure in encrypted databases. ACM Trans Inf Syst Secur 8(1):119–152

  • Durak FB, DuBuisson TM, Cash D (2016) What else is revealed by order-revealing encryption? In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp 1155–1166

  • Fuller B, O’Neill A, Reyzin L (2015) A unified approach to deterministic encryption: new constructions and a connection to computational entropy. J Cryptol 28:671–717

  • Grubbs P, Sekniqi K, Bindschaedler V, Naveed M, Ristenpart T (2017) Leakage-abuse attacks against order-revealing encryption. In: 2017 IEEE symposium on security and privacy (SP), pp 655–672. IEEE

  • Grubbs P, Khandelwal A, Lacharité M-S, Brown L, Li L, Agarwal R, Ristenpart T (2020) Pancake: frequency smoothing for encrypted data stores. In: Usenix security

  • Hunter (2023) Supermarket dataset for predictive marketing 2023. https://www.kaggle.com/datasets/hunter0007/ecommerce-dataset-for-predictive-marketing-2023

  • Kamara S, Moataz T (2018) SQL on structurally-encrypted databases. In: Advances in cryptology–ASIACRYPT 2018: 24th international conference on the theory and application of cryptology and information security, Brisbane, QLD, Australia, December 2–6, 2018, proceedings, Part I 24, pp 149–180. Springer

  • Kerschbaum F (2015) Frequency-hiding order-preserving encryption. In: Proceedings of the ACM conference on computer and communications security, vol. 2015-Octob, pp 656–667. ACM, New York, NY, USA

  • Lacharité MS, Paterson KG (2018) Frequency-smoothing encryption: preventing snapshot attacks on deterministically encrypted data. IACR Trans Sym Cryptol 2018(1):277–313

  • Li D, Lv S, Huang Y, Liu Y, Li T, Liu Z, Guo L (2021) Frequency-hiding order-preserving encryption with small client storage. Proc VLDB Endow 14(13):3295–3307

  • Massey FJ Jr (1951) The Kolmogorov–Smirnov test for goodness of fit. J Am Stat Assoc 46(253):68–78

  • Naveed M, Kamara S, Wright CV (2015) Inference attacks on property-preserving encrypted databases. In: Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, vol. 2015-Octob, pp 644–655. ACM, New York, NY, USA

  • Poddar R, Boelter T, Popa RA (2019) Arx: an encrypted database using semantically secure encryption. Proc VLDB Endow 12(11):1664–1678

  • Popa RA, Li FH, Zeldovich N (2013) An ideal-security protocol for order-preserving encoding. In: 2013 IEEE symposium on security and privacy, pp 463–477. IEEE

  • Popa RA, Redfield CMS, Zeldovich N, Balakrishnan H (2011) CryptDB. In: Proceedings of the twenty-third ACM symposium on operating systems principles - SOSP ’11, p 85. ACM Press, New York, New York, USA

  • Pouliot D, Griffy S, Wright CV (2019) The strength of weak randomization: easily deployable, efficiently searchable encryption with minimal leakage. In: 2019 49th annual IEEE/IFIP international conference on dependable systems and networks (DSN), pp 517–529. IEEE

  • Tu S, Kaashoek MF, Madden S, Zeldovich N (2013) Processing analytical queries over encrypted data. Proc VLDB Endow 6(5):289–300

  • Zhu J, Cheng K, Liu J, Guo L (2021) Full encryption: an end to end encryption mechanism in GaussDB. Proc VLDB Endow 14:2811–2814

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No.62302242) and China Postdoctoral Science Foundation (No. 2023M731802).

Funding

None.

Author information

Contributions

HC completed the paper writing and experiment. YY made contributions to the revision of the paper. SL thoroughly checked the manuscript and gave insightful comments.

Corresponding author

Correspondence to Siyi Lv.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chen, H., Yang, Y. & Lv, S. Revisiting frequency-smoothing encryption: new security definitions and efficient construction. Cybersecurity 7, 15 (2024). https://doi.org/10.1186/s42400-024-00208-w

  • DOI: https://doi.org/10.1186/s42400-024-00208-w

Keywords