Confidential Machine Learning on Untrusted Platforms: A Survey

With ever-growing data and the need to develop powerful machine learning models, data owners increasingly depend on various untrusted platforms (e.g., public clouds, edges, and machine learning service providers) for scalable processing or collaborative learning. As a result, sensitive data and models are in danger of unauthorized access, misuse, and privacy compromises. A relatively new body of research addresses these concerns by confidentially training machine learning models on protected data. In this survey, we summarize notable studies in this emerging area of research. With a unified framework, we highlight the critical challenges and innovations in outsourcing machine learning confidentially. We focus on the cryptographic approaches for confidential machine learning (CML), primarily on model training, while also covering other directions such as perturbation-based approaches and CML in hardware-assisted computing environments. The discussion takes a holistic view, considering the rich context of the related threat models, security assumptions, design principles, and associated trade-offs amongst data utility, cost, and confidentiality.


Introduction
Data-driven methods, e.g., machine learning and data mining, have become essential tools for numerous research and application domains. With abundant data, data owners can build complex analytic models for areas ranging from social networking, healthcare informatics, and entertainment to advanced science and technology. However, limited in-house resources, inadequate expertise, or collaborative/distributed processing needs force data owners (e.g., parties that collect and analyze user-generated data) to depend on somewhat untrusted platforms (e.g., cloud/edge service providers) for elastic storage and data processing. As a result, cloud services for data analytics, such as machine-learning-as-a-service (MLaaS), have been growing rapidly during the past few years. While untrusted platforms refer to all non-in-house resources not directly owned by the data owner, we will use cloud services to represent them henceforth.
When outsourcing sensitive data (e.g., proprietary, human-related, or confidential data), data owners have raised concerns about privacy, confidentiality, and ownership [1,2]. On the one hand, cloud users cannot verifiably prevent the cloud provider from accessing their data; i.e., in practice, using public clouds often means one must fully trust the cloud provider. On the other hand, public cloud providers are not immune to security attacks leading to sensitive data breaches. Recent security incidents, including insider attacks [3,2] and external security breaches at the service providers [4,5], show that the risks are growing by the day. Researchers and practitioners have developed solutions to protect the confidentiality of cloud data at rest. For example, Google Cloud Platform allows users to include an external key manager, so that encrypted data is stored on the cloud while a third party (e.g., Fortanix) stores and manages the keys off the cloud. However, it remains a critical challenge for data owners and cloud providers to protect confidentiality in computing, i.e., to learn models on the cloud while protecting the confidentiality of both the training data and the learned models.
In the past few years, researchers have made some progress in developing novel confidential machine learning (CML) approaches for model training with encrypted data. Designing a successful CML approach is not straightforward. Unlike traditional machine learning approaches, a practical CML framework must balance security (confidentiality) guarantees, costs, and model quality, while distributing the workload appropriately between cloud and client. Direct application of cryptographic and privacy-protection methods such as fully homomorphic encryption (FHE) [6] and garbled circuits (GC) [7] in a homogeneous fashion does not usually meet the criteria for practical CML approaches. The most efficient approaches use hybrid methods that combine multiple primitives instead of a homogeneous translation. Recent studies [8,9,10,11,12,13] have followed this direction to effectively reduce performance bottlenecks and other practicality issues in developing CML solutions. However, the underlying techniques in these studies are scattered among several papers, which leaves the basic principles unclear. The purpose of this survey is to uncover these basic principles and accurately organize the existing techniques under a unified framework so that researchers and practitioners can quickly grasp the development and challenges in this new area of research.
Contributions and Organization Overview. Capturing a comprehensive view of a complex and new topic like confidential machine learning is challenging. We primarily focus on frameworks for model training using cryptographic techniques that guarantee strong (semantic) security with practical cost overhead. A complete machine learning service usually includes a model application (or model inference) component that applies the learned model to generate a prediction for new input data, which is equivalent to secure function evaluation. Confidential model inference is much simpler and in a more mature state than confidential model training and is therefore not covered in this survey. Interested readers may refer to the related studies about confidential inference with pre-trained models, such as Gilad-Bachrach et al. [14], Bost et al. [15], Hesamifard et al. [16], and Rouhani et al. [17].
This survey paper presents a unified perspective on designing and implementing different CML model learning methods with state-of-the-art cryptographic approaches. Although numerous machine learning methods exist [18], studies on CML have focused on only a few of them. On the other hand, researchers have applied several cryptographic methods to realize CML frameworks. We observe that many clever CML techniques apply to specific machine learning algorithms without clearly establishing the basic principles for extending these techniques to broader machine learning algorithms. To systematically understand the set of developed techniques in CML, we summarize them under a general framework: the decomposition-mapping-composition (DMC) procedure plus the design and selection of crypto-friendly algorithms. The DMC procedure involves decomposing the target machine learning algorithm into several components, mapping these components to their cryptographic constructions, and finally composing the CML solution with the confidential component counterparts. Moreover, several CML approaches that adopt the DMC procedure have a unique additional feature: they use "crypto-friendly" alternative machine learning algorithms or components to achieve more efficient protocols. Keeping these observations in mind, we develop a systematization framework to summarize the design principles, strategies, cryptographic techniques, and optimization measures that have been applied to solve the challenging problems in confidentially learning models over encrypted data.
We organize the survey based on the underlying design principles of CML rather than any specific machine learning problems. As part of the survey, we summarize the experiences and lessons learned in each category of CML topics as insights and gaps. This work promotes practical aspects of applying cryptographic primitives in CML at their current level of maturity. The focus is on how different frameworks balance the associated trade-offs amongst cost, confidentiality, and data utility or model quality in different threat models and privacy settings. The survey, however, does not cover the orthogonal line of research that aims to optimize fully expressive primitives such as FHE and GC schemes. This survey aims to be a useful resource for researchers who adopt and advance privacy-enhancing technologies in solving novel research questions and for practitioners who want to learn the best practices and avoid common pitfalls.
In the following sections, we first present the necessary background knowledge, notations, definitions, and the targeted threat model in Section 3. Then, in Section 4, we present the systematization framework along with the basic principles and methodologies in CML development. After that, we briefly discuss the homogeneous approaches that aim to translate any machine learning algorithm into a confidential one with a single cryptographic primitive (Section 5). Next, we move to the main theme: the compositional hybrid approaches (Section 6), which have resulted in more efficient protocols for complex machine learning models. We also cover several topics, such as security proofs and common evaluation methods for cryptographic protocols, in Sections 7 and 8. Finally, we briefly review other approaches that are not based on cryptographic protocols, including the perturbation methods and hardware-assisted (e.g., SGX) methods.

Related Work
A few survey papers are related to the topic of this paper. Shan et al. [19] focus on techniques for practical secure outsourced computation, using machine learning as a sample application. However, their survey does not comprehensively cover the major approaches as we do. Attacks on the integrity of machine learning models have also raised serious concerns due to the wide applications of machine learning in real-life scenarios such as self-driving cars [20]. Whereas our survey focuses on the confidentiality of the model learning process, Papernot et al. [21] focus on the integrity of training data, learning process, models, and model application.
There are also several survey papers on specific categories of cryptographic primitives. Since the first fully homomorphic encryption scheme was published in 2009 [6], homomorphic encryption has been an active research area. Acar et al. [22] provide a comprehensive review of the current development of homomorphic encryption schemes. Secure multi-party computation methods, including the garbled circuits and secret sharing methods, have been actively developed for the past two decades. Readers may find more information in other sources [23,24].
Differentially private machine learning frameworks are somewhat related to CML but hold a distinct threat model that aims to share data and models. They assume that the data consumer (i.e., the model developer or model user) is not trusted and may try to reveal private information in the training data shared by data owners or data contributors. Differential privacy does not protect the ownership of data and models, as the purpose is to share them without breaching individuals' privacy in the training data. Along with recent developments on differentially private deep learning such as Abadi et al. [25] and Shokri et al. [26], Ji et al. [27] and Sarwate and Chaudhuri [28] also provide excellent surveys on this topic. Other studies in privacy-preserving data mining (PPDM) [29,30,31,32] also aim to share the data (and the models) while preserving individuals' privacy and are thus excluded from our survey.

Preliminaries of CML Approaches
In this section, we review the terms and concepts used in the literature. First, we look at the representative system architectures considered in the published confidential machine learning (CML) approaches based on cryptographic protocols. Then, we examine how different threat models, associated confidential assets, and considered attacks affect CML designs. Finally, we briefly describe prevailing cryptographic and privacy primitives that serve as the skeleton of most CML approaches.

System Architectures
CML research was motivated by the cloud computing paradigm and has since been extended to more scenarios, such as edge computing and services computing. Thus, we use "Cloud" as the representative of untrusted platforms in CML system architectures henceforth. Such a system may involve cloud providers, optional cryptographic service providers, data owners or application service providers, and data and model consumers. Figure 1 shows an architecture with a data owner outsourcing its data and computation to a single cloud provider. The data owner must ensure the cloud provider does not compromise any proprietary and privacy-sensitive data. A few homomorphic-encryption-based frameworks, e.g., Graepel et al. [33] and Lu et al. [34], present protocols for training machine learning models over encrypted data outsourced to a cloud provider with almost no engagement of the data owner. However, the associated cost makes these protocols unrealistic in real-life scenarios.
An alternative strategy involves the data owner in a minimal set of intermediate tasks to simplify the single-cloud framework [12]. As long as the cloud takes the majority of the workload and the client's cost is practical, e.g., linear or sublinear in the number of records, more efficient protocols become possible. As some protocols became too expensive for the data owner to assist cloud-centric learning, the architecture evolved to a multi-server (multi-cloud) setting. A data owner may choose to rely on two or more cloud providers to reduce the overall expense of learning. The second party may be as capable as the first party [11], or it may be a cryptographic service provider (CSP) that manages keys and assists the cloud with intermediate decryption operations and lightweight computations [8,9,13]. The two non-client parties in such an architecture carry out secure multiparty computations without either party learning the training data or the trained model. This setting also assumes that the two parties do not collude with each other and is thus slightly more vulnerable than the client-cloud two-party setting. Figure 2 shows such a framework that uses a garbled circuit.

Threat Models
In this section, we examine the widely accepted threat models in the context of CML. We focus on the following aspects: the assumptions on the adversaries and the related confidential assets in CML.
Assumptions on Adversaries. Most CML approaches [13,11,8,9,33] adopt the honest-but-curious (or semi-honest) adversary model to describe the untrusted cloud provider. Honest-but-curious parties, by definition, perform their share of tasks obediently, i.e., they guarantee data and model integrity and follow the pre-defined protocols exactly. However, they might clandestinely snoop on the storage, interactions, and computations to learn private information. Preserving the confidentiality of data and models alleviates the concerns of data owners and data contributors about data and model leakage, even when the infrastructure platforms are reputable. Many CML approaches also use an honest-but-curious cryptographic service provider to design more efficient protocols. Some CML approaches additionally address an adversary that actively seeks to compromise data and model confidentiality by performing additional probing tasks, e.g., by inserting crafted records or secretly running the algorithms on a selected record set offline. Sharma and Chen [13] address the possibility of an adversary who may actively trace identifiable training records in the datasets and follow the computations to infer information about other training records. Nikolaenko et al. [9] consider an adversary that selectively runs the machine learning protocol over an individual's data to draw personal inferences from the learned models. Nevertheless, with either passive or active adversaries, CML approaches assume that data and model integrity is not compromised at the end of the training. This assumption distinguishes CML from other studies, such as those on attacks that pollute training data or modify learned models [35].
Moreover, the CML approaches often assume non-collusion between the involved parties, for example, between the cloud provider and the CSP [8,9,13] or the two cloud providers [11] in the two-server architecture. Collusion between the two parties in these frameworks directly compromises the privacy of the training data and learned models.
Most CML approaches assume that data and model consumers are trusted, which is orthogonal to the applications of differential privacy [26,25] that specifically target untrusted data and model consumers. Furthermore, CML approaches assume properly secured infrastructures and communication channels to exclude external attacks and focus on the CML-specific challenges.
Confidential Assets at Risk. An adversarial party may be interested in the confidentiality of sensitive data and the generated models. All CML methods protect the training data feature vectors. Some methods designed for supervised learning [33,9] expose the training data labels to simplify their secure modeling algorithms with the assumption that knowing the labels will not bring significantly more information to adversaries, which might be false for some applications. Some CML studies also expose unprotected models [33,9,34]. However, recent studies [36,37,38,39,40] have shown that an adversary may use crafted data to infer sensitive training data or use the advanced features in deep learning models to breach data privacy. Furthermore, the intermediate results of outsourcing computations in the setting of federated learning, for example, the intermediate representation in a convolutional neural network learning, may reveal information about the private training data [26]. Thus, CML must protect both data and model confidentiality.

Cryptographic Primitives
Cryptographic primitives are the fundamental building blocks of CML approaches. Some of these primitives are more expressive, meaning they can implement more types of functions or higher-level functions. On the other hand, some primitives are more cost-efficient than others. To make this survey self-contained, in this section, we briefly cover the most frequently used primitives in existing CML approaches.
Additive Homomorphic Encryption (AHE). AHE schemes (e.g., Paillier encryption [41]) allow the additive operation over encrypted messages without decryption. For any two integers α and β, an AHE scheme allows the additive homomorphic operation $E(\alpha + \beta) = f(E(\alpha), E(\beta))$, where the function $f$ works on encrypted values. Conceptually, with one of the operands unencrypted, a "pseudo-homomorphic" multiplication between two messages can be expressed as a series of additions [1], i.e., $E(\alpha\beta) = E\left(\sum_{i=1}^{\beta} \alpha\right)$. With homomorphic addition and pseudo-homomorphic multiplication, one can derive pseudo-homomorphic dot-product of vectors, matrix-vector multiplication, and matrix-matrix multiplication. However, the unencrypted operands in these operations either need to be non-sensitive information or protected with some masking and de-masking mechanism [12,13]. ElGamal, Goldwasser-Micali, Benaloh, and Okamoto-Uchiyama cryptosystems are some additional examples of AHE schemes [22].
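To make the additive and pseudo-homomorphic operations concrete, the following Python sketch implements textbook Paillier encryption with the common generator choice g = n + 1. It is a minimal illustration only: the function names (keygen, encrypt, decrypt, add, scalar_mul) and the tiny primes are assumptions chosen for readability, not part of any surveyed framework, and the parameters offer no real security.

```python
import math
import secrets

def keygen(p=293, q=433):
    """Return (n, (lam, mu)) for toy primes p and q (illustrative sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m in [0, n) as (1 + m*n) * r^n mod n^2."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(n, sk, c):
    lam, mu = sk
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def add(n, c1, c2):
    """E(a + b): multiply the two ciphertexts modulo n^2."""
    return c1 * c2 % (n * n)

def scalar_mul(n, c, k):
    """E(a * k) for an unencrypted k: raise the ciphertext to the power k."""
    return pow(c, k, n * n)

n, sk = keygen()
c_sum = add(n, encrypt(n, 17), encrypt(n, 25))
c_prod = scalar_mul(n, encrypt(n, 17), 25)
print(decrypt(n, sk, c_sum), decrypt(n, sk, c_prod))  # 42 425
```

Note that scalar_mul needs the second operand in the clear, which is why the text requires that operand to be non-sensitive or masked.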
Somewhat Homomorphic Encryption (SHE). There are many encryption schemes in this category (e.g., BV, BGV, NTRU, GSW, BFV, and BGN [22] and their variations such as TFHE [42] and CKKS [43]). SHE schemes allow both homomorphic additions and multiplications over encrypted messages, while the number of consecutive multiplications is limited to a few. A popular SHE scheme used in CML is the ring learning-with-errors (RLWE) scheme that relies on the intractability of the learning-with-errors (LWE) problem on polynomial rings [44]. Theoretically, RLWE supports arbitrary levels of multiplications. Therefore, it is considered to be fully homomorphic. However, due to the associated high cost for deeper levels of multiplications, RLWE is more suitable as a SHE scheme only (i.e., 1-3 levels of multiplications). A ciphertext in RLWE is represented as a two-tuple $(c_0, c_1)$, where $c_0$ and $c_1$ are polynomials. Let $C_i = (c_{0,i}, c_{1,i})$ and $C_j = (c_{0,j}, c_{1,j})$ be the ciphertexts of any two values. The encrypted addition of the two values is simply $(c_{0,i} + c_{0,j}, c_{1,i} + c_{1,j})$. The encrypted multiplication is translated to a series of polynomial operations on the ciphertext elements. RLWE allows multiple levels of multiplication at a certain cost. For details, please refer to the paper [44]. Message packing [44] enables packing multiple plaintext values into one ciphertext, which considerably reduces RLWE's ciphertext size and optimizes linear algebra operations [45]. The HElib library [45] is a popular implementation of the RLWE scheme.
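The two-tuple ciphertext and its component-wise addition can be illustrated with a toy symmetric-key RLWE-style scheme in Python. This is a sketch under stated assumptions, not the scheme of [44] or the HElib implementation: the ring degree N, moduli Q and T, the ternary noise, and all function names are hypothetical choices that keep the example small, and the parameters are far too small to be secure. Multiplication and relinearization are omitted.

```python
import random

N = 8            # ring degree: polynomials are taken modulo x^N + 1
Q = 1 << 15      # ciphertext modulus
T = 16           # plaintext modulus (each coefficient carries a value in [0, T))
DELTA = Q // T   # scaling factor that lifts messages above the noise

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_mul(a, b):
    """Negacyclic convolution: x^N wraps around to -1."""
    out = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + x * y) % Q
            else:
                out[k - N] = (out[k - N] - x * y) % Q
    return out

def small_poly(rng):
    return [rng.choice((-1, 0, 1)) for _ in range(N)]

def encrypt(s, m, rng):
    """Return (c0, c1) with c0 = DELTA*m + e - a*s and c1 = a."""
    a = [rng.randrange(Q) for _ in range(N)]
    e = small_poly(rng)
    a_s = poly_mul(a, s)
    c0 = [(DELTA * mi + ei - asi) % Q for mi, ei, asi in zip(m, e, a_s)]
    return c0, a

def decrypt(s, ct):
    c0, c1 = ct
    noisy = poly_add(c0, poly_mul(c1, s))  # equals DELTA*m + e (mod Q)
    return [((v + DELTA // 2) // DELTA) % T for v in noisy]

def ct_add(ct_i, ct_j):
    """Homomorphic addition: add the ciphertext tuples component-wise."""
    return poly_add(ct_i[0], ct_j[0]), poly_add(ct_i[1], ct_j[1])

rng = random.Random(0)   # fixed seed for reproducibility; not a secure RNG
s = small_poly(rng)      # secret key: a small polynomial
m1 = [1, 2, 3, 4, 5, 6, 7, 8]
m2 = [15, 0, 9, 3, 1, 1, 2, 12]
total = decrypt(s, ct_add(encrypt(s, m1, rng), encrypt(s, m2, rng)))
print(total == [(x + y) % T for x, y in zip(m1, m2)])  # True
```

Each addition also adds the noise terms, which is harmless here because the accumulated noise stays far below DELTA / 2.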
Garbled Circuits (GC). Garbled circuits [7] allow two parties, each holding an input to a function, to securely evaluate the function without revealing any information about their respective inputs. GC can express arbitrary functions using several basic gates, such as AND and XOR gates, in a secure two-party computation setting (usually with a Cryptographic Service Provider (CSP)). One party constructs the circuit, whereas the other evaluates it. Despite several GC cost optimization techniques, such as Free XOR gates [46], Half AND gates [47], and OT extensions [48], GC still incurs high communication costs. Therefore, one must carefully examine its use in composing CML frameworks. FastGC [49] and ObliVM [50] are two popular GC libraries.
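The division of roles between the constructing and the evaluating party can be seen in a single garbled AND gate. The Python sketch below is a simplified textbook construction written for illustration: the hash-based row encryption, the all-zero tag used to recognize the correct row, and the function names are assumptions, and none of the optimizations mentioned above are applied. In a real protocol the evaluator would obtain its input label through oblivious transfer, which is omitted here.

```python
import hashlib
import random
import secrets

LABEL_BYTES = 16
TAG = b"\x00" * 16  # redundancy that lets the evaluator recognize the valid row

def _pad(label_a, label_b):
    # One-time pad derived from the two input labels (32 bytes from SHA-256).
    return hashlib.sha256(label_a + label_b).digest()

def _xor(x, y):
    return bytes(p ^ q for p, q in zip(x, y))

def garble_and_gate():
    """Garbler: pick two random labels per wire and encrypt the truth table."""
    labels = {
        wire: (secrets.token_bytes(LABEL_BYTES), secrets.token_bytes(LABEL_BYTES))
        for wire in ("a", "b", "out")
    }
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = labels["out"][bit_a & bit_b]
            pad = _pad(labels["a"][bit_a], labels["b"][bit_b])
            table.append(_xor(pad, out_label + TAG))
    random.SystemRandom().shuffle(table)  # hide which row matches which inputs
    return labels, table

def evaluate(table, label_a, label_b):
    """Evaluator: holds one label per input wire and learns one output label."""
    pad = _pad(label_a, label_b)
    for row in table:
        plain = _xor(pad, row)
        if plain.endswith(TAG):
            return plain[:LABEL_BYTES]
    raise ValueError("no row decrypted correctly")

labels, table = garble_and_gate()
out = evaluate(table, labels["a"][1], labels["b"][1])
print(out == labels["out"][1])  # True
```

The evaluator sees only random-looking labels, so it learns the output label without learning which bit each input label encodes. The four-row table per gate also hints at why communication dominates the cost of large circuits.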
Randomized Secret Sharing (SecSh). The randomized secret sharing method [10] protects data by splitting it into two (or multiple) random additive shares outsourced to two (or more) non-colluding untrusted parties. The two parties compute on the respective shares and return the results also as random shares. Addition is straightforward, as $\alpha + \beta = (\alpha_0 + \beta_0) + (\alpha_1 + \beta_1)$ with α and β distributed between two parties 0 and 1. Multiplication, however, is expensive, as it depends on the Beaver triplet generation method [10,11], which further depends on expensive AHE or Oblivious Transfer (OT) schemes to exchange the intermediate results securely.
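The following Python sketch shows two-party additive sharing with local addition and Beaver-triplet multiplication. It is an illustration under simplifying assumptions: both parties are simulated in one process, the modulus 2^64 and the function names are hypothetical choices, and a trusted dealer stands in for the AHE- or OT-based triplet generation, which is the expensive step in practice.

```python
import secrets

MOD = 2 ** 64

def share(x):
    """Split x into two random additive shares modulo MOD."""
    x0 = secrets.randbelow(MOD)
    return x0, (x - x0) % MOD

def reconstruct(s0, s1):
    return (s0 + s1) % MOD

def add_shares(x_sh, y_sh):
    # Each party adds the shares it holds; no communication is needed.
    return tuple((xi + yi) % MOD for xi, yi in zip(x_sh, y_sh))

def beaver_triple():
    # Assumption: a trusted dealer replaces the AHE/OT-based generation.
    u, v = secrets.randbelow(MOD), secrets.randbelow(MOD)
    return share(u), share(v), share(u * v % MOD)

def mul_shares(x_sh, y_sh, triple):
    u_sh, v_sh, w_sh = triple
    # The parties open d = x - u and e = y - v, which look uniformly random.
    d = reconstruct(*[(xi - ui) % MOD for xi, ui in zip(x_sh, u_sh)])
    e = reconstruct(*[(yi - vi) % MOD for yi, vi in zip(y_sh, v_sh)])
    z = []
    for party in (0, 1):
        zi = (d * v_sh[party] + e * u_sh[party] + w_sh[party]) % MOD
        if party == 0:
            zi = (zi + d * e) % MOD  # the public term d*e is added only once
        z.append(zi)
    return tuple(z)

a_sh, b_sh = share(12345), share(678)
print(reconstruct(*add_shares(a_sh, b_sh)))                   # 13023
print(reconstruct(*mul_shares(a_sh, b_sh, beaver_triple())))  # 8369910
```

Correctness of the product follows from xy = (d + u)(e + v) = de + dv + eu + uv, where uv is available only in shared form.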
Random Additive Masking. A data owner may generate a random mask to hide the sensitive data, which will be stripped off at a certain step in the CML protocol to recover the desired result. Due to its low cost, it frequently serves as an auxiliary tool for a complex protocol, for instance, in CML for spectral clustering [12], boosting [13], and matrix factorization [8].
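As a minimal sketch of the mask-then-strip idea, the Python example below hides a sensitive vector behind a uniformly random mask, lets an untrusted party evaluate a linear function on the masked vector, and removes the mask's contribution afterwards. The modulus, the function names, and the choice of a dot product as the outsourced operation are assumptions for illustration. In the cited protocols the correction term is typically obtained in combination with other primitives rather than recomputed in the clear by the owner.

```python
import secrets

MOD = 2 ** 32

def mask(x):
    """Data owner: add a fresh uniformly random mask to every entry of x."""
    r = [secrets.randbelow(MOD) for _ in x]
    return [(xi + ri) % MOD for xi, ri in zip(x, r)], r

def dot(w, v):
    """Linear operation carried out by the untrusted party on masked data."""
    return sum(wi * vi for wi, vi in zip(w, v)) % MOD

def unmask(masked_result, w, r):
    """Strip the mask: w.(x + r) - w.r = w.x (mod MOD)."""
    return (masked_result - dot(w, r)) % MOD

x = [3, 1, 4]   # sensitive data
w = [2, 7, 1]   # coefficients known to the computing party
masked_x, r = mask(x)
print(unmask(dot(w, masked_x), w, r))  # 17
```

Because the mask is uniform modulo MOD and used only once, the masked vector on its own reveals nothing about x.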

Systematization Framework
It is challenging to gain a clear understanding of the whole body of CML model training methods for the following reasons. First, the number of machine learning models is huge [18], and even the most commonly used ones number in the tens [51]. They are so different that no unified framework can describe them. Second, security researchers are often more interested in a specific utility-preserving cryptographic primitive and pick the machine learning algorithms they are most familiar with. As a result, the results are scattered, focusing on either a specific machine learning model or the application of a novel cryptographic primitive. There is no thorough understanding of which primitive (or framework) is best for a specific machine learning method or whether a CML method can be extended to other machine learning models. The fundamental principles for developing all (or most) CML model training methods are missing.
Categories of CML approaches. We believe this survey is the first effort to systematically organize and analyze the whole body of most representative CML approaches. We focus on the major category of methods: the pure software-based cryptographic protocols, while also briefly reviewing the perturbation-based approaches and the hardware-assisted approaches. Figure 3 shows the systematization framework. The fundamental features of the three categories are as follows.
• The cryptographic protocols are the focus of this survey and can be further divided into two categories: those using one cryptographic primitive homogeneously and those employing novel hybrid compositions of multiple primitives. The homogeneous approaches take one of the homomorphic encryption (HE) schemes or garbled circuits to develop the solution. The hybrid approaches involve multiple primitives and often a clever composition strategy to achieve lower overall costs. We will analyze them in more detail.
• The perturbation-based CML approaches depend on novel data transformations to preserve a certain type of data utility, e.g., Euclidean distance, that is critical to one or multiple machine learning methods. Their security mainly depends on secret transformation parameters and random noise addition, holding a different and somewhat weaker security notion compared to cryptographic protocols. However, they are often much more efficient and thus appealing for many applications that seek better protection than plaintext-based approaches without incurring significantly more overhead.
• The third category depends on trusted execution environments, such as Intel SGX [52], which demand hardware-level support and are thus distinct from the former two categories of pure software approaches. The hardware-level features enforce secure enclaves, in which the adversaries cannot observe the running programs and data.
Common CML Development Strategies. We look into a unified framework to analyze both the homogeneous and hybrid approaches. Fundamentally, most approaches aim to design an efficient and secure transformation of a specific machine learning algorithm (or a class of algorithms) for the setting of two or three distributed parties (see Section 3.1). To make the transformation easier, researchers often implicitly use the Decomposition-Mapping-Composition (DMC) procedure: decomposing the target algorithm into different sub-components, mapping the sub-components to crypto-primitives, and composing the CML framework with the confidential sub-components. Many approaches skip the description of this whole procedure and only present the final composition, which makes it difficult for newcomers to fully appreciate the fundamental ideas scattered across several approaches.
Beyond the straightforward DMC procedure, we have also noticed a unique feature [13] specific to the CML development: finding "crypto-friendly" alternative machine learning algorithms or components. This feature is unique to machine learning algorithms because all machine learning algorithms essentially try to find an approximate model fitting the training data, and there is no unique model for a specific problem, only better or worse ones. In general, machine learning methods can be roughly categorized into two types: supervised learning that depends on labeled datasets and unsupervised learning [18]. For each type, there are numerous algorithms working under the same setting but performing differently for specific applications or datasets. Even for the same algorithm, there are many variants. For example, different base classifiers can be used to make ensemble classifiers [53], and different activation functions can be used for neural networks [54]. Among so many machine learning algorithms, some are more crypto-friendly, i.e., they can be converted to more efficient CML solutions.
With all these features in mind, we reassemble the common development framework behind most CML approaches (Procedure 1).
Procedure 1 A common procedure for developing CML methods
procedure Generalized procedure for CML development(A)
    A: the target algorithm
    identify the desired architecture and involved parties.
    identify a list of alternative algorithms of A
    for each candidate algorithm do
        decompose the algorithm into basic components
        for each component do
            identify possible approximate/equivalent solutions
            for each solution do
                identify candidate crypto-primitive mappings
            end for
            select the best solution and mapping.
        end for
        find the best composition method.
    end for
    evaluate the candidate alternative algorithms and identify the best one.
end procedure

Note that most of the steps in this procedure cannot be automated, and thus each specific approach represents the result of enormous effort behind the scenes. Next, we analyze the homogeneous and hybrid approaches under this unified procedure.

Homogeneous approaches rely on a single primitive to construct the framework protocols. The primitives used in the homogeneous composition of CML fall broadly into two categories: (1) Fully Homomorphic Encryption (FHE) and Garbled Circuits (GC) and (2) Additively Homomorphic Encryption (AHE) and Somewhat Homomorphic Encryption (SHE). Since FHE implements arbitrary levels of homomorphic addition and multiplication and GC implements the boolean gates, in theory, they can individually construct all CML algorithms. FHE and GC are, therefore, the most expressive privacy primitives. However, both FHE and GC are too expensive to be practical for training complex CML models. In contrast, AHE and SHE schemes provide limited support for encrypted operations; they are therefore less expressive and can only enable relatively simple algorithms. Most approaches we discuss next are relatively simple, and thus an AHE or SHE scheme is sufficient. The decomposition and mapping steps of the DMC procedure described in the last section are still at play in the homogeneous approaches, but the composition step is trivial.
AHE and SHE are widely used to construct homogeneous solutions for applications involving only one or a few multiplications, including the elementary statistical aggregation functions, such as average, sum, and variance. Graepel et al. [33] present a SHE-based framework for learning Fisher's linear discriminant analysis and Linear Means Classifier models on encrypted data. However, the implemented models are limited to linearly separable datasets. Lu et al. [34] apply SHE to the more sophisticated principal component analysis and linear regression training [18]. However, due to the limited message space of the selected SHE implementation (60 bits in HElib) and the limited number of possible multiplications, only low data dimensionality (about 20) and a few training iterations were used in their evaluation. Such restrictions resulted in only sub-optimal models.
More sophisticated machine learning algorithms often result in expensive homogeneous solutions. Phong et al. [55] employ LWE and Paillier encryption to encrypt the gradients in their privacy-preserving deep learning framework. The framework, however, takes over 2.5 hours to complete one iteration of a simple neural network training for 20,000 MNIST images. Researchers also aim to provide libraries for homogeneous learning based on Garbled Circuits (GC). However, their practical use is limited by huge costs [50]. Liu et al. [50] present a GC-based KMeans learning framework that involves two untrusted servers. The associated cost overhead, however, is far from practical in real-world settings. For example, the KMeans implementation required over 2,000 million AND gates and more than 200 GB of communication for clustering just 6,000 data points. Rouhani et al. [56] propose a deep learning model inference framework using garbled circuits to protect both the model's parameters and test data samples. Similarly, the costs are staggeringly high.
Insight. Homogeneous solutions are often limited to simple functions involving only additions (for AHE), a few multiplications (SHE), or a few comparisons (GC). Individually, these crypto primitives are not practical to construct complex CML algorithms. However, they can be valuable components for hybrid solutions, as we will see later.

Hybrid Composition
As discussed above, depending on a single cryptographic primitive to compose a sophisticated CML algorithm is impractical. However, each primitive has its unique strengths and shortcomings (e.g., in performance, storage, or bandwidth) in attaining certain operations. This realization leads to an interesting question: can we combine different primitives to compose secure yet more optimized protocols? The idea of hybrid composition is thus to mix and switch amongst several privacy primitives to avoid the associated cost bottlenecks and restrictions of any individual primitive. This section looks into the details of specific steps of the DMC procedure. First, we dissect the common sub-components and underlying operations in machine learning algorithms. We examine the various ways to implement these sub-components and operations confidentially. Then, we explore the different switching and mixing strategies, including some recent automated ones, that are essential to hybrid CML frameworks in practice. Finally, we discuss the unique feature or desired requirement of CML development: designing crypto-friendly machine learning algorithms or sub-components for cost-efficient and practical CML solutions.

Basic Operations
We devote this subsection to inspecting the mapping of the foundational subcomponents of the target machine learning algorithms to their confidential versions. We observe that some of these mappings are practical or crypto-friendly, whereas others may face cost bottlenecks and limitations. The understanding of the different implementations of basic operations will affect the composition strategies.
Simple Arithmetic Operations. With an AHE or SHE encryption scheme, one can conveniently add two encrypted integers. Adding two b-bit integers with the Paillier cryptosystem involves a modular multiplication with $O(b^2)$ complexity. Additions with an RLWE-like scheme involve polynomial additions linear in the number of bits for the given polynomial degree [57]. With a specific integer encoding, subtraction can be trivially expressed as encrypted addition. SHE schemes allow homomorphic multiplications over encrypted integers. RLWE-like cryptosystems allow several rounds of multiplications and additions. However, with each additional multiplication, the ciphertext noise, ciphertext size, and cost increase. Generally, multiplying two b-bit integers with RLWE-like cryptosystems involves homomorphically computing $O(b^2)$ AND circuits [57]. On the other hand, the AHE scheme requires one of the operands to be unencrypted so that multiplication can be expressed as repeated summation. With Paillier encryption, multiplication is a modular exponentiation of the encrypted b-bit message by the unencrypted b-bit operand, with a cost complexity of $O(b^3)$. The caveat of AHE multiplication is that if the unencrypted operand is privacy-sensitive, it must be masked, with the mask removable after the multiplication is complete [12,8].
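To make the AHE operations above concrete, the following is a minimal sketch of a toy Paillier cryptosystem. It assumes tiny hard-coded primes and the generator g = n + 1, so it is insecure and meant for illustration only. The function names (keygen, encrypt, decrypt, he_add, he_mul_plain) are illustrative choices and do not come from any library or from the surveyed works.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # with g = n + 1, mu is the inverse of lam mod n
    return n, (lam, mu)

def encrypt(n, m):
    """Enc(m) = (n+1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def he_add(n, c1, c2):
    """Homomorphic addition: one modular multiplication of ciphertexts."""
    return (c1 * c2) % (n * n)

def he_mul_plain(n, c, k):
    """Multiplication by a plaintext k: one modular exponentiation."""
    return pow(c, k, n * n)
```

With these helpers, decrypt(n, sk, he_add(n, encrypt(n, a), encrypt(n, b))) recovers a + b modulo n, and he_mul_plain multiplies an encrypted value by an unencrypted operand, which is why the second operand must be available in the clear (or masked) on the computing party's side.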
Additions and subtractions are trivial with randomized secret sharing in the multiparty setting, with constant time complexity. Each party performs additions and subtractions on its respective shares of data and shares the results for recovery. A GC protocol for addition requires two parties to construct O(b) AND gates and carry out O(b) communication, encryptions, and decryptions along with O(b) oblivious transfers when adding two b-bit integers. Multiplication with randomized secret sharing involves a costly multiplicative triplet generation scheme that relies on oblivious transfer or AHE [10,11]. For example, the AHE-based scheme incurs the transmission of two encrypted integers between the parties, with each party performing two homomorphic encryptions, multiplications, additions, and decryptions.
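The share-based addition and triplet-based multiplication described above can be sketched as follows. This is a simplified two-party simulation in a single process, assuming additive shares over the ring of 32-bit integers. The function beaver_triple is a hypothetical trusted-dealer stand-in for the OT- or AHE-based triplet generation, which is the expensive part in practice and is not implemented here.

```python
import random

MOD = 2 ** 32  # shares live in the ring Z_{2^32} (an illustrative choice)

def share(x):
    """Split x into two additive shares modulo MOD."""
    s0 = random.randrange(MOD)
    return s0, (x - s0) % MOD

def reconstruct(s0, s1):
    """Recover the secret by adding both shares."""
    return (s0 + s1) % MOD

def add_shares(x_sh, y_sh):
    """Each party adds its own shares locally; no communication is needed."""
    return (x_sh[0] + y_sh[0]) % MOD, (x_sh[1] + y_sh[1]) % MOD

def beaver_triple():
    """Hypothetical dealer: shares of random a, b and of c = a * b."""
    a, b = random.randrange(MOD), random.randrange(MOD)
    return share(a), share(b), share(a * b % MOD)

def mul_shares(x_sh, y_sh):
    """Multiply two shared values using one multiplicative triplet."""
    a_sh, b_sh, c_sh = beaver_triple()
    # The parties open e = x - a and f = y - b; a and b are uniform masks.
    e = reconstruct((x_sh[0] - a_sh[0]) % MOD, (x_sh[1] - a_sh[1]) % MOD)
    f = reconstruct((y_sh[0] - b_sh[0]) % MOD, (y_sh[1] - b_sh[1]) % MOD)
    # x*y = c + e*b + f*a + e*f; only party 0 adds the public e*f term.
    z0 = (c_sh[0] + e * b_sh[0] + f * a_sh[0] + e * f) % MOD
    z1 = (c_sh[1] + e * b_sh[1] + f * a_sh[1]) % MOD
    return z0, z1
```

The online phase of multiplication only needs the two opened values e and f, which is why the cost concentrates in the offline triplet generation.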
Multiplying two integers of b bits with GC, on the other hand, requires the construction and evaluation of $O(b^2)$ AND gates involving $O(b^2)$ communication, encryption, and decryption.
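The gap between O(b) and $O(b^2)$ AND gates can be seen by counting gates in plain Boolean circuits. The sketch below evaluates a ripple-carry adder and a schoolbook multiplier on cleartext bits and counts only AND gates, under the assumption that XOR gates are free (as with the free-XOR optimization). It does not garble anything, and the class and function names are illustrative.

```python
class GateCounter:
    """Counts AND gates; XOR is treated as free in this sketch."""
    def __init__(self):
        self.and_gates = 0

    def AND(self, a, b):
        self.and_gates += 1
        return a & b

def to_bits(x, b):
    """Little-endian list of the b low-order bits of x."""
    return [(x >> i) & 1 for i in range(b)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def add(ctr, xs, ys):
    """Ripple-carry adder using one AND per bit; the final carry is dropped."""
    carry, out = 0, []
    for a, b in zip(xs, ys):
        out.append(a ^ b ^ carry)
        # majority(a, b, carry) computed with a single AND gate
        carry = carry ^ ctr.AND(a ^ carry, b ^ carry)
    return out

def mul(ctr, xs, ys):
    """Schoolbook multiplier modulo 2^b: b partial products and b additions."""
    b = len(xs)
    acc = [0] * b
    for i, y in enumerate(ys):
        partial = [0] * i + [ctr.AND(x, y) for x in xs[: b - i]]
        acc = add(ctr, acc, partial)
    return acc
```

For b-bit inputs, add uses exactly b AND gates, while mul uses b(b+1)/2 gates for the partial products plus $b^2$ for the additions, so doubling the bit width roughly quadruples the multiplication circuit.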
Comparison. Comparison is essential in many operations, such as sorting vectors and applying activation functions in training neural networks. Unfortunately, comparing two encrypted or protected integers is not trivial. Graepel et al. [33] cite the complexity of comparison as the reason to avoid algorithms like perceptrons and logistic regression in their SHE-based confidential ML framework. Veugen [58] presents a client-server interactive comparison protocol for two encrypted integers based on the AHE scheme, which involves the computation and transfer of b AHE-encrypted bits. Each comparison incurs O(b) homomorphic multiplications for both client and server. Lu et al. [34] use the "greater than" protocol [59], optimized with the message packing of the RLWE scheme, for comparing two encrypted messages in a two-party setting. However, the associated complexity is an astonishing $O(2^b/h)$ homomorphic additions when comparing two b-bit integers while packing h messages in a ciphertext. With GC, a comparison between two b-bit integers is possible with O(b) AND gates and O(b) communication, encryption, and decryption by two parties. Since GC-based comparison for full integers is expensive, one may use an efficient one-bit sign-checking protocol [11,13] by encoding negative integers in two's complement, making the comparison cost constant in the number of bits. Note that the GMW protocol of Goldreich, Micali, and Wigderson [60] can perform comparisons just as garbled circuits do, but with O(b) rounds. A similar sign-checking protocol is possible with GMW. However, GC-based comparison seems to be the popular choice in current solutions.
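The sign-checking idea can be illustrated on additive shares. In the sketch below, the parties subtract their shares locally, and a small secure component then reveals only the most significant bit of the difference, which is the sign bit in two's complement. The function sign_bit_circuit is a cleartext stand-in for that secure component (e.g., a garbled circuit), and the 32-bit ring is an assumption. The result is only valid when the true difference fits in 31 bits.

```python
import random

B = 32
MOD = 1 << B  # two's complement ring Z_{2^32} (an illustrative choice)

def share(x):
    """Additive shares of x; negative x is encoded in two's complement."""
    s0 = random.randrange(MOD)
    return s0, (x - s0) % MOD

def sign_bit_circuit(d0, d1):
    """Stand-in for the secure circuit: reconstruct and output only the MSB."""
    return ((d0 + d1) % MOD) >> (B - 1)

def less_than(a_sh, b_sh):
    """Return True if a < b, assuming |a - b| < 2^(B-1)."""
    # Local step: each party subtracts its own shares.
    d0 = (a_sh[0] - b_sh[0]) % MOD
    d1 = (a_sh[1] - b_sh[1]) % MOD
    # Secure step: only the sign bit of a - b is revealed.
    return sign_bit_circuit(d0, d1) == 1
```

The same encoding is what allows an activation such as ReLU to be evaluated by testing a single bit rather than running a full integer comparison.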
Division. Division is essential to many analytics algorithms, from the computation of a mean to the implementation of complex algorithms such as K-means [61] and Levenshtein distance [62]. Despite its prevalence and importance, translating division to its confidential version is expensive and often results in a performance bottleneck [63]. Veugen [58] presents a protocol for exact division in a client-server scenario, using the AHE scheme and additive noise masking. However, the protocol requires the divisor to be public knowledge. On top of that, the protocol requires O(b) homomorphic comparisons and O(b) encrypted communication for a division between two b-bit integers. Dahl, Chao, and Tomas [64] present two AHE-based division schemes that rely on Taylor approximation in a secure multi-party setting. The schemes incur expensive O(b) encrypted communication. It is possible to perform integer divisions with GC when the two parties hold the numerator and denominator, respectively, in a 2-party setting [63,9]. However, even with several optimizations, a division between two b-bit integers involves the construction and evaluation of a circuit with O(b) non-XOR gates [63]. A more practical solution would be to decrypt the operands at a crypto-service provider and conduct the division on plaintext before finally encrypting the result.
Linear Algebra Operations. Linear algebra operations, such as vector dot products, matrix-vector multiplications, and matrix-matrix multiplications, are the core operations for many machine learning algorithms. They are commonly implemented with the cryptographic versions of additions and multiplications, with some tricks in RLWE-based SHE for improved efficiency. Among all available methods, the AHE- and SHE-based implementations are the most efficient ones.
A dot product $x_k^T y_k$ involves O(k) element-wise homomorphic multiplications and additions. Similarly, a matrix-vector multiplication $A_{n \times k} x_k$ involves O(nk) homomorphic multiplications and additions, and a matrix-matrix multiplication $A_{n \times k} B_{k \times m}$ involves O(nkm) multiplications and additions. With the AHE scheme, one of the operands must remain unencrypted for these multiplicative operations. Therefore, the unencrypted operand needs some level of protection, e.g., novel randomized masking [12] with a minimized cost. With the message packing feature of the RLWE-like SHE scheme, one can easily vectorize the vector and matrix operations to gain more efficiency [45]. With such facilities, Jiang et al. [65] can optimize matrix-matrix multiplication with only O(k) complexity for symmetric matrices of k dimensions.
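As a worked identity for the AHE case, assume the standard Paillier scheme with public modulus n, an encrypted vector x, and an unencrypted vector y (the notation Enc and Dec is introduced here for illustration). The encrypted dot product then follows from the additive homomorphism:

```latex
\mathrm{Dec}\!\left( \prod_{i=1}^{k} \mathrm{Enc}(x_i)^{\,y_i} \bmod n^2 \right)
  \;=\; \sum_{i=1}^{k} x_i \, y_i \bmod n
  \;=\; x_k^{T} y_k \bmod n .
```

Under these assumptions, the computing party performs k modular exponentiations (the plaintext multiplications) and k - 1 modular multiplications (the homomorphic additions), which matches the O(k) operation count stated above.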
Randomized secret sharing enables linear algebraic operations with the multiplicative triplet generation approach in a multi-party setting. However, this involves the expensive AHE or OT-based multiplicative triplet generation schemes as used in [11,10]. In computing a matrix-vector multiplication Ab, each party is responsible for O(n+k) encryptions and upload, O(nk) homomorphic multiplications, O(nk+n) homomorphic additions, and O(n) decryptions.
One can easily map linear algebra operations to garbled circuits. GC-based vector and matrix addition/subtraction require O(kb) and O(nkb) AND gates, where b is the number of bits in the vector and matrix elements. They also result in O(kb) and O(nkb) communication, encryption, and decryption operations, respectively. A GC-based dot product for two k-dimensional vectors of b-bit elements is a collection of sub-circuits for multiplications and additions, which consist of $O(kb^2)$ AND gates. The cost also involves $O(b^2)$ encryption and decryption, and $O(b^2)$ encrypted communication. The GC-based dot product can easily extend to matrix-vector and matrix-matrix multiplication. However, GC-based linear algebra solutions are more expensive than HE-based ones.

Empirical Cost Comparison
We have formally analyzed different crypto implementations for each of the major operations. However, some of them look close in terms of big-O complexity.
To give a better idea of what the cost differences look like for the different implementations of the same operator, we also prepare Table 1. Since this comparison rests on a specific hardware configuration and software implementation, readers should focus only on the relative differences rather than the actual numbers. After a careful study of available AHE and SHE implementations, we choose the most efficient one for each category: we use the HELib library [66] for the RLWE encryption scheme and implement the Paillier cryptosystem [41] for the AHE encryption scheme. We adopt the ObliVM (oblivm.com) library for the garbled circuits. We also take the AHE scheme for the multiplicative triplet generation when using the randomized secret sharing (SecSh) method. We pick cryptographic parameters [2] corresponding to 112-bit security. All schemes allow at least a 32-bit message space overall. The RLWE parameters allow one full vector replication and at least two levels of multiplication. Note that the GC and SecSh costs are for the two-party setting, which has to involve communication costs between the two parties. Thus, we also include the bytes of exchanged messages for these methods. We run the experiments on an Intel i7-4790K CPU running at 4.0 GHz with 32 GB RAM and Ubuntu 18.04. Table 1 compares the related costs of arithmetic operations over integers. We have observed that the AHE scheme has the most efficient arithmetic additions and multiplications. However, for comparison and division, the 2-party garbled circuits are the only viable option. The table also shows the costs for the linear algebraic operations. The observation is consistent with the simpler arithmetic operations of addition and multiplication. As we can fit multiple messages in a ciphertext when using the RLWE scheme, the vectorized additions and multiplications are much more efficient than the non-vectorized additions and multiplications.
The RLWE scheme with message packing realizes homomorphic additions more efficiently than the Paillier scheme. The RLWE costs for dot product and matrix-vector multiplication include the ciphertext replication costs. Although better than without message packing, the RLWE scheme with vectorized linear algebraic operations is still slower than the Paillier solutions. Randomized secret sharing is almost free for vector addition but involves higher computation and communication costs for the dot product and matrix-vector multiplication. Garbled circuits appear to be the worst solution for the confidential versions of the linear algebraic operations, with higher computation and communication costs between the two parties. Although the Paillier implementation shows performance advantages over RLWE on arithmetic operations, it requires one operand to be plaintext. Paillier's encryption and decryption costs, however, are higher than those of RLWE [12]. When a CSP is involved in a solution, encryption and decryption costs become a critical performance factor. These cost comparisons on the basic operations will be useful for readers to analyze and compare a pair of CML protocols, especially when not all CML methods are open-source.
We do not experimentally compare complete CML approaches because 1) different approaches often solve different ML problems, which makes the comparison difficult, and 2) not all approaches have open-sourced their implementation or shared executable binaries. However, we hope the empirical comparison between different implementations for basic operators gives an intuitive understanding of the rationales behind different CML design strategies and optimization methods. We refer readers to the papers describing CML approaches that often contain detailed performance comparisons between selected CML approaches.
Insight. Based on most studies, the most efficient constructions for confidential comparison are GC-based, while SHE and AHE are better candidates for linear algebra operations. Since most division schemes are too expensive, one should consider transforming the functions/algorithms with divisions into equivalent (often approximate) ones that involve no division.

[2] The Paillier cryptosystem uses a 2048-bit key size. We set the degree of the corresponding cyclotomic polynomial in the RLWE scheme to φ(m) = 12,000 and use c = 7 modulus switching matrices, which gives us h = 600 slots for message packing.

Switching and Composing Strategies
When composing the confidential versions of operations implemented with different primitives, there is an important step: switching computation flows between the primitives. This switching often requires a second party in the CML framework, i.e., the data owner, a second non-colluding cloud, or a CSP, to achieve better performance.
HE to/from GC. Switching from a HE component to a GC component involves a second server (e.g., a CSP) in the framework. A straightforward approach would be to include a data decryption circuit inside a garbled circuit to be evaluated by the two parties. However, such an approach is prohibitively expensive [8]. A more practical strategy [8,9,13] is to have the party holding the encrypted data, denoted $P_A$, mask it homomorphically before sending it to the second party, $P_B$, for decryption. The second party constructs the desired garbled circuit, whose first step is de-masking the data using two inputs: the decrypted masked data from $P_B$ and the mask from $P_A$.
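The masking-based HE-to-GC switch can be sketched as follows, reusing a toy Paillier scheme (tiny primes, insecure, illustration only). The party roles follow the text: P_A holds the ciphertext and the mask, and P_B holds the decryption key. The function demask_step is a cleartext stand-in for the first step of the garbled circuit, and all function names are illustrative assumptions rather than the interface of any surveyed framework.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier keys with g = n + 1 (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))

def encrypt(n, m):
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n * mu) % n

def pa_mask(n, cx):
    """P_A: add a uniform mask r to Enc(x) homomorphically; keep r private."""
    r = random.randrange(n)
    return (cx * encrypt(n, r)) % (n * n), r

def pb_decrypt(n, sk, c_masked):
    """P_B: decrypt the masked ciphertext; learns only x + r mod n."""
    return decrypt(n, sk, c_masked)

def demask_step(n, masked_plain, r):
    """Stand-in for the circuit's first step: remove the mask inside the GC."""
    return (masked_plain - r) % n
```

In a real protocol, masked_plain is P_B's private circuit input and r is P_A's private circuit input, so neither party sees x in the clear. Here both are passed to an ordinary function only to show that the arithmetic is consistent.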
SecSh to/from GC. Switching from a SecSh component to a GC component is straightforward in a two-party architecture. The two random shares in possession of the two parties can be their respective private inputs to the desired garbled circuits [11,13,67]. Similarly, switching from GC to SecSh involves evaluating the GC and randomly distributing the output to two parties [67].
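The two directions of the SecSh/GC switch can be simulated in a few lines. In this sketch, gc_stand_in is a hypothetical cleartext placeholder for a garbled circuit: it takes the two parties' shares as inputs, reconstructs the value internally, applies a function, and returns the output as fresh random shares. The 32-bit ring and the ReLU example are assumptions for illustration.

```python
import random

B = 32
MOD = 1 << B

def share(x):
    """Additive shares of x modulo 2^32 (two's complement for negatives)."""
    s0 = random.randrange(MOD)
    return s0, (x - s0) % MOD

def reconstruct(s0, s1):
    return (s0 + s1) % MOD

def relu(x):
    """ReLU on a two's complement value: zero out values with the sign bit set."""
    return 0 if x >> (B - 1) else x

def gc_stand_in(f, s0, s1):
    """Placeholder for a garbled circuit evaluated by the two parties.

    SecSh -> GC: the shares s0 and s1 are the parties' private inputs.
    GC -> SecSh: the output is split into fresh random shares.
    """
    x = (s0 + s1) % MOD           # reconstruction happens inside the circuit
    y = f(x) % MOD
    out0 = random.randrange(MOD)  # random share, e.g., chosen by one party
    return out0, (y - out0) % MOD
```

Because the output is re-shared, the result can flow directly into further share-based additions and multiplications without either party learning the intermediate value.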
SecSh to/from HE. A switch from randomized secret sharing to a HE component needs the two involved parties to encrypt their respective shares. Then, one of the parties homomorphically reconstructs the protected value from the shares. Similarly, a switch from a HE component to a randomized secret sharing protocol includes a masking mechanism (homomorphic noise addition) similar to the HE-to-GC switch discussed above. These two switches are relevant in the AHE-based multiplicative triplet generation protocol for randomized secret sharing [11,10].
Table 2 provides some examples of switching between cryptographic primitives in well-known CML approaches. These switches lead to simplification of the CML framework and cost optimizations, as explained in the "Justification" column of the table. The ABY framework [10] covers different adapter-like switching protocols for the multi-party computation settings, where two servers hold the training data as arithmetic, boolean, or Yao's garbled shares. The ABY3 [68] and BLAZE [69] frameworks extend the switches to 3-party scenarios. These works, however, do not cover switching from and to the homomorphic encryption schemes.
Manual vs. Automated Composition. Most existing CML approaches using the hybrid composition strategy [11,13,12,9] are manually composed, as there are myriad problem-specific details to address. A line of research explores the possibility of automatically composing CML frameworks [70,71]. Although promising, the automatic composition strategy of Dreier and Kerschbaum [70] depends on the availability of an extensive performance matrix for the different confidential versions of the target algorithms' components. Henecka et al. [71] propose the TASTY compiler, which automatically compiles a given machine learning problem as a mixture of garbled circuits and homomorphic encryption in a secure two-party computation framework. However, the process is still not fully automated: it requires a privacy expert to design and specify the components as well as the recommended mappings.
Gap. Due to the high complexity of formulating the component-wise costs and profiling the switching costs, the automated composition approaches are not yet fully mature. More importantly, as we will see in the next section, the construction of a practical CML solution involves one more crucial step with which automated composition methods cannot help much. One must establish an in-depth understanding and analysis of the target ML algorithm to redesign a "crypto-friendly" algorithm.

Crypto-friendly ML Algorithms
So far, the DMC framework seems straightforward: one decomposes the target machine-learning algorithm to its sub-components and maps them to cryptographic constructions, and the final composition becomes almost trivial except that the primitive switching requires some clever steps. With enough experimentation, one can find an optimal set of confidential components for the target ML algorithm. However, this straightforward strategy may only work for some problems. Despite the best optimization of mapping and composition, one may still end up with an impractical protocol, although better than the homogeneous or other suboptimal compositions. The fundamental reason is that the original machine learning algorithms do not account for confidential computation. They are optimized to achieve the best model prediction power rather than to be crypto-friendly. On the other hand, a less-known slightly-under-performing ML algorithm that attains the same learning goal might be more cost-effective to translate to its confidential version. Thus, an advanced design step critical to the DMC procedure is replacing or redesigning some of the underlying ML components or even the entire ML algorithm to find the most efficient CML protocols. Table 3 summarizes some example CML frameworks that incorporate strategies to make their protocols crypto-friendly and hence more cost-effective. Mohassel et al. [11], in their SecureML work, substitute the expensive softmax operation involving inverses with a ReLU-based function involving only one division. This way, the framework significantly reduces the cost bottlenecks in their protocol. Graepel et al. [33] cleverly avoid division of encrypted data in the framework for confidential linear means classifier and Fisher's linear discriminant analysis by replacing divisions with a multiplicative factor. Nikolaenko et al. 
[9] use the more efficient Cholesky decomposition instead of the expensive LU decomposition in solving a system of linear equations in their linear regression framework. Similarly, Nikolaenko et al. [8] adopt the sorting-based matrix factorization solution to reduce the overall complexity of computing gradient descent with Cholesky decomposition-based matrix factorization. Sharma and Chen [13] propose to train a boosting classifier over encrypted data with an ensemble of random linear classifiers (RLC) instead of decision stumps. An RLC takes a mere N encrypted comparisons, whereas a decision stump takes far more. Naehrig et al. [72] replace the exponential function (the sigmoid) in their logistic regression protocol with the Taylor approximation of exponentiation. Computing the exact exponential function would have required many levels of multiplications over the encrypted message, which would have been intolerably expensive with SHE schemes. Similarly, Sharma et al. [12] replace the inherently expensive $O(N^3)$ eigendecomposition with the cheaper $O(N^2)$ approximation algorithms of Lanczos and Nyström in their spectral clustering framework.
Data reduction techniques such as subsampling and preserving the sparsity of matrices are also critical to performance. Nikolaenko et al. [8], in their matrix factorization framework, use a sorting network that optimizes the garbled circuit-based gradient descent algorithm by only updating it for the user ratings that are present in the training dataset. Similarly, Sharma et al. [12] propose a differential privacy-based graph submission mechanism that reduces total storage by over 15 times, and the costs of encryption and the associated homomorphic operations on the graph by over 20 times, when running the secure Nyström method for spectral clustering.
To sum up, although the approximate algorithms introduce some degradation to the learned models, they deliver desired cost practicality justifying the tolerable quality sacrifice.
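The quality-versus-cost trade-off of a polynomial substitute can be seen in a small cleartext experiment. The sketch below compares the exact sigmoid with its degree-3 Taylor polynomial around zero, which needs only additions and multiplications and is therefore friendlier to HE schemes. The specific polynomial degree and the helper names are illustrative assumptions and are not necessarily those used in [72].

```python
import math

def sigmoid(x):
    """Exact sigmoid; requires exponentiation and division."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_taylor3(x):
    """Degree-3 Taylor polynomial of the sigmoid around 0."""
    return 0.5 + x / 4.0 - x ** 3 / 48.0

def max_abs_error(lo, hi, steps=200):
    """Largest gap between the two functions on a uniform grid over [lo, hi]."""
    worst = 0.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        worst = max(worst, abs(sigmoid(x) - sigmoid_taylor3(x)))
    return worst
```

Evaluating max_abs_error on a narrow interval such as [-1, 1] and on a wider one such as [-4, 4] shows that the approximation is accurate near zero and degrades quickly away from it, which is why such substitutions usually come with input scaling or range assumptions.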
Insight. For the same learning problem, there are numerous algorithms. Even for the same learning algorithm, there are many variants [18]. The search space for optimal composition can be quite large. Making matters more difficult, most well-known ML algorithms are known for model quality or learning efficiency, and none is specifically designed with optimal CML in mind. Even worse, some crypto-friendly alternatives might have been forgotten or become obsolete due to their suboptimal quality or efficiency. The design of a good CML solution heavily depends on the designer's deep understanding of the ML algorithms and even the history of ML algorithm development.
Gap. There is no systematic way to explore crypto-friendly alternative ML algorithms. The current practice is to design a problem/algorithm-specific crypto-friendly solution. Although the problem-specific design experiences and lessons can extend to a new solution design, there are no well-known rules or general frameworks for exploring such alternative ML algorithms yet.

Security Proofs, Attacks, and Correctness
In this section, we summarize the three aspects: security proofs, attack analysis, and correctness for existing CML approaches, which are commonly discussed in other cryptographic protocols.
Security Proofs. Homogeneous approaches do not use complex protocols beyond the cryptographic primitive itself. For example, homomorphic encryption-based approaches involve only simple interactions between the client and the cloud: the client submits the data, and the cloud computes and returns the result. The GC-based methods have two involved parties following the fundamental GC protocols. Thus, most such approaches simply skip the security proof step, fully depending on the proven security and privacy guarantees provided by the primitives.
For hybrid approaches, proving security is more involved, as they may include complex interactions among parties. We have observed that two security proof frameworks are prevalent. SecureML [11] utilizes the Universally Composable Security (UC) framework [73]. The UC security framework defines a security-preserving universal composition operation and allows for modular design and analysis of complex cryptographic protocols from simpler building blocks. PrivateGraph [12], SecureBoost [13], and Lu et al. [34] adopt the simulation-based security proof [74]. The simulation approach needs to show the existence of a simulator in the ideal scenario that corresponds to the adversary in the real scenario, such that it is impossible to distinguish the interactions in the ideal scenario from those in the real scenario. The assumption of semi-honest parties held by most CML approaches makes the security proofs much easier [74,73]. As a result, many CML approaches omit the security proof.
Attacks. To our knowledge, attacks on the confidentiality of cryptographic CML approaches have not been fully explored. Most works we covered in this category did not mention any potential attacks on their approaches, partially due to the well-known security guarantees provided by the underlying primitives or formal security proofs provided by a few approaches. While all approaches want to fully protect feature vectors in the training data, some approaches require the labels (in supervised learning) to be exposed for easier modeling [33], and some even expose the final learned models [9,34]. However, recent studies have shown that exposed models may lead to serious attacks, such as model inversion attacks [37,75], and membership inference attacks [39].
Correctness. Contrary to some cryptographic protocols and encryption systems that need to prove their correctness (e.g., encrypted values can be correctly decrypted), the correctness of CML protocols is attached to the correctness of the original machine learning algorithms. The DMC procedure honestly reassembles the original learning algorithm with the cryptographic components. Thus, as long as the primitives preserve the correctness and the composition strategy does not change the correctness (see Section 6.2), the correctness property is guaranteed. However, when researchers adopt a crypto-friendly alternative algorithm or component, they must justify whether the alternative methods warrant/attain the desired learning objective. SecureBoost [13] depends on the basic boosting theory [53] that states any weak base classifier, including random weak linear classifiers, can be used for the boosting framework. Naehrig et al. [72] utilize the Taylor approximation of exponentiation to approximate the sigmoid function, which is a well-accepted mathematical method. While these alternative methods may affect the model quality, implying a potential trade-off between model quality and costs, they are all considered correct algorithms.
Gap. Security proofs are missing for some existing CML approaches, which raises the concern that they may contain flaws leading to significant information leaks. Further studies are needed to rigorously analyze these approaches.

Evaluation Methods
Researchers evaluate their proposed CML methods primarily based on costs and model quality. Some CML methods also involve trade-offs between these two aspects.
Costs. CML researchers are primarily concerned with protocol costs, striving to find the most efficient secure protocols. Since multiple parties are involved, the costs for each party, i.e., the cloud provider, the client, and possibly the crypto-service provider or the second cloud provider, are all essential to the design of CML protocols. For a given CML method, each party's costs combine encryption/decryption, data transmission, and other computation overhead. Because of the original motivation of outsourcing large-scale computation, a skewed cost distribution between the client and the cloud is fundamental, i.e., the client should incur much lower overhead than the cloud [12,13,11]. However, the client may still incur much higher costs when running CML protocols than when running the original non-secure ML solution. The costs of external storage and related I/O operations are also critical to the cloud-side components, as they are responsible for storing the encrypted data, which is often much larger than the plaintext version and cannot reside in memory. It is also highly desirable that the cloud-side computation can be parallelized with a popular processing framework such as MapReduce [76,12]. Besides, when GC is adopted as a primitive to implement some components, the additional communication cost related to the GC protocol is also significant, including the cost of transmitting the circuit and obliviously transferring one party's input data to the other party [50,49]. As a result, the use of GC is limited to a few operations, such as comparison [10]. The overall computation and communication costs of different approaches are frequently compared and used as a measure to show the novelty of a new method. For example, Mohassel et al. [11] show their work is more computationally efficient than the GC-based framework considered by [9] by about two orders of magnitude. Similarly, Sharma et al.
[13] show their boosting solution is about three times faster than the neural network CML in [11].
Model Quality. Model quality, a unique feature of CML evaluation, is often tightly related to the cost of model training. Many machine learning algorithms are iterative, such as logistic regression, neural networks, and many clustering algorithms. As a result, model quality increases with the number of iterations until the process converges. However, a large number of iterations implies increased overall costs. Some CML methods, e.g., Lu et al. [34], may only report the overall costs for one or a few iterations of a specific learning algorithm, which is insufficient unless the number of iterations necessary for optimal results is specified. More precisely, many works miss the requirement that model evaluation should be tied to the cost evaluation, i.e., how much cost is needed to reach a certain model accuracy [11,13]. The discussion on crypto-friendly alternative algorithms also assumes that model quality can be traded off against costs, with the expectation that the crypto-friendly alternative may perform comparably to or slightly worse than the original machine learning algorithm [13,12,11,33].

Other CML Approaches
So far, we have focused on cryptographic methods based on well-known primitives. To give a panoramic view of development in the growing area of confidential machine learning, we briefly discuss two closely related approaches, the perturbation-based approach and the hardware-assisted approach.

Perturbation Methods
Most practical CML solutions that carefully follow the DMC process with some innovative uses of crypto-friendly ML algorithms still cost orders of magnitude more than the original plaintext algorithms. Especially if the learning algorithm is intrinsically expensive or relies on a massive-scale training dataset, the cryptographic primitives that provide semantic security may become impractically expensive, discouraging users from adopting the outsourcing paradigm. Another category of work, the perturbation-based approach, offers much more efficient solutions with somewhat weaker security notions. Often, these methods do not guarantee semantic security and may only be resilient to ciphertext-only attacks. Nevertheless, they can be interesting for users who are willing to make a practical trade-off between efficiency and the level of protection. We briefly discuss this body of work to extend readers' interests to this unique domain.
The basic idea of perturbation is injecting random noise into the outsourced data while (approximately) preserving some specific properties that machine learning models rely upon. The most well-known properties are geometric and topological structures in the multidimensional space. Therefore, one can still train a model from the perturbed data on the untrusted platform with preserved confidentiality of both data and model. Typical perturbation methods include randomized response [77,78], additive perturbation [79], geometric perturbation [80], random projection perturbation [81], and random space perturbation [82]. They have been applied to decision tree learning [78,79], clustering [80,81], kNN classifiers [80], support vector machines [80], linear classifiers [80,83], and boosting [83]. The perturbation mechanisms can also disguise the training images in deep learning frameworks [84] to achieve much lower training costs than cryptographic protocols [11]. Furthermore, the perturbation methods often do not involve expensive cryptographic primitives.
Consequently, one can observe significant cost savings in the entire life cycle of data analytics, including data submission, computation, and communication amongst the involved parties.
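The property-preserving idea behind perturbation can be illustrated with a simplified geometric transformation. The sketch below applies a secret random rotation and translation to 2-D points and relies on the fact that such a transformation preserves pairwise Euclidean distances, so distance-based learners behave the same on the transformed data. This is an illustrative reduction only: it omits any noise component, works in two dimensions, and is not claimed to be the exact method of [80]. All names are hypothetical.

```python
import math
import random

def random_rotation_2d(rng):
    """Secret key part 1: a rotation matrix for a random angle."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def perturb(points, rot, shift):
    """Apply rotation then translation to every (x, y) point."""
    return [
        (rot[0][0] * x + rot[0][1] * y + shift[0],
         rot[1][0] * x + rot[1][1] * y + shift[1])
        for x, y in points
    ]

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def max_distance_change(points, perturbed):
    """Largest change in any pairwise distance after perturbation."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            before = dist(points[i], points[j])
            after = dist(perturbed[i], perturbed[j])
            worst = max(worst, abs(before - after))
    return worst
```

The data owner keeps the rotation and translation secret and submits only the perturbed points. The weaker security notion mentioned above shows up here as well: an adversary with a few known input-output pairs could estimate the transformation, which is why such methods need a clearly stated threat model.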
Insight. The key idea of perturbation approaches is to identify a certain high-level utility and preserve it in secure randomized transformations. Similar ideas have also been explored in the cryptographic domain, such as order-preserving encryption [85,86,87] and encrypted keyword search [88,89].
Gap. Despite their efficiency, perturbation approaches face two critical weaknesses. First, perturbation methods may significantly degrade data quality, which forces a trade-off between utility and confidentiality. Second, there is no systematic framework for analyzing the protection level guaranteed by a perturbation method. Some of them are known not to provide provable semantic security [80,82]. However, given a clear, rigorous threat model definition and thorough analysis, these methods can have high practical value in settings where users can accept the specific threat model.

Hardware-Assisted Approaches
During the past few years, hardware-assisted trustworthy computing has made a significant breakthrough. In particular, several CPU manufacturers have implemented trusted execution environment (TEE) platforms, among which the most popular one is Intel's Software Guard Extensions (SGX) [52]. We will take SGX as an example in the following. SGX defines a specific memory area (i.e., the enclave). Only the authorized owner can run programs and access data in the enclave via special instructions. Owners and users gain access rights via an attestation protocol. SGX minimizes the trust boundary to the enclave, which means that even if the entire operating system is compromised, adversaries cannot access the enclave. The physical enclave memory is limited (less than 100 MB is usable by users). When the enclave memory pages are swapped out/in by the virtual memory management subsystem of the OS [3], they are encrypted/decrypted by the SGX library functions implicitly. SGX uses AES encryption, and thus the encryption and decryption costs are much lower than those of the primitives we have discussed so far. Besides, since the enclave program works on decrypted data, there is no need to develop special CML algorithms for running inside the enclave, making SGX an appealing platform for developing CML solutions for complex algorithms working with large data.
However, there are a few challenges in migrating algorithms to the SGX environment. First, users need to learn the whole SGX working mechanism and learn to use special instructions and APIs, which can be inconvenient. A few efforts have simplified the migration of applications to SGX, among which the Graphene-SGX library OS [90], SCONE [91], and Panoply [92] are the most well-known. With a tool like Graphene-SGX, developing CML solutions becomes more straightforward. Lee et al. [93] have tried to migrate machine learning algorithms to SGX based on Graphene-SGX. However, these methods do not address side-channel attacks.
Second, side-channel attacks are considered the primary threat to SGX-based applications. Because TEEs prevent many traditional attacks and the threat model now assumes an adversary-controlled OS, side-channel attacks have become an active research area. Memory side channels and cache side channels are the two types that researchers have mostly examined. Memory side-channel attacks are primarily access pattern attacks [94,95,96]. As the encrypted data have to be loaded from the file to the untrusted area first and then accessed by the enclave, access pattern attacks seem inevitable for data-intensive applications like CML. The well-known approach to addressing this problem is the Oblivious RAM technique [97], which has been applied to SGX by ZeroTrace [94] and Obliviate [95]. Ohrimenko et al. [98] also used oblivious access techniques for multi-party machine learning with SGX. Branching attacks [96] exploit branching statements and manipulate page faults to extract information; they are often addressable with oblivious branching instructions such as CMOV [96,94,99]. Cache side-channel attacks such as cache timing and transient execution state [100,101,102,103] exploit unique CPU architectural features and thus depend on the manufacturers' firmware and software patches to fix. More studies are necessary to explore the full potential and unique problems of SGX-based CML.
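The two oblivious techniques mentioned above can be illustrated with a small sketch. It is a logic-level illustration only. Python offers no constant-time or memory-layout guarantees, and real enclave code would rely on native instructions such as CMOV. The helper names oblivious_select and oblivious_read are hypothetical. The read shown here is the trivial linear scan that touches every element. It hides the access pattern at O(n) cost per access and is not an Oblivious RAM construction, which aims at much lower overhead.

```python
def oblivious_select(cond, a, b):
    """Return a if cond == 1 else b, using bit masks instead of a branch.

    This mimics what a conditional move does: both inputs are always
    evaluated, and the secret condition only affects the data flow.
    """
    mask = -cond  # cond == 1 gives all one bits (-1); cond == 0 gives 0
    return (a & mask) | (b & ~mask)

def oblivious_read(array, secret_index):
    """Read array[secret_index] while touching every element in order.

    An observer of the memory access sequence sees the same scan for
    every value of secret_index.
    """
    result = 0
    for i, value in enumerate(array):
        hit = int(i == secret_index)  # comparison used as data, not control flow
        result = oblivious_select(hit, value, result)
    return result

table = [13, 21, 34, 55, 89]
print(oblivious_select(1, 7, 9))
print(oblivious_read(table, 3))
```

The design choice is to trade extra work for a fixed access sequence: every call performs the same scan regardless of the secret index, which is the property that access pattern defenses aim for.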
Insight. TEE techniques such as SGX can significantly boost CML's performance on untrusted platforms, as the solutions do not involve expensive crypto primitives or protocols. We consider SGX-based CML a promising direction because it achieves a strong confidentiality guarantee with significant performance benefits compared to other approaches.
Gap. The most critical challenge TEEs face is side-channel attacks, especially access pattern attacks. Also, machine learning algorithms have unique features (e.g., data access, batching, etc.) that may lead to specific attacks that have not been fully explored yet. Another practical concern is that most recent Intel server CPUs still do not have SGX enabled. A few cloud platforms such as Microsoft Azure and
[3] The enclave virtual memory management is only enabled on the Linux system for early versions of SGX, which might be changed in newer versions of SGX.

Conclusion
Despite the potential risk of data and model leakages, many resource-constrained data owners use untrusted platforms (e.g., clouds and edges) for training machine learning models. Researchers have been designing and developing confidential machine learning (CML) approaches for outsourced data using cryptographic primitives and various composition strategies. CML's overall goal is to protect the confidentiality of data, model, and intermediate results from the untrusted platforms while also preserving the trained model quality with acceptable costs. We have reviewed the recent significant developments on CML under a systemization framework, focusing on the cryptographic approaches. We have included the cryptographic primitives that are the backbone of the CML approaches and compared the costs for basic operations. While the homogeneous methods that rely on a single cryptographic primitive are straightforward, their solutions are too expensive to be practical. Thus, we focus on the primary design trend of the hybrid composition of multiple primitives under the decomposition-mapping-composition (DMC) procedure and the selection of crypto-friendly alternative learning algorithms. We describe the critical issues such as the switching between primitives and the principles of identifying crypto-friendly machine learning algorithms. Finally, we also include a brief discussion of related approaches and new directions, including the perturbation and hardware-assisted methods. At the end of most sections, we have also included a concise summary area labeled with Insight and Gap for readers to get the gist conveniently. We believe this survey can be valuable to both researchers and practitioners to build more complex and practical CML solutions in the future.

[Figure labels only: Semantic Security; Target ML Algorithm; Crypto & Privacy Primitives (Garbled Circuits, SHE, AHE, Randomized Sharing)]