IHVFL: a privacy-enhanced intention-hiding vertical federated learning framework for medical data

Vertical Federated Learning (VFL) has many applications in the field of smart healthcare, where it achieves excellent performance. However, current VFL systems primarily focus on privacy protection during model training, while the preparation of training data receives little attention. In real-world applications such as smart healthcare, the preparation of training data may involve a participant's intention, which can itself be private information for that participant. To protect the privacy of the model training intention, we describe the idea of Intention-Hiding Vertical Federated Learning (IHVFL) and illustrate a framework to achieve this privacy-preserving goal. First, we construct two secure screening protocols to enhance privacy protection in feature engineering. Second, we implement sample alignment based on a novel private set intersection protocol. Finally, we use the logistic regression algorithm to demonstrate the process of IHVFL. Experiments show that our model achieves good efficiency (less than 5 min) and accuracy (97%) on the Breast Cancer medical dataset while maintaining the intention-hiding goal.


Introduction
Driven by the availability of big data, machine learning plays an essential role in the field of smart healthcare (Garg and Mago 2021; Magoulas and Prentza 1999). There are many related applications, such as prediction of disease progression (Huang et al 2019; Brisimi et al 2018), medical image analysis (Li et al 2019; Roth et al 2020), and ancillary diagnosis (Qayyum et al 2020). However, we have to consider the following. For one thing, more data needs to be collected to improve the performance of models. For another, medical data such as health records, gene sequences, biometric data, and medical images are very sensitive and private, and they are difficult to gather or transfer between different organizations. What is more, with the increasing awareness of data security and user privacy, such transfers are nearly impossible in practice and are even forbidden by relevant laws and regulations, such as the General Data Protection Regulation (GDPR).
To solve the problem of "isolated data islands", the concept of federated learning was proposed (McMahan et al 2017). Depending on how data are split across parties, the idea was expanded into three categories (Yang et al 2019a): Horizontal Federated Learning (HFL) (Shokri and Shmatikov 2015; Liu et al 2022a; Aono et al 2017), Vertical Federated Learning (VFL) (Abuadbba et al 2020; Chen et al 2021; Liu et al 2022b), and Federated Transfer Learning (FTL) (Liu et al 2020a; Gao et al 2019). As an example, several hospitals may have similar patient feature data but few overlapping patient samples; they can then perform HFL tasks to obtain a common global medical model by sharing the same feature space. Similarly, the VFL situation arises when data is vertically split. For instance, hospitals and medical institutions usually have different feature spaces but the same sample space; they can perform VFL tasks and obtain a shared model. FTL solutions can be used when data differs not only in samples but also in feature space. There are many works on FL, but most of them focus on HFL, leaving a gap in the research on VFL. In this paper, VFL is the main topic.
VFL has a great future for applications in industries such as finance and healthcare. However, motivated by the rapid growth in VFL research and real-world applications, VFL is facing more demands for customized privacy protection. For example, Fig. 1 illustrates the additional security requirements in medical scenarios.
To train a federated model for medical applications, a medical company collaborates with a hospital. Both of them hold unique sensitive input data; meanwhile, the medical company, as the requester of VFL, has a sensitive intention for the model training. More concretely, this intention is to find the target data for training safely, which includes target features and target samples. Furthermore, if the privacy of the intention is compromised, it may negatively impact the requester's self-interest and result in the task's failure. As an illustration, a medical company's training program may aim to create a new medicine; if the plan is revealed to competitors, it will inevitably cost the medical company its interests. In the financial field, for instance, the intention of a financial company's training model may be to predict the credit ability of a specific target customer (e.g., one with an annual salary of $100,000); if the intention is revealed to the user, it will inevitably result in a loss of trust between the user and the financial company. We can see that the training intention is private information for the requester of VFL and cannot be disclosed to anyone, including the participants. Therefore, this additional privacy issue in VFL models must be addressed.
To the best of our knowledge, existing papers about VFL usually only focus on the privacy protection of model training, which ensures that the original data are not compromised. In this paper, we consider an additional privacy requirement: intention-hiding of the training model in VFL systems. For example, there are two medical institutions C and S, where C is a medical company and S is a hospital with a large patient database. Now, C combines with S to train a model to improve the quality of its products and services. Meanwhile, C hopes that S cannot obtain its intention, and S hopes that C cannot obtain its data, since they have their own privacy protection requirements. For S, it needs to protect the patient data from being compromised. For C, the intention of model training involves its business interests. Therefore, this is a typical VFL scenario with additional requirements. Compared with the traditional VFL scheme, we not only protect the data during the model training process, but also protect the intention of model training.
(Fig. 1 The privacy-preserving requirement of intention-hiding for medical data scenarios)
To achieve the goal of intention-hiding in medical data applications, we propose the idea of intention-hiding vertical federated learning and construct a framework to meet the privacy-preserving requirements. Our main contributions in this paper are summarized as follows:

• We propose and characterize the idea of intention-hiding vertical federated learning (IHVFL). Compared with traditional VFL, it satisfies the additional privacy requirement of intention-hiding.

The remainder of this paper is organized as follows. "Introduction" section presents the relevant background of this paper as well as our motivation and contributions. "Related works" section describes the related work on FL for medical data and the directions and concerns in VFL. "Preliminaries" section introduces the preliminaries of our work. "Definitions" section defines the concept of Intention-Hiding VFL and its security and privacy requirements. "IHVFL with Logistic Regression" section demonstrates the details of our proposed framework. "Security analysis" section presents the security proofs for the proposed protocols. "Experiments" section shows the results of the comparison experiments and evaluates the performance of our scheme. Finally, the conclusions of this research and future directions are summarized in "Conclusions" section.

Federated learning applications for medical data
Federated Learning (FL) has many applications in smart healthcare (Rieke et al 2020). For example, electronic health records (EHR) contain a lot of clinical medical information and are of great use in medical diagnosis. Huang et al (2019) made use of the EHRs across different hospitals to predict the mortality rate of heart disease patients. Brisimi et al (2018) used the cluster Primal Dual Splitting (cPDS) algorithm to predict whether a patient with heart disease will be hospitalized. Moreover, FL has emerged as a promising solution for supporting medical imaging tasks by learning from multi-source datasets without sharing data. Li et al (2019) used deep neural networks (DNNs) to support brain tumour segmentation, and to prevent patient privacy leakage, a differential privacy technique is adopted during model training. A real-world implementation of FL for medical imaging was presented in Roth et al (2020); they used the FL framework for breast density classification, and the performance is better than that of standalone learning approaches.

Directions and concerns in VFL
VFL has shown excellent performance in smart healthcare (Sun et al 2019; Brisimi et al 2018). In most works on VFL (Hardy et al 2017; Yang et al 2019a), there is a third party to assist the model training, which acts as a collaborator in the VFL system. However, centralized VFL suffers from a single point of failure and increases the potential risk of information leakage. In addition, it is difficult to find a fair and credible third party in practice. In order to prevent inference of local data from intermediate results (Zhu et al 2019), most existing works are based on Homomorphic Encryption (HE) to achieve the security goals (Aono et al 2016, 2017). Besides, Secret Sharing (SS) technologies are also popularly used to build VFL systems (Mohassel and Zhang 2017). Another line of work uses Differential Privacy (DP) (Dwork 2008; Sun et al 2020), which usually involves a trade-off between accuracy and privacy. However, these works focus only on the privacy protection of model training; little attention is paid to other additional privacy protection requirements.
More and more security goals are receiving attention. Recently, considering the asymmetry of participants' data, Liu et al (2020b) proposed the concept of asymmetrical vertical federated learning. They divide the participants into 'weak' and 'strong' parties based on the amount of data in the system, and construct an asymmetrical sample alignment protocol to protect the privacy of the weak participant. Similarly, Sun et al (2021), starting from protecting the intersection membership of all parties, proposed a private set union protocol to solve the problem. Instead of identifying the intersection of samples, they take the generated union of samples as training instances. From the perspective of improving data quality, Chen et al (2022) proposed an explainable VFL framework and provided the importance rate as the metric for evaluating the importance of features. Considering label privacy in medical scenarios, Fu et al (2022) proposed a new label attack method and revealed hidden privacy issues in VFL systems.
To better understand the current related work, we summarize the above schemes in a comparison table and list the addressed challenges in Table 1. We can see that as VFL is applied in real-world situations, more and more potential privacy protection concerns are being considered.

Preliminaries
In this section, we describe the setting and threat model of our proposal, and present some background knowledge. All the main notations used in this paper are shown in Table 2.

Vertical federated learning
In the federated learning setting, when the data are distributed vertically, i.e., the parties share the same sample ID space but differ in feature space, we call it vertical federated learning. Let D = (I, X, Y) denote a complete dataset, where I, X and Y represent the sample ID space, feature space and label space, respectively (Shokri and Shmatikov 2015). In the classic two-party vertical federated learning scenario, there are two datasets D_c = (I_c, X_c, Y) and D_s = (I_s, X_s). We call the party with labels the active party C, and the party without labels the passive party S.
In VFL systems, the distributed parties should share the same sample ID space. Therefore, the preparation work is to find the matching sample IDs among the parties and obtain the common sample space R = I_c ∩ I_s. This sample alignment phase is commonly done by a Private Set Intersection (PSI) protocol. Next, both parties train the model collaboratively by exchanging the intermediate results for gradients or the model. What is more, the intermediate results are masked by encryption, differential privacy or secret sharing techniques. Finally, each party holds a share of the model associated with its features.

Logistic regression
Logistic regression is a classical machine learning algorithm and has been used extensively in medical statistical analysis. The key components of two-party vertical logistic regression can be described as follows. The active party C holds the labels y_i and part of the features, while the passive party S holds the remaining features. They aim to learn a model w by minimizing the loss function L(w) = (1/n) Σ_i log(1 + exp(−y_i w^T x_i)), where σ(u) = 1/(1 + e^{−u}) is known as the sigmoid function. In this paper, we use the second-order Taylor expansion (Hardy et al 2017) to make the sigmoid function cryptographically friendly.
To efficiently learn the model, the mini-batch SGD algorithm trains and updates the model as w ← w − α · g_B, where |B| is the batch size, α is the learning rate, and the gradient over a mini-batch B is denoted as g_B = (1/|B|) Σ_{i∈B} ∇ℓ(w; x_i, y_i).
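As a concrete plaintext illustration of this training loop, the sketch below uses the second-order Taylor approximation σ(u) ≈ 1/2 + u/4 in place of the exact sigmoid and runs mini-batch SGD with 0/1 labels. The function names and hyperparameters are ours for illustration, not the paper's implementation.

```python
import random

def sigmoid_taylor(u):
    # second-order Taylor expansion of the sigmoid at 0:
    # sigma(u) ~ 1/2 + u/4 (the quadratic term vanishes since sigma''(0) = 0)
    return 0.5 + u / 4.0

def train(X, y, lr=0.1, batch=2, epochs=200):
    n_feat = len(X[0])
    w = [0.0] * n_feat
    idx = list(range(len(X)))
    for _ in range(epochs):
        random.shuffle(idx)
        for start in range(0, len(idx), batch):
            B = idx[start:start + batch]
            grad = [0.0] * n_feat
            for i in B:
                u = sum(wj * xj for wj, xj in zip(w, X[i]))
                err = sigmoid_taylor(u) - y[i]   # prediction error
                for j in range(n_feat):
                    grad[j] += err * X[i][j]
            # mini-batch update: w <- w - (lr/|B|) * sum of per-sample gradients
            w = [wj - lr * g / len(B) for wj, g in zip(w, grad)]
    return w
```

Because the approximation is linear in u, every step of this loop reduces to additions and multiplications, which is exactly what makes it friendly to the share-based protocols described later.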

Private set intersection
Private Set Intersection (PSI) is a preparatory step for VFL, used to find the common sample intersection.
Here we introduce an intersection-hiding PSI.
Let R_c = {(c_i, x^c_i)} be the set of (identifier, values) tuples associated with the active party, where x^c_i represents the vector of the i-th record. Similarly, let R_s = {(s_j, x^s_j)} be the set of the passive party. Let the intersection be I = {(x^c_i, x^s_j)} for all i, j where c_i = s_j, and let |I| be the size of the intersection. The values in the intersection are additively shared so that r^c_i + p^c_i = x^c_i mod 2^ℓ and r^s_i + p^s_i = x^s_i mod 2^ℓ, for an agreed-upon integer ℓ. Finally, the intersections are distributed between the participants as indistinguishable shares, which achieves the purpose of intersection-hiding.

Homomorphic encryption
Homomorphic Encryption (HE) (Paillier 1999; Tang et al 2022) is an encryption scheme that allows computations on ciphertexts, such that the computation results match those of the corresponding plaintext computations. Due to its significantly greater computational efficiency, additive homomorphic encryption is widely used in the field of federated learning. It mainly consists of the following algorithms:

• ParamGen(1^κ) → pp: κ is a security parameter, and the public parameter pp is implicitly fed into the following algorithms.
• KG(pp) → (pk, sk): input a public parameter, output a key pair (pk, sk), where pk is the public key and sk is the secret key.
• Enc(pk, m) → c: encrypt a plaintext m with pk, output a ciphertext c.
• Dec(sk, c) → m: decrypt a ciphertext c with sk, output the plaintext m.
• Add(c_1, c_2) → c: homomorphically combine ciphertexts of m_1 and m_2, so that Dec(sk, c) = m_1 + m_2.

Secret sharing
Secret sharing (SS) is a classic method in Multi-Party Computation (MPC) (Shamir 1979). For example, suppose Alice and Bob want to share a secret. Let the secret x be ℓ bits. Alice randomly generates an integer r ∈ Z_{2^ℓ} as ⟨x⟩_1, then calculates and sends ⟨x⟩_2 = x − r mod 2^ℓ to Bob. At last, Alice and Bob hold secret shares that satisfy ⟨x⟩_1 + ⟨x⟩_2 = x mod 2^ℓ. Similarly, Bob can share a secret y so that Alice gets ⟨y⟩_1 and Bob gets ⟨y⟩_2.

Additive secret sharing (ASS)
ASS is used to compute the result of x + y. Assume that Alice has ⟨x⟩_1, ⟨y⟩_1 and calculates ⟨z⟩_1 = ⟨x⟩_1 + ⟨y⟩_1 mod 2^ℓ. Similarly, Bob calculates ⟨z⟩_2 = ⟨x⟩_2 + ⟨y⟩_2 mod 2^ℓ, so each of them gets a share of the result. At last, they exchange their shares and obtain the result x + y = ⟨z⟩_1 + ⟨z⟩_2 mod 2^ℓ.
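The sharing and local-addition steps above can be sketched in a few lines over Z_{2^ℓ}; the helper names are ours for illustration.

```python
import random

ELL = 64
MOD = 1 << ELL  # the ring Z_{2^ell}

def share(x):
    # Alice keeps a random r as <x>_1 and sends <x>_2 = x - r mod 2^ell to Bob
    r = random.randrange(MOD)
    return r, (x - r) % MOD

def reconstruct(s1, s2):
    return (s1 + s2) % MOD

def add_local(xs, ys):
    # ASS: each party adds its two shares locally; no interaction needed
    return (xs[0] + ys[0]) % MOD, (xs[1] + ys[1]) % MOD
```

Note that addition requires no communication at all; only the final reconstruction does.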

Multiplicative secret sharing (MSS)
MSS is used to compute the result of x · y using the parties' shares. The implementation of MSS usually requires the help of Beaver triples (Beaver 1991). A Beaver triple consists of three random numbers a, b, c such that c = a · b, and it is private and secure for its owners. Now, Alice and Bob take the shared secrets x and y as input and get z = x · y as output. First, they calculate ⟨e⟩ = ⟨x⟩ − ⟨a⟩ and ⟨f⟩ = ⟨y⟩ − ⟨b⟩, respectively. Next, they exchange these shares and reconstruct e and f. One thing to mention is that since a and b are random and private, opening e and f does not leak information about x and y. Finally, Alice calculates ⟨z⟩_1 = e · f + f · ⟨a⟩_1 + e · ⟨b⟩_1 + ⟨c⟩_1 and Bob calculates ⟨z⟩_2 = f · ⟨a⟩_2 + e · ⟨b⟩_2 + ⟨c⟩_2, so that they get ⟨z⟩_1 + ⟨z⟩_2 = x · y.
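The Beaver multiplication above can be checked numerically with a trusted dealer standing in for triple generation (an assumption for illustration; in practice triples are produced ahead of time, e.g. via HE or oblivious transfer):

```python
import random

MOD = 1 << 64

def share(x):
    r = random.randrange(MOD)
    return r, (x - r) % MOD

def beaver_mul(x_sh, y_sh):
    # dealer: random a, b and c = a*b, all additively shared
    a, b = random.randrange(MOD), random.randrange(MOD)
    a_sh, b_sh, c_sh = share(a), share(b), share((a * b) % MOD)
    # both parties open e = x - a and f = y - b
    # (this leaks nothing, since a and b are uniformly random and private)
    e = (x_sh[0] - a_sh[0] + x_sh[1] - a_sh[1]) % MOD
    f = (y_sh[0] - b_sh[0] + y_sh[1] - b_sh[1]) % MOD
    # party 1 adds the public e*f term once; both add f*<a> + e*<b> + <c>
    z1 = (e * f + f * a_sh[0] + e * b_sh[0] + c_sh[0]) % MOD
    z2 = (f * a_sh[1] + e * b_sh[1] + c_sh[1]) % MOD
    return z1, z2
```

Expanding z1 + z2 gives (x − a)(y − b) + (y − b)a + (x − a)b + ab = x · y, matching the derivation in the text.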

Additive secret resharing (ASR)
The protocol can be modified so that the results of additive secret sharing can continue to be used for multiplicative secret sharing (Xia et al 2021). In other words, a secret shared over ASS is converted to a secret shared over MSS by additive secret resharing. As before, Alice and Bob take ⟨x⟩_1 and ⟨x⟩_2 as input and get ⟨z⟩_1, ⟨z⟩_2 as output such that ⟨z⟩_1 · ⟨z⟩_2 = x. First, Alice calculates and sends e = (⟨x⟩_1 − ⟨c⟩_1)/a. Second, Bob calculates ⟨z⟩_2 = e + b and d = (⟨x⟩_2 − ⟨c⟩_2)/⟨z⟩_2, and sends d to Alice. Then Alice calculates ⟨z⟩_1 = d + a. Finally, they get ⟨z⟩_1 and ⟨z⟩_2, respectively, such that ⟨z⟩_1 · ⟨z⟩_2 = x.
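The resharing steps can be verified numerically. Since the protocol divides by a and by ⟨z⟩_2, this sketch works over a prime field Z_p, where every nonzero element is invertible (a simplifying assumption for illustration; the names are ours):

```python
import random

P = 2**61 - 1  # a Mersenne prime; nonzero elements all have inverses

def inv(a):
    return pow(a, P - 2, P)  # Fermat inverse

def add_to_mul(x1, x2):
    # dealer supplies random a, b with c = a*b, additively shared as c1 + c2
    a, b = random.randrange(1, P), random.randrange(1, P)
    c1 = random.randrange(P)
    c2 = (a * b - c1) % P
    e = (x1 - c1) * inv(a) % P     # Alice -> Bob
    z2 = (e + b) % P               # Bob's multiplicative share
    d = (x2 - c2) * inv(z2) % P    # Bob -> Alice (z2 is nonzero w.h.p.)
    z1 = (d + a) % P               # Alice's multiplicative share
    return z1, z2                  # z1 * z2 == x1 + x2 (mod P)
```

Expanding z1 · z2 gives (x2 − c2) + a · e + a · b = (x1 + x2) − (c1 + c2) + c = x, which is the conversion claimed in the text.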

Definitions
In this section, we formally describe the notion of Intention-hiding Vertical Federated Learning.

IHVFL for medical data
Let C, S represent the active and passive party in VFL, respectively, where C combines with S to train a model. For example, the intention of C is to train a diabetes prediction model on the older population. To do this, C needs to get the target features about diabetes with the protocol of secure features screening. Meanwhile, C also needs to obtain the target samples older than 60 years with the protocol of secure samples screening. Next, they use the aligned target data to train the model jointly.
In this process, besides data privacy, the additional privacy-preserving requirements concern the target features and target samples, which represent the intention of model training. Formally, let F = {d, s} denote the intention of C, where d represents the features of the target data and s represents the samples of the target data. For example, in our demo above, d is the feature set of the diabetes model and s is the set of samples older than 60. If C does not leak F to S in the process of VFL, we consider that it has achieved the goal of intention-hiding, and we denote it as Intention-Hiding Vertical Federated Learning (IHVFL). We achieve this goal in the semi-honest model (Chen et al 2021; Mohassel and Zhang 2017) and illustrate the architecture of IHVFL in Fig. 2.

Security and privacy requirements of IHVFL
In the IHVFL system, a key point is the stage of data preparation. First, C performs the privacy-preserving screening protocols on the data set of S to get the target features of diabetes. In this process, C cannot get the raw feature data of S, and S cannot know the features that C selected. Next, C gets the target samples that meet the condition of persons aged older than 60. In this stage, C cannot get the raw sample data of S, and S cannot know the condition. More specifically, the security and privacy requirements of IHVFL are constructed via the following aspects:

• The features that C selects in the model training are hidden from S. To do this, a secure features screening protocol is needed to ensure that C can securely get the target features from S.
• The samples that meet the condition of C are hidden from S. To do this, a secure samples screening protocol is needed to ensure that C can securely get the target samples without leaking the conditions.

IHVFL with logistic regression
To achieve intention-hiding in a VFL system, in this section we investigate privacy-preserving feature engineering and the intersection-hiding PSI protocol in the process of data preparation, and propose a novel and general approach to training the model in IHVFL. First, we construct a secure screening protocol by combining HE and SS to enhance the privacy-preserving ability of data preparation. Next, we describe the solution of private set intersection with secret shares. Finally, as an example, we choose logistic regression, a classical algorithm widely used for medical data (Caruana et al 2015; Jothi et al 2015), to describe the procedure of intention-hiding federated model training.

Privacy-preserving feature engineering
To get the target data, the passive party S needs to publish a feature statement to the active party C. Next, C selects the target data for model training. It is worth mentioning that C and S should determine the target features by secure federated feature engineering (Fang et al 2020).
Let D_{m×n} be the data of S, where m is the number of samples and n is the number of features. C obtains the target data D_{i×j} from S by privacy-preserving feature engineering, which includes two stages: features screening and samples screening. In the first, C gets the shares containing the target features. In the second, C screens the shares with a secret condition and gets the shares that satisfy the condition. Finally, both parties hold the target shares for downstream computation. A key point to note is that we construct the protocol based on secret sharing and permutation techniques; it is secure under the DDH problem (Buddhavarapu et al 2020).

Features screening
In order to achieve the purpose of features screening, C needs to know a statement σ about the feature definitions and declarations of the passive party S in advance. For example, given σ = (age, glucose, bmi, blood pressure, cholesterol), C selects the features about diabetes using σ, then calculates and sends shares to S. We assume that S holds one share {⟨d_1⟩_1, ⟨d_2⟩_1, ..., ⟨d_j⟩_1}, while C holds the other share {⟨d_1⟩_2, ⟨d_2⟩_2, ..., ⟨d_j⟩_2}. In particular, to better protect the target features during features screening, the data needs to be shuffled by C with a predefined permutation π; the process of recovering this permutation is denoted as π^{−1}. This step ensures that the party being screened cannot distinguish which features are chosen. We describe the process in Algorithm 1.
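The data flow of features screening, though not its cryptography, can be illustrated as follows: the selected columns of S's data end up additively shared between the parties, and a permutation π with its inverse π^{−1} hides the order of the selection. All names here are ours for illustration; the actual hiding of the selection relies on the HE-based steps of Algorithm 1.

```python
import random

MOD = 1 << 32

# statement sigma published by S (example from the text)
sigma = ["age", "glucose", "bmi", "blood pressure", "cholesterol"]

def share_value(v):
    r = random.randrange(MOD)
    return r, (v - r) % MOD

def permute(items, pi):
    return [items[p] for p in pi]

def invert(pi):
    # permute(permute(x, pi), invert(pi)) recovers the original order
    pi_inv = [0] * len(pi)
    for pos, p in enumerate(pi):
        pi_inv[p] = pos
    return pi_inv

def screen_features(data_row, targets):
    # select the target columns and split each selected value into two shares
    cols = [sigma.index(t) for t in targets]
    shares = [share_value(data_row[j]) for j in cols]
    s_share = [s for s, _ in shares]   # held by S
    c_share = [c for _, c in shares]   # held by C
    return s_share, c_share
```

After this step neither party sees the selected values in the clear, yet together the shares reconstruct exactly the target columns.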

Samples screening
In some scenarios, we still need to screen the samples. For example, to build a prediction model for diabetes in adults above 60, we need to screen the samples with age older than 60. For this purpose, we designed a secure samples screening protocol based on the ASR. With the shares from features screening, C takes a screening vector ρ = {60, 0, 0, 0, 0, 0}, which means that C wants to screen the first feature, age, and the value to be screened is '60'. Meanwhile, C holds a conditional vector τ ← {−1, 0, 1}*, which indicates whether the feature at the associated index satisfies the condition or not: '−1' denotes value < '60', '0' denotes value = '60', and '1' denotes value > '60'. Therefore, C can get the sample shares of S that meet the condition by secure comparison. The details of secure samples screening are presented in Algorithm 2.
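In the clear, the semantics of ρ and τ look like the following (our reading of the vectors; in Algorithm 2 the comparison runs on secret shares via ASR, never on plaintext values):

```python
import operator

# tau entry at the screened index: -1 -> "<", 0 -> "==", 1 -> ">"
CMP = {-1: operator.lt, 0: operator.eq, 1: operator.gt}

def screen_samples(rows, rho, tau):
    # rho holds the screening value at the screened feature's index;
    # we assume the single nonzero entry of rho marks that index
    idx = next(i for i, v in enumerate(rho) if v != 0)
    cond = CMP[tau[idx]]
    return [k for k, row in enumerate(rows) if cond(row[idx], rho[idx])]
```

With ρ = {60, 0} and τ = {1, 0}, this keeps exactly the rows whose first feature (age) exceeds 60, which is the output the secure protocol delivers as shares.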
We describe the execution process of the above two protocols in Fig. 3. We can observe that C gets the shares containing the target features after executing the features screening protocol. Next, C gets the shares containing the target samples through the secure samples screening protocol. Finally, the target data is distributed between the two parties in the form of shares.

PSI based on secret shares
We construct a PSI protocol based on secret shares. Unlike traditional PSI, our approach protects not only the samples outside the intersection, but also the intersection IDs. That is to say, no participant knows specific information about the intersection, such as whether a given sample exists in it. This is because the intersection is distributed among the participants in the form of indistinguishable shares.
As mentioned above, after executing the secure screening protocol in "Privacy-preserving feature engineering" section, C gets the target shares of S. Meanwhile, S needs to get the target shares of C. Therefore, C performs the same strategy to screen its own features and samples. Next, C encrypts and sends them to S. At this time, S calculates the shares and sends them back to C after a shuffle with π_s. Finally, C decrypts them and gets the other part of the shares. It is worth mentioning that this shuffle step prevents C from knowing the sequence of samples, which ensures that C does not know the intersection IDs in the process of PSI. In fact, during the execution of the protocol, the samples of both parties are shuffled by the other party, and the order is restored prior to the intersection comparison, so as to prevent the other party from inferring additional information from the sequence of the samples. The PSI based on target shares proceeds in the following manner:

1. Calculate and exchange markings:
(i) C computes H(c_i)^c using a random scalar c for all i and sends them to S.
(ii) S computes H(s_m)^s using a random scalar s for all m, then computes H(c_i)^{c·s}, shuffles the results with π_s, and sends them to C.
(iii) C selects the target records from H(s_m)^s with L_1 and π_c^{−1}, gets H(s_j)^s and computes H(s_j)^{s·c}.

Calculate intersections and output shares:
(i) C determines whether the shares are aligned by checking H(c_i)^{c·s} = H(s_j)^{s·c} and sends the index pair (i, j) to S.
(ii) S finds the corresponding shares by (i, j) and gets R_{s,I}.

It is important to note that we do not seek to improve PSI performance; we simply present a shares-based PSI approach that meets our demands for privacy preservation. Therefore, we construct a PSI protocol based on DDH with random shuffling (Buddhavarapu et al 2020). A formal description of the protocol is shown in Algorithm 3.
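The marking exchange of steps 1 and 2 can be sketched with modular exponentiation standing in for the group operation. The group parameters and hash-to-group construction here are purely illustrative, and the shuffle with π_s and the share distribution are omitted for brevity.

```python
import hashlib
import random

P = 2**127 - 1  # a Mersenne prime; we work in the multiplicative group mod P
G = 3

def H(ident):
    # hash an identifier into the group (illustrative hash-to-group)
    e = int.from_bytes(hashlib.sha256(ident.encode()).digest(), "big")
    return pow(G, e % (P - 1), P)

def psi_indices(ids_c, ids_s):
    c = random.randrange(2, P - 1)  # C's secret scalar
    s = random.randrange(2, P - 1)  # S's secret scalar
    # step 1: C sends H(c_i)^c; S raises them to s -> H(c_i)^{c*s}
    double_c = [pow(pow(H(x), c, P), s, P) for x in ids_c]
    # step 1: S sends H(s_m)^s; C raises them to c -> H(s_m)^{s*c}
    double_s = [pow(pow(H(x), s, P), c, P) for x in ids_s]
    # step 2: equal double-blinded markings reveal aligned index pairs (i, j)
    return [(i, j) for i, m in enumerate(double_c)
            for j, n in enumerate(double_s) if m == n]
```

Since H(x)^{c·s} = H(x)^{s·c}, identical identifiers produce identical markings no matter which party blinded first, while under DDH the markings of non-matching identifiers look random.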

Intention-hiding vertical logistic regression
The downstream computational procedure can easily be constructed based on the aligned secret shares. Now, we introduce the process of intention-hiding model training using logistic regression.

Secure matrix multiplication overview
As described above, matrix multiplication operations play a key role in logistic regression, so we construct a protocol combining homomorphic encryption and secret sharing, which implements secure matrix multiplication between two parties when data is distributed vertically. As before, let C and S represent the active party and passive party, respectively. They want to compute the product of matrices X and Y securely. First, C encrypts X and sends the ciphertexts to S. Next, S calculates the encrypted X · Y via the ciphertext additive operation and splits it into shares. Finally, C decrypts and gets its share of the matrix product. More details are given in Algorithm 4.
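The idea can be sketched with textbook Paillier. The tiny fixed primes are didactic parameters, not secure ones, and the class and function names are ours, not the paper's: C encrypts X entrywise, S computes each inner product homomorphically and masks it with a random r it keeps as its share, and C decrypts the masked result as its share.

```python
import math
import random

class Paillier:
    # textbook Paillier with tiny fixed primes -- for illustration only
    def __init__(self, p=104729, q=104723):
        self.n = p * q
        self.n2 = self.n * self.n
        self.g = self.n + 1
        self.lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
        l = (pow(self.g, self.lam, self.n2) - 1) // self.n
        self.mu = pow(l, -1, self.n)

    def enc(self, m):
        r = random.randrange(1, self.n)
        return pow(self.g, m % self.n, self.n2) * pow(r, self.n, self.n2) % self.n2

    def dec(self, c):
        return (pow(c, self.lam, self.n2) - 1) // self.n * self.mu % self.n

    def add(self, c1, c2):
        return c1 * c2 % self.n2   # ciphertext of m1 + m2

    def smul(self, c, k):
        return pow(c, k, self.n2)  # ciphertext of k * m

def sec_mm(X, Y, he):
    # C encrypts X entrywise and sends the ciphertexts to S
    EX = [[he.enc(v) for v in row] for row in X]
    masked, share_s = [], []
    for i in range(len(X)):
        m_row, s_row = [], []
        for k in range(len(Y[0])):
            acc = he.enc(0)        # ciphertext of sum_j X[i][j] * Y[j][k]
            for j in range(len(Y)):
                acc = he.add(acc, he.smul(EX[i][j], Y[j][k]))
            r = random.randrange(he.n)       # S's mask, kept as its share
            s_row.append(r)
            m_row.append(he.add(acc, he.enc(-r)))
        masked.append(m_row)
        share_s.append(s_row)
    # C decrypts the masked ciphertexts as its share;
    # (share_c + share_s) mod n reconstructs X . Y
    share_c = [[he.dec(c) for c in row] for row in masked]
    return share_c, share_s
```

The output is exactly an additive sharing of X · Y over Z_n, ready to feed the share-based training loop.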

Logistic regression with shares
We now introduce the procedure of intention-hiding vertical logistic regression. First, C and S input their shares generated from the secure screening protocol and the initialized models. Next, they collaboratively calculate the shares of the prediction with the SecMM protocol. After they get the shares of the error, they can calculate the shares of the gradients to finish the model update. This procedure is described in Algorithm 5. Finally, each party obtains its own part of the model. It is important to note that although the parties can access their own gradients and models during each iteration, they do not gain additional access to each other's private information, since the data and labels remain consistently shared.

Security analysis
In this section, the security of the proposed IHVFL framework will be analyzed and proved in detail.

Security definition
We use simulation-based definitions of security for secure two-party computation to prove that the protocol is secure against a semi-honest adversary. Let F be the functionality computed by the two-party protocol Π, and let P_i and x_i represent the parties and their inputs, where i ∈ {1, 2}. The view of P_i consists of its input, randomness r and the messages exchanged throughout the protocol Π, which is denoted as VIEW_{P_i}.
Definition 1 Protocol Π securely computes function F against a semi-honest adversary if there exist two probabilistic polynomial-time (PPT) simulators SIM_1 and SIM_2 such that

SIM_1(1^K, x_1, F(x_1, x_2)) ≅ VIEW_{P_1}(x_1, x_2),
SIM_2(1^K, x_2, F(x_1, x_2)) ≅ VIEW_{P_2}(x_1, x_2),

where K is the security parameter and ≅ denotes computational indistinguishability. We prove the above equations for a semi-honest C and a semi-honest S, respectively.

Security analysis of SFS
In Algorithm 1, we can see that the messages obtained by C include the ciphertext d_c and the output I_c, while the messages obtained by S only include the output I_s. For C, its private input does not leave its local side. For S, its private input is protected by the HE technology; as long as the private key is not disclosed, the private input is safe. Formally, we have the following theorem.
Theorem 1 (Security of SFS against a semi-honest C). Assume that the additive HE scheme is indistinguishable under chosen-plaintext attacks. Then the protocol of SFS is secure under Definition 1.

Proof We construct a PPT simulator SIM_C to simulate the view of C in the protocol execution. The view of C consists of its input, randomness r_c, the obtained ciphertext d_c and the output I_c. Since d_c is produced under an IND-CPA secure additive HE scheme, SIM_C can replace it with an encryption of random values that is computationally indistinguishable to C, so the probability distributions of C's view and SIM_C's output are identical. Hence, we claim that Eq. (3) holds. This completes the proof of security of SFS in the case of a semi-honest C.

Theorem 2 (Security of SFS against a semi-honest S). Assume that the additive HE scheme is indistinguishable under chosen-plaintext attacks. Then the protocol of SFS is secure under Definition 1.
Proof We construct a PPT simulator SIM_S to simulate the view of S in the protocol execution. For the functionality F_SFS, VIEW_S(σ, D, K, I_c, I_s) consists of S's input D, randomness r_s, the obtained ciphertext d_s and I_s. Given K, D, I_s, SIM_S generates a simulation of VIEW_S(σ, D, K, I_c, I_s) as follows. It encrypts D with pk_s, shuffles the ciphertexts randomly and obtains d_s′. Then, it outputs (D, I_s, r_s, d_s′). It is observed that both d_s and d_s′ are valid ciphertexts and appear indistinguishable to S. Consequently, the probability distributions of S's view and SIM_S's output are identical. Hence, we claim that Eq. (3) holds. This completes the proof of security of SFS in the case of a semi-honest S.

Security analysis of SSS
In Algorithm 2, we can see that the messages obtained by C include d and µ_2. For C, its local calculation data includes ρ_1 and υ_1; as long as the triples held by S are not leaked, d and µ_2 held by C are indistinguishable random values. For S, the messages it obtains include ρ_2, e and L_1. Similarly, as long as the triples held by C are not leaked, the messages held by S are indistinguishable. It is worth mentioning that since both C and S hold L_1, they both learn the size of the target sample set, but cannot determine whether a certain sample is in the set. Formally, we have the following theorem.
Theorem 3 (Security of SSS against a semi-honest C). Assume that the triples are random and secure. Then the protocol of SSS is secure under Definition 1.

Proof We construct a PPT simulator SIM_C to simulate the view of C in the protocol execution. The view of C consists of its input, randomness r_c, the obtained values d and µ_2, and the output. Since the triples held by S are random and never leaked, d and µ_2 are indistinguishable from random values, so SIM_C can replace them with randomly selected values, and the probability distributions of C's view and SIM_C's output are identical. Hence, we claim that Eq. (3) holds. This completes the proof of security of SSS in the case of a semi-honest C.

Theorem 4 (Security of SSS against a semi-honest S). Assume that the triples are random and secure. Then the protocol of SSS is secure under Definition 1.

Proof We construct a PPT simulator SIM_S to simulate the view of S in the protocol execution. For the functionality F_SSS, VIEW_S(ρ, τ, I_c, I_s) consists of S's input I_s, randomness r_s, the obtained values ρ_2, e, L_1 and the output I_s′. Given K, I_s and I_s′, SIM_S generates a simulation of VIEW_S(ρ, τ, I_c, I_s) as follows. It randomly selects values ρ_2′, e′, L_1′ and outputs (I_s, I_s′, ρ_2′, e′, L_1′). It is observed that ρ_2, e, L_1 and ρ_2′, e′, L_1′ are indistinguishable random values to S. Consequently, the probability distributions of S's view and SIM_S's output are identical. Hence, we claim that Eq. (3) holds. This completes the proof of security of SSS in the case of a semi-honest S.

Security analysis of IH-PSI
In Algorithm 3, we can see that the messages obtained by C include U_s and E_c, its input includes c, x^c_1, x^s_1, π_c, and its output is R_{c,I}. We can construct a PPT simulator SIM_C to simulate the view of C in the protocol execution; it generates a simulation of VIEW_C as follows. First, it generates c honestly. Next, for each i ∈ [1, I], SIM_C randomly chooses group elements to form U_s′, so that (U_s, U_s′) are indistinguishable to C. Formally, we have the following theorem.
Theorem 5 (Security of IH-PSI against a semi-honest C). Assume that the DDH problem is hard. Then the protocol of IH-PSI is secure under Definition 1.

Proof Using a sequence of hybrid arguments, we show that the distribution generated by SIM_C is indeed indistinguishable from VIEW_C.
H_0: This is the view of C in the real execution of IH-PSI.
H_{1,i}: For i ∈ [1, I], the same as H_{1,i−1} except that we replace H(c_i)^{c×s} in E_c with a randomly selected element g_i ∈ G.
H_{2,j}: For j ∈ [1, J], the same as H_{2,j−1} except that we replace H(s_j)^s in U_s with a randomly selected element g_j ∈ G.
To begin with, we argue that H_{1,i−1} and H_{1,i} are indistinguishable to C. For any PPT adversary A who can distinguish the two hybrids, we devise a challenger B that can solve the DDH problem. B is given (g, g^a, g^b, g^c) and needs to decide whether c is random or c = ab. First, on input c_i, B programs H(·) to return g^b, and lets g^a play the role of g^{c×s}. For the challenge marking m, B sets m = g^c and sends m to A; B itself cannot tell whether m belongs to H_{1,i−1} or H_{1,i}. From A's perspective, m = H(c_i)^{c×s} if c = ab; otherwise m is a uniformly random element of G, exactly like the replaced g_i. Therefore, if A judges that m belongs to H_{1,i−1}, B outputs c = ab; if A judges that m belongs to H_{1,i}, B outputs that c is random. That is to say, if A can distinguish the two hybrids, then B can solve the DDH problem with the same probability.
Next, we argue that H_{2,j−1} and H_{2,j} are indistinguishable to C. For any PPT adversary A who can distinguish the two hybrids, we devise a challenger B that can solve the DDH problem. B is given (g, g^a, g^b, g^c) and needs to decide whether c is random or c = ab. First, on input s_j, B programs H(·) to return g^b, and lets g^a play the role of g^s. For the challenge marking m, B sets m = g^c and sends m to A; B itself cannot tell whether m belongs to H_{2,j−1} or H_{2,j}. From A's perspective, m = H(s_j)^s when c = ab; otherwise m is a uniformly random element of G, exactly like the replaced g_j. Therefore, if A judges that m belongs to H_{2,j−1}, B outputs c = ab; if A judges that m belongs to H_{2,j}, B outputs that c is random. That is to say, if A can distinguish the two hybrids, then B can solve the DDH problem with the same probability. This completes the proof of security of IH-PSI against a semi-honest C.
Similarly, for S, the obtained messages include U_c and L_2; its input includes s, x_c^2, x_s^2, π_s, and its output is R_{s,I}. We can construct a PPT simulator SIM_S to simulate the view of S in the protocol execution, which generates a simulation of VIEW_S as follows. First, it generates s honestly. Next, for each i ∈ [1, I], SIM_S randomly chooses g_i ← G and lets U_c' = U_c' ∪ {g_i}. Meanwhile, it generates L_2 using the index of R_{s,I}. From VIEW_S and SIM_S, we need to argue that (U_c, U_c') are indistinguishable to S. Formally, we have the following theorem.
Theorem 6 (Security of IH-PSI against a semi-honest S). Assume that the DDH problem is hard. Then the protocol of IH-PSI is secure in Definition 1.
Proof Using a sequence of hybrid arguments, we show that the distribution generated by SIM_S is indeed indistinguishable from VIEW_S.

H_0: This is the view of S in the real execution of IH-PSI.
H_{1,i}: For i ∈ [1, I], the same as H_{1,i−1} except that we replace H(c_i)^c in U_c with a randomly selected element g_i ∈ G.
H_2: The view of S output by SIM_S^{IH-PSI}.
We argue that H_{1,i−1} and H_{1,i} are indistinguishable to S. For any PPT adversary A who can distinguish the two hybrids, we devise a challenger B that can solve the DDH problem. B is given (g, g^a, g^b, g^c) and needs to decide whether c is random or c = ab. First, on input c_i, B programs H(·) to return g^b, and embeds g^a as the client's key element. For the challenge marking m, B sets m = g^c and sends m to A; B itself cannot tell whether m belongs to H_{1,i−1} or H_{1,i}. From A's perspective, m equals the real marking H(c_i)^c if the DDH challenge satisfies c = ab; otherwise m is a uniformly random element of G, exactly like the replaced g_i. Therefore, if A judges that m belongs to H_{1,i−1}, B outputs c = ab; if A judges that m belongs to H_{1,i}, B outputs that c is random. That is to say, if A can distinguish the two hybrids, then B can solve the DDH problem with the same probability. This completes the proof of security of IH-PSI against a semi-honest S.
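All three reductions above rely on the Decisional Diffie–Hellman (DDH) assumption, which can be stated formally as follows (standard notation; G is a cyclic group of prime order q with generator g):

```latex
% DDH assumption: the following two distributions over G^4 are
% computationally indistinguishable.
\[
\bigl\{ (g,\; g^{a},\; g^{b},\; g^{ab}) \;:\; a, b \xleftarrow{\$} \mathbb{Z}_q \bigr\}
\;\approx_c\;
\bigl\{ (g,\; g^{a},\; g^{b},\; g^{c}) \;:\; a, b, c \xleftarrow{\$} \mathbb{Z}_q \bigr\}
\]
% Equivalently, every PPT adversary A has negligible distinguishing advantage:
\[
\mathrm{Adv}^{\mathrm{DDH}}_{\mathcal{A}}(\lambda) =
\bigl|\, \Pr[\mathcal{A}(g, g^a, g^b, g^{ab}) = 1]
       - \Pr[\mathcal{A}(g, g^a, g^b, g^{c}) = 1] \,\bigr|
\le \mathrm{negl}(\lambda).
\]
```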

Security analysis of SecMM
In Algorithm 4, the message obtained by C is the ciphertext ⟦Z_1⟧, and the message obtained by S is the ciphertext ⟦X⟧. For C, its private input is protected by the HE scheme: as long as the private key is not disclosed, the private input is safe. For S, its private input is hidden in the ciphertext ⟦Z_1⟧. Formally, we have the following theorem.

Theorem 8 (Security of SecMM against a semi-honest S). Assume that the additive HE scheme is indistinguishable under chosen-plaintext attacks. Then the protocol of SecMM is secure in Definition 1.

Proof We construct a PPT simulator SIM_S to simulate the view of S in the protocol execution. For the functionality F_SecMM, VIEW_S(X, Y, K, pk_c, sk_c) consists of S's input Y, its randomness r_s, and the obtained ciphertext ⟦X⟧. Given K, pk_c, Y, and Z_2, SIM_S generates a simulation of VIEW_S(X, Y, K, pk_c, sk_c) as follows: it randomly selects a matrix X', encrypts it with pk_c to obtain ⟦X'⟧, and outputs (Y, r_s, ⟦X'⟧). Since the additive HE scheme is indistinguishable under chosen-plaintext attacks, the probability distributions of S's view and SIM_S's output are computationally indistinguishable. Hence, we claim that Eq. (3) holds. This completes the proof of security of SecMM in the case of a semi-honest S.
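As a concrete illustration, the following self-contained Python sketch shows how a SecMM-style masked matrix product can be realized on top of a textbook Paillier scheme. This is our illustrative reconstruction, not the paper's implementation: the demo primes are tiny (insecure), and the function name `secmm` and the masking of X·Y with a random share Z_2 are assumptions based on the protocol description above.

```python
import math
import random

def keygen(p=10007, q=10039):
    """Toy Paillier key generation (tiny demo primes; NOT secure)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    n2 = n * n
    # With g = n + 1, L(g^lam mod n^2) = lam, so mu = lam^{-1} mod n.
    mu = pow(lam, -1, n)
    return (n, n2), (lam, mu, n, n2)

def enc(pk, m):
    """Enc(m) = (1+n)^m * r^n mod n^2 (additively homomorphic)."""
    n, n2 = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2

def dec(sk, c):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, where L(x) = (x-1)/n."""
    lam, mu, n, n2 = sk
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

def secmm(X, Y, pk, sk):
    """C holds X, S holds Y; output additive shares Z1 (to C) and Z2
    (to S) of X @ Y mod n.  S only ever sees ciphertexts of X."""
    n, n2 = pk
    # --- Party C: encrypt X entrywise and send [[X]] to S ---
    EX = [[enc(pk, v) for v in row] for row in X]
    # --- Party S: homomorphically compute [[X @ Y - Z2]] ---
    rows, inner, cols = len(X), len(Y), len(Y[0])
    Z2 = [[random.randrange(n) for _ in range(cols)] for _ in range(rows)]
    EZ1 = []
    for i in range(rows):
        row = []
        for j in range(cols):
            c = enc(pk, (-Z2[i][j]) % n)  # start from Enc(-z2_ij)
            for k in range(inner):
                # Enc(x)^y = Enc(x*y); ciphertext product adds plaintexts.
                c = (c * pow(EX[i][k], Y[k][j] % n, n2)) % n2
            row.append(c)
        EZ1.append(row)
    # --- Party C: decrypt the returned ciphertexts to get its share ---
    Z1 = [[dec(sk, c) for c in row] for row in EZ1]
    return Z1, Z2
```

A quick check of the share reconstruction: with X = [[1, 2], [3, 4]] and Y = [[5, 6], [7, 8]], the shares satisfy (Z1 + Z2) mod n = X·Y = [[19, 22], [43, 50]], while S learns only IND-CPA ciphertexts of X.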

Security analysis of IH-VLR
In Algorithm 5, we can see that during the model training process, C and S finish the interactive computing tasks via the SecMM protocol, while all other computing tasks are finished locally. Therefore, Algorithm 5 is secure if the protocol of SecMM is secure. Formally, we have the following theorems.
Theorem 9 (Security of IH-VLR against a semi-honest C). Assume that the protocol of SecMM is secure against a semi-honest C. Then the protocol of IH-VLR is secure in Definition 1.
Theorem 10 (Security of IH-VLR against a semi-honest S). Assume that the protocol of SecMM is secure against a semi-honest S. Then the protocol of IH-VLR is secure in Definition 1.
Proof Based on the above analysis, see the proofs of Theorem 7 and Theorem 8 for details.

Experiments
In this section, we provide several experiments to validate the feasibility and performance of our scheme.

Dataset description
We use two classical classification benchmark datasets from the UCI repository in our experiments: Diabetes (Smith et al 1988) and Breast Cancer (Bache and Lichman 2013).
• Diabetes: This dataset consists of medical measurements from 768 female patients older than 21 years. Each sample has 7 features, including blood pressure, body mass index, age, and plasma glucose concentration.

Implementation settings
We implement the system in Python, and all experiments are performed on an Intel Core i7-7500U @ 2.70 GHz with 2 CPU cores and 12 GB RAM. In our settings, we assume that the secure screening process has been completed, which means that the input to the model is secret shares of the data. Meanwhile, we omit the performance analysis of PSI and focus only on the model training part. Besides, we divide the dataset vertically into two parts and distribute them to parties C and S. Table 3 describes the partition details.
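The vertical split described here (each party keeps every sample but holds a disjoint subset of feature columns) can be sketched in pure Python. The column counts and the party names below are hypothetical, not the paper's actual partition from Table 3:

```python
def vertical_partition(rows, n_features_c):
    """Split each sample's feature vector column-wise between two parties.

    Party C receives the first n_features_c columns, party S the rest;
    both keep every sample, aligned on the same (PSI-matched) sample IDs.
    """
    party_c = [row[:n_features_c] for row in rows]
    party_s = [row[n_features_c:] for row in rows]
    return party_c, party_s

# Example: 3 aligned samples with 5 features each; 2 columns go to party C.
data = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.1, 1.2, 1.3, 1.4, 1.5],
    [2.1, 2.2, 2.3, 2.4, 2.5],
]
Xc, Xs = vertical_partition(data, 2)
```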

Parameters settings
We use Paillier (Paillier 1999) as our additive HE scheme and set the key length to 1024 bits. For the diabetes dataset, we set the batch size, number of iterations, and learning rate to 64, 30, and 0.1, respectively; for the breast cancer dataset, these settings are 32, 30, and 0.05. In addition, we split the data into training and test sets at a ratio of 7:3.
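As a sketch of how these hyper-parameters drive training (the 7:3 split and mini-batching), the helpers below use only the standard library; the random seed and function names are ours, not the paper's:

```python
import random

def train_test_split(samples, train_ratio=0.7, seed=42):
    """Shuffle and split samples at the given ratio (7:3 by default)."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(samples) * train_ratio)
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

def mini_batches(samples, batch_size):
    """Yield consecutive mini-batches for one pass over the training set."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# Breast-cancer-style settings from the paper: batch size 32, 30 iterations.
data = list(range(569))            # stand-in for the 569 samples
train, test = train_test_split(data)
```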

Comparison methods
To evaluate the effectiveness of our scheme, we conduct comparison experiments with existing related works that also use a two-party LR model.

Comparisons results and analysis
We first analyze the complexity of the secure screening protocols. Assume that the dataset size is n×m and the number of target features is k. The time complexity of the secure screening protocols is then approximately O(nm × T_Enc + nk × T_Add + nk × T_Dec), where T_Enc, T_Add, and T_Dec denote the time of a single encryption, homomorphic addition, and decryption, respectively. The cost of protocol execution clearly grows with the dataset size, so a large dataset can seriously affect the protocol's efficiency. Optimizing the secure screening protocol is therefore left as future work.
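The cost model above can be turned into a quick back-of-the-envelope estimator. The primitive timings below are made-up placeholders for a 1024-bit Paillier setup, not measurements from the paper:

```python
def screening_cost(n, m, k, t_enc, t_add, t_dec):
    """Estimated secure-screening time following the model
    O(n*m*T_Enc + n*k*T_Add + n*k*T_Dec), in the same unit as the inputs."""
    return n * m * t_enc + n * k * (t_add + t_dec)

# Hypothetical per-operation timings (seconds) for illustration only.
T_ENC, T_ADD, T_DEC = 5e-3, 2e-5, 2e-3

# Breast-cancer-sized input: 569 samples, 30 features, 10 target features.
cost = screening_cost(n=569, m=30, k=10,
                      t_enc=T_ENC, t_add=T_ADD, t_dec=T_DEC)
```

Since the n·m encryption term dominates, this also makes the scaling bottleneck noted above immediately visible.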

Effectiveness
As shown in Table 4, we compare our proposed scheme with related works. In our experiments, the baseline is the model trained on plaintext. As expected, the baseline achieves the best performance on all evaluation metrics. We also compare the main metrics with respect to iterations across the different schemes. In Fig. 4, the loss function converges very fast. Moreover, the losses on diabetes (Fig. 4a) and breast cancer (Fig. 4b) become stable and nearly coincide as the number of iterations increases. This is because all schemes use efficient approximations of the sigmoid function, i.e., Taylor expansion (Hardy et al 2017) and minimax approximation (Chen et al 2018). In Figs. 5 and 6, we test the accuracy and AUC of the models; there are small differences in performance across datasets. For the breast cancer dataset, both metrics perform well across the schemes (Figs. 5b and 6b). For example, at 30 iterations the AUC is 0.999 and the accuracy reaches 96%; the best performance is achieved by the baseline, followed by Yang et al (2019b), and finally the SS+HE scheme. For the diabetes dataset (Fig. 5a), our scheme drops by only 0.4% in accuracy compared with the baseline and other works (Yang et al 2019b; Chen et al 2021). In Fig. 6a, our scheme consistently achieves better AUC than the others except Hardy et al (2017), although all eventually converge. Therefore, our scheme is able to achieve better performance in a shorter time. (Table 4 The performance of IHVLR with different datasets; our scheme's results are displayed in bold to distinguish them from those of the other schemes.)
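The sigmoid approximations mentioned above can be illustrated directly. The degree-3 Taylor expansion around 0 is σ(x) ≈ 1/2 + x/4 − x³/48 (the minimax variant would instead use numerically fitted coefficients); such low-degree polynomials matter here because additive HE supports only additions and plaintext multiplications:

```python
import math

def sigmoid(x):
    """Exact logistic function, usable only on plaintext."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_taylor3(x):
    """Degree-3 Taylor expansion of sigmoid around 0: 1/2 + x/4 - x^3/48.

    HE-friendly: evaluating it needs only additions and multiplications.
    """
    return 0.5 + x / 4.0 - x ** 3 / 48.0

# The approximation is tight near 0, where normalized inputs mostly fall.
for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  taylor3={sigmoid_taylor3(x):.4f}")
```

On [−1, 1] the absolute error stays below 0.01, which is why the loss curves of the approximated schemes nearly coincide with the baseline.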

Efficiency
To evaluate the efficiency of our work, we test the runtime of the different schemes. From Table 4, we can see that the work of Yang et al (2019b) performs better in VFL settings, since it involves fewer homomorphic operations. The 'SS+HE' scheme performs more encryption operations over the shares, which costs more time than the 'HE' scheme (Hardy et al 2017). In addition, the work of Chen et al (2021) shares the model while we share the data, so our scheme spends more time on matrix multiplication operations. Fortunately, the additional cost is acceptable. Besides, the parties communicate via local sockets in our settings, so network latency is not measured in the experiments.

Security
To begin with, the plaintext scheme has the best performance, but it does not fit VFL scenarios. The other schemes satisfy the basic security definition in VFL. HECLR requires a trusted third party, which is not allowed, or raises security problems, in some scenarios. HELR and SSHELR remove the third party and solve the problem of data sparsity in specific scenarios, respectively. However, none of them considers the security of the data preparation process. Our proposed scheme, built on the requirement of intention-hiding, not only achieves privacy enhancement but also guarantees model performance.

Conclusions
In this paper, we studied intention-hiding in model training to address privacy-preserving requirements in real-life applications. To this end, we proposed the idea of intention-hiding vertical federated learning. First, we constructed two secure screening protocols to enhance feature engineering, and then we presented a new PSI protocol to achieve sample alignment. Next, we used logistic regression to present the process of intention-hiding vertical federated learning by combining homomorphic encryption and secret sharing. Finally, we implemented the framework and conducted experiments on it. In the future, we will explore more solutions to optimize the efficiency of the framework and extend it to more machine learning models.

• Enc(pk, m) → c: Given a plaintext message m, encrypt it with pk to generate a ciphertext c.
• Dec(sk, c) → m: Given a ciphertext c, decrypt it with sk to return the plaintext m.
• Homomorphic operations: Given ciphertexts ⟦a⟧ and ⟦b⟧, addition satisfies ⟦a⟧ + ⟦b⟧ = ⟦a + b⟧; given a ciphertext ⟦a⟧ and a plaintext b, multiplication satisfies ⟦a⟧ * b = ⟦a * b⟧.

Fig. 2 Architecture of the intention-hiding vertical federated learning for medical data scenarios

Fig. 3 Process of secure screening protocols

Proof (security of SFS against a semi-honest C) We construct a PPT simulator SIM_C to simulate the view of C in the protocol execution. For the functionality F_SFS, VIEW_C(σ, D, K, I_c, I_s) consists of C's input σ, its randomness r_c, the obtained ciphertext ⟦d_c⟧, and I_c. Given K, σ, and I_c, SIM_C generates a simulation of VIEW_C(σ, D, K, I_c, I_s) as follows: it randomly selects a matrix d_c', encrypts it with pk_s to obtain ⟦d_c'⟧, and outputs (σ, I_c, r_c, ⟦d_c'⟧). Since both ⟦d_c⟧ and ⟦d_c'⟧ are ciphertexts under pk_s, they appear indistinguishable to C. Consequently, the probability distributions of C's view and SIM_C's output are identical. Hence, we claim that Eq. (3) holds. This completes the proof of security of SFS in the case of a semi-honest C.

Proof (security of SSS against a semi-honest C) We construct a PPT simulator SIM_C to simulate the view of C in the protocol execution. For the functionality F_SSS, VIEW_C(ρ, τ, I_c, I_s) consists of C's input ρ, τ, its randomness r_c, the obtained values d, µ_2, and its output I_c'. Given K, ρ, τ, I_c, and I_c', SIM_C generates a simulation of VIEW_C(ρ, τ, I_c, I_s) as follows: it randomly selects values d', µ_2' and outputs (ρ, τ, I_c, I_c', r_c, d', µ_2'). Since d, d' and µ_2, µ_2' are indistinguishable random values to C, the probability distributions of C's view and SIM_C's output are identical. Hence, we claim that Eq. (3) holds. This completes the proof of security of SSS in the case of a semi-honest C.

Fig. 4 Loss comparison with respect to iterations over different schemes

• We enhance the work of privacy-preserving feature engineering and propose a new PSI protocol to implement sample alignment in VFL.
• We construct a logistic regression algorithm with intention-hiding for two parties to describe the process of IHVFL in detail.
• We provide extensive experiments on public medical data to validate the feasibility of our proposed scheme. For example, the results show that our model achieves better efficiency (less than 5 min) and accuracy (97%) on the Breast Cancer dataset.

Table 1 Comparisons of related work with the proposed scheme
• Prior work: For intersection membership privacy across privacy-sensitive organizations, a VFL framework was proposed that allows each party to preserve private, sensitive membership information. Addressed challenge: intersection membership privacy.
• This article: For the privacy-protection requirement of hiding the model training intention, an Intention-Hiding VFL framework is proposed to achieve privacy enhancement in VFL. Addressed challenge: intention privacy.

Table 2 Notations and descriptions

Theorem 7 (Security of SecMM against a semi-honest C). Assume that the additive HE scheme is indistinguishable under chosen-plaintext attacks. Then the protocol of SecMM is secure in Definition 1.

Proof We construct a PPT simulator SIM_C to simulate the view of C in the protocol execution. For the functionality F_SecMM, VIEW_C(X, Y, K, pk_c, sk_c) consists of C's input X, its randomness r_c, and the obtained ciphertext ⟦Z_1⟧. Given K, pk_c, sk_c, X, and Z_1, SIM_C generates a simulation of VIEW_C(X, Y, K, pk_c, sk_c) as follows: it encrypts Z_1 with pk_c to obtain ⟦Z_1⟧', and outputs (X, r_c, ⟦Z_1⟧'). The two ciphertexts ⟦Z_1⟧ and ⟦Z_1⟧' appear indistinguishable to C. Consequently, the probability distributions of C's view and SIM_C's output are identical. Hence, we claim that Eq. (3) holds. This completes the proof of security of SecMM in the case of a semi-honest C.

Table 3 Vertical partition of the used dataset

• Breast Cancer: This dataset contains 569 samples with 30 features, of which 357 are benign and 212 are malignant. It is also a binary classification dataset.