In this paper, a decision tree is used to search for pathogenic SNP loci. Differential privacy and the decision tree are combined to preserve the privacy of SNP data during epistasis detection. This section consists of dimensionality reduction, selection of a few important features (SNP loci), and epistasis detection combined with differential privacy preservation.
Candidate feature selection by fusion strategy
Feature selection is prevalent in two-stage methods to remove redundant and unrelated features. However, previous filtering criteria are based on a single main effect, so features (SNP loci) that have a weak main effect but a strong interaction might be pruned. Thus, the Relief algorithm and mutual information are applied to score and rank the SNP loci, respectively, aiming to retain features with a weak main effect but an obvious interaction effect as far as possible. The two importance scores of each SNP are merged to generate the candidate set of SNPs.
The Relief algorithm (Kira and Rendell 1992) assigns the features different weights W={w_{1},w_{2},...,w_{n}} according to the correlation between the corresponding features and the class. A threshold δ can be specified from the data characteristics; if w_{k}<δ, the feature is removed. The correlation is based on the ability of a feature to distinguish between the nearest-distance samples. A random sample R is chosen from the training set Dtrain. It is used to find the nearest neighbor sample NH (called Near Hit) of the same class as R and the nearest neighbor sample NM (called Near Miss) of a different class from R. The weight of each feature is then updated according to Eqs. 6 and 7. Note that only discrete features are considered here.
$$ {w(k)}_{i+1}={w(k)}_{i} - \frac{diff (k,R,NH)} {m} + \frac{diff(k,R,NM)}{m} $$
(6)
$$ \text{diff}(k,{{R}_{1}},{{R}_{2}})=\left\{ \begin{array}{lll} 0 & if & {R}_{1}(k)={R}_{2}(k) \\ 1 & if & {R}_{1}(k)\neq {R}_{2}(k) \\ \end{array} \right. $$
(7)
If the distance between R and NH is less than the distance between R and NM on the kth feature, the feature is useful for distinguishing the nearest-neighbor samples of different categories, so it should be assigned a higher weight. The above process is repeated m times, and the average weight of each feature is eventually obtained. The greater the weight, the stronger the classification ability of the feature.
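As a sketch, the Relief update for discrete features (Eqs. 6 and 7) can be implemented as follows; the integer genotype encoding and the function name `relief_discrete` are illustrative assumptions, not the paper's code:

```python
import numpy as np

def relief_discrete(X, y, m=100, seed=0):
    """Illustrative Relief weights for discrete features (Eqs. 6-7).

    X: (n_samples, n_features) integer-coded genotypes (assumed encoding).
    y: (n_samples,) binary class labels. Higher weight = better separation.
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(m):
        i = rng.integers(n)
        r, cls = X[i], y[i]
        dist = (X != r).sum(axis=1)  # per-feature diff is 0/1, so Hamming distance
        dist[i] = n + 1              # exclude the sample itself
        same, other = np.where(y == cls)[0], np.where(y != cls)[0]
        nh = same[np.argmin(dist[same])]     # nearest hit
        nm = other[np.argmin(dist[other])]   # nearest miss
        # Eq. 6: penalize differing from the hit, reward differing from the miss
        w += ((X[nm] != r).astype(int) - (X[nh] != r).astype(int)) / m
    return w
```

On a toy set where one feature determines the class and another is irrelevant, the informative feature accumulates a large positive weight and the irrelevant one a negative weight.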
Mutual information (Li et al. 2013) is used to measure the degree of association between two variables by scoring the correlation between random variables SNPs and disease. Mutual information is defined as:
$$ I(X,Y)=H(X)+H(Y)-H(X,Y) $$
(8)
where H(X) is the information entropy of X, and H(X,Y) is the joint entropy of the random variables X and Y. X and Y represent a SNP locus and the disease state (i.e., control or case), respectively.
Let X={x_{1},x_{2},...,x_{n}}, and let p(x_{i})=p(X=x_{i}) denote the frequency with which x_{i} appears in X. Then \(H(X) = -\sum \limits _{i=1}^{n} {p({{x}_{i}}) \log p({{x}_{i}})}\) measures the degree of uncertainty of the random variable X; here p(x_{i}) is the distribution frequency of the different alleles at a SNP locus. I(X,Y) is the degree of association between the SNP locus X and the disease state Y: the larger the value, the stronger the association between X and Y.
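Under the definitions above, I(X,Y) can be computed directly from empirical frequencies. This sketch uses base-2 logarithms (an assumption; the paper does not fix the base) and function names of our choosing:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """H = -sum p(v) log2 p(v) over the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), as in Eq. 8."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))
```

A SNP that perfectly predicts a balanced case/control label gives I = 1 bit, while a SNP independent of the label gives I = 0.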
Two feature importance scores, W and I (I=I(X,Y)), are obtained by the aforementioned Relief algorithm and mutual information. The initial scores are normalized to W^{′} and I^{′} and summed by weights to obtain the final merged feature-ranking score. The fusion is defined as
$$ Score={p}_{1} \cdot W^{\prime}+{p}_{2}\cdot I^{\prime} $$
(9)
where p_{1} and p_{2} are the weights of the two methods, and Score represents the final feature-ranking score. The candidate SNP set is decided by Score.
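A minimal sketch of the fusion step (Eq. 9); min-max normalization and equal weights p1=p2=0.5 are our assumptions, since the paper only states that the scores are normalized and summed by weights:

```python
import numpy as np

def fuse_scores(W, I, p1=0.5, p2=0.5, top_k=10):
    """Combine Relief weights W and mutual-information scores I (Eq. 9)
    and return the indices of the top-ranked candidate SNPs."""
    def minmax(v):
        v = np.asarray(v, dtype=float)
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)
    score = p1 * minmax(W) + p2 * minmax(I)
    return list(np.argsort(score)[::-1][:top_k])  # highest Score first
```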
Decision tree based on differential privacy
The decision tree is constructed by selecting combinations of features (SNPs) from the derived candidate feature set. However, the counting information of the SNP data may lead to a risk of personal privacy breach. Differential privacy is thus merged into the construction of the decision tree as follows:

1. Add noise obeying the Laplace distribution to the sample count of the data set;
2. Use the exponential mechanism to select the splitting attribute from the attribute set;
3. Add Laplace-distributed noise to the sample count of each split node. If the node satisfies the splitting termination condition, noise is added to the sample count of the leaf node in the same way, and the class with the largest noisy leaf count is returned; otherwise, go to step 2.
Algorithm 1 offers the pseudocode of the decision tree algorithm based on differential privacy. D(i) and D_{c} represent samples at non-leaf nodes and leaf nodes, respectively. STC is the splitting termination condition, which holds if any of the following is true:

1. The class attribute of all records at the node is consistent;
2. The depth h of the decision tree is reached;
3. The allocated privacy budget ε is exhausted.
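The two privacy primitives used in the steps above can be sketched as follows; the function names are illustrative, the Laplace scale assumes a count-query sensitivity of 1, and the selection uses the standard exp(ε·q/2Δq) weighting of the exponential mechanism:

```python
import numpy as np

def noisy_count(count, eps, rng):
    """Laplace mechanism: a count query has sensitivity 1, so add Lap(1/eps)."""
    return count + rng.laplace(scale=1.0 / eps)

def exponential_mechanism(scores, sensitivity, eps, rng):
    """Select an attribute index with probability proportional to
    exp(eps * q / (2 * sensitivity)), where q is the attribute's score."""
    logits = eps * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return rng.choice(len(scores), p=p / p.sum())
```

With a large ε the mechanism almost always picks the best-scoring attribute; as ε shrinks, the choice approaches uniform and leaks less about the data.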
Pruning decision tree
When a decision tree classifies the training samples as accurately as possible, some features unique to the training set are treated as general attributes of the data set, leading to overfitting. In addition, because of the noise added in this paper, it is no longer possible to identify a leaf with pure class values: a node keeps splitting until the instances are insufficient or the depth constraint is reached. It is thus important to trim the decision tree.
Some methods implement pruning with a validation set (mutually exclusive with the training set and the test set), such as minimal cost-complexity pruning and reduced-error pruning. However, a validation set reduces the size of the training set, which would increase the relative noise in this paper. Therefore, formula 10 is used to trim the tree.
$$ H \left({D}_{i} \right) \le {\underset{v\in a}{\sum}}\, \frac{\left|D_{i,{a_{v}}}\right|}{\left|D_{i}\right|} H \left(D_{i,{a_{v}}} \right) $$
(10)
where H is the information entropy and the \(D_{i,a_{v}}\) are the leaf nodes under D_{i}. The information entropy measures the average purity of the leaf nodes, which is compared with that of their parent node. If the formula is satisfied, all leaf nodes of D_{i} are deleted and D_{i} becomes a new leaf node (Fletcher and Islam 2015).
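A sketch of this pruning check on (possibly noisy) class counts; our reading is that a subtree is collapsed when the split fails to lower the weighted leaf entropy below the parent's, and the function names are ours:

```python
import numpy as np

def entropy_counts(counts):
    """Entropy of a node from its (possibly noisy) class counts."""
    c = np.asarray(counts, dtype=float)
    p = c[c > 0] / c.sum()
    return float(-(p * np.log2(p)).sum())

def should_prune(parent_counts, children_counts):
    """Prune the leaves under a node when their weighted average entropy
    is no lower than the parent's, i.e. the split did not improve purity."""
    total = sum(sum(c) for c in children_counts)
    child_h = sum(sum(c) / total * entropy_counts(c) for c in children_counts)
    return entropy_counts(parent_counts) <= child_h
```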
Scoring Function
The scoring function of the exponential mechanism is also the splitting criterion of the decision tree, and it directly determines the quality of the splitting-attribute selection. In this paper, Information Gain and the Max operator are chosen as scoring functions. Let d=|D| be the number of records in the data set, and let r_{a} and r_{C} denote the values of attribute a and class C in a record r, respectively. Define \(D_{j}^{a} = \{ r\in D:r_{a}=j\}\), \(d_{j}^{a} = \left| D_{j}^{a} \right|\), \(D_{c}=\{r\in D:r_{C}=c\}\), and \(d_{c}=\left|D_{c}\right|\).
Information Gain.
The greater the information gain, the simpler the decision tree and the higher the classification accuracy. The information entropy of the class attribute C is defined as \(H_{C}(D)=-{\underset {c\in C}{\sum }}\,\frac {d_{c}}{d} \log \frac {d_{c}} {d}\), where d_{c} and d are the number of records belonging to class c and the total number of records, respectively. If the sample set D is split on attribute a, the resulting information gain is:
$$ \text{InfoGain}\left(D, a \right)={{H}_{C}}\left(D \right)-{{H}_{Ca}}\left(D \right) $$
(11)
where \({{H}_{Ca}}\left (D \right)={\underset {j\in a}{\sum }}\,\frac {d_{j}^{a}}{d}\cdot {{H}_{C}}\left (D_{j}^{a} \right)\) is the weighted sum of the information entropies of all subsets. Since the maximum of H_{C}(D) is log|C| and the minimum of H_{Ca}(D) is 0, the sensitivity Δq of q(D,a)=InfoGain(D,a) equals log|C|. Because C={control,case} in the SNP data, |C|=2 and Δq=log2=1.
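As an illustration of Eq. 11 (the function and variable names are ours, not the paper's pseudocode), with base-2 logarithms assumed:

```python
import numpy as np
from collections import Counter

def h_class(labels):
    """Class entropy H_C(D) = -sum (d_c/d) log2(d_c/d)."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    """InfoGain(D, a): class entropy minus the weighted entropy of the
    subsets D_j^a induced by the values of attribute a (Eq. 11)."""
    n = len(labels)
    gain = h_class(labels)
    for v in {r[attr] for r in records}:
        idx = [i for i, r in enumerate(records) if r[attr] == v]
        gain -= len(idx) / n * h_class([labels[i] for i in idx])
    return gain
```

An attribute that perfectly separates the classes attains the maximum gain H_C(D), while an attribute independent of the class yields a gain of 0.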
Max Operator.
Max operator (Breiman et al. 1984) is used to select the class with the highest frequency as the score value of the corresponding node:
$$ \text{Max} \left(D,a \right)={\underset{j\in a}{\sum }}\,\left({\underset{c}{\max }}\,\left(d_{j,c}^{a} \right) \right) $$
(12)
According to formula 12, the sensitivity Δq of q(D,a)=Max(D,a) is equal to 1.
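A sketch of Eq. 12 over per-value class counts; the input layout (one list of class counts per attribute value) is our assumption:

```python
def max_operator(class_counts_per_value):
    """Max(D, a): sum, over the values j of attribute a, of the count of the
    most frequent class in D_j^a (Eq. 12). Adding or removing one record
    changes at most one count by 1, so the sensitivity is 1."""
    return sum(max(counts) for counts in class_counts_per_value)
```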
Privacy analysis
We apply two composition properties of the privacy budget, sequential composition and parallel composition (Mcsherry and Talwar 2007), to analyze privacy. The two lemmas are as follows:
Lemma 1
(Sequential Composition) Suppose each G_{i} provides ε-differential privacy. A sequence G={G_{1},G_{2},...,G_{n}} over the data set D provides (n·ε)-differential privacy.
Lemma 2
(Parallel Composition) Suppose each G_{i} provides ε_{i}-differential privacy. The parallel composition of G={G_{1},G_{2},...,G_{n}} over a set of disjoint data sets D_{i} provides max{ε_{1},ε_{2},...,ε_{n}}-differential privacy.
Each layer of the decision tree operates on the same data set, so according to Lemma 1 the privacy budget assigned to each layer is E=ε/h. The nodes at each level split disjoint data sets, so according to Lemma 2 each node can be assigned a privacy budget less than or equal to the layer's budget; here, we set the budget of each node equal to the budget of its layer. Half of the budget assigned to each node, \({\varepsilon}^{\prime}=E/2=\varepsilon/2h\), is used to estimate the instance count of the node (adding Laplacian noise), and the other half \(\left({\varepsilon}^{\prime}=\varepsilon/2h\right)\) is used by the exponential mechanism to select the optimal splitting attribute, or to add Laplacian noise to the leaf-node instance count. Consequently, the total privacy budget consumed by the algorithm is not greater than h·(ε/2h+ε/2h)=ε, so the algorithm satisfies ε-differential privacy.
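The budget bookkeeping above amounts to the following arithmetic (a sketch; the function name is ours):

```python
def node_budgets(total_eps, h):
    """Split a total budget eps over a tree of depth h: each layer gets
    eps/h (sequential composition over the same data), and each node's
    share is halved between the noisy count and the exponential mechanism.
    Nodes within one layer act on disjoint data, so by parallel composition
    they can each use the full layer budget."""
    per_layer = total_eps / h
    return per_layer / 2, per_layer / 2  # (count noise, attribute selection)
```

For example, with ε=1.0 and h=5, each node spends 0.1 on its noisy count and 0.1 on attribute selection, and the layer totals sum to ε.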
Time complexity analysis
To generate the decision tree, the entire data set D must be scanned, and the exponential mechanism is then used to select a splitting attribute; the time complexity of this process is O(t|D|log|D|), where t is the size of the attribute set. After the exponential mechanism selects the splitting attribute, the data set must be partitioned once, which in the worst case scans the entire data set, with time complexity O(|D|). Since the decision tree depth is h, the time complexity of the algorithm is O(h|D|log|D|) for a fixed number of attributes.