DAG: a general model for privacy-preserving data mining

2017-03-02T01:46:30Z (GMT) by Teo, Sin Gee
Rapid advances in automated data collection tools and data storage technology has led to the wide availability of huge amount of distributed data owned by different parties. Data mining can use the distributed data to discover rules, patterns or knowledge that are normally not discovered data owned by a single party. Thus, data mining on the distributed data can lead to new insights and economic advantages. However, in recent years, privacy laws have been enacted to protect any individual sensitive information from privacy violation and misuse. To address the issue, many have proposed privacy-preserving data mining (PPDM) based on secure multi-party computation (SMC) that can mine the distributed data with privacy preservation (i.e., privacy protection). However, most SMC-based solutions are ad-hoc. They are proposed for specific applications, and thus cannot be applied to other applications directly. Another limitation of current PPDM is with only a limited set of operators such as +,−,× and log (logarithm). In data mining primitives, some functions can involve operators such as / and √ . The above issues have motivated us to investigate a general SMC-based solution to solve the current limitations of PPDM. In this thesis, we propose a general model for privacy-preserving data mining, namely as DAG. We apply a hybrid model that combines the homomorphic encryption protocol and the circuit approach in DAG model. The hybrid model has been proven efficient in computation and effective in protecting data privacy via the theoretical and experimental proofs. Specifically, our proposed research objectives are as follows: (i) We want to propose a general model of privacy-preserving data mining (i.e., DAG) that consists of a set of secure operators. The secure operators can support many mining primitives. The two-party model which is the efficient and effective model is applied to develop secure protocols in DAG. Our secure operators can provide a complete privacy under the semi-honest model. Moreover, the secure operators are efficient in computation. (ii) We will integrate DAG model into various classification problems by proposing new privacy-preserving classification algorithms. (iii) To make our DAG model that can support wider applications, we will integrate DAG into other application domains. We will integrate DAG into ant colony optimization (ACO) to solve the traveling salesman problem (TSP) by proposing a new privacypreserving traveling salesman problem (PPTSP). In this report, we present most results of the objectives mentioned above. The DAG model is general – its operators, if pipelined together, can implement various functions. It is also extendable – new secure operators can be defined to expand the functions the model supports. All secure operators of DAG are strictly proven secure via simulation paradigm (Goldreich, 2004). In addition, the error bounds and the complexities of the secure operators are derived so as to investigate accuracy and computation performance of our DAG model. We apply our DAG model into various application domains. We first apply DAG into data mining classification algorithms such as support vector machine, kernel regression, and Na¨ıve Bayes. Experiment results show that DAG generates outputs that are almost the same as those by non-private setting, where multiple parties simply disclose their data. DAG is also efficient in computation of data mining tasks. For example, in kernel regression, when training data size is 683,093, one prediction in non-private setting takes 5.93 sec, and that by our DAG model takes 12.38 sec. In the experiment of PPTSP, a salesman can find the approximate optimal traveled distance without disclosing any city locations in TSP. Various domain applications studies show that our DAG is the general model yet efficient for secure multi-party computation.