Private Two-Party Cluster Analysis Made Formal & Scalable

Xianrui Meng (Amazon Web Services, Inc.)
Alina Oprea (Northeastern University)
Dimitrios Papadopoulos (HKUST)
Nikos Triandopoulos (Stevens Inst. of Technology)

ABSTRACT

Machine Learning (ML) is widely used for predictive tasks in numerous important applications, most successfully in the context of collaborative learning, where a plurality of entities contribute their own datasets to jointly deduce global ML models. Despite its efficacy, this new learning paradigm fails to encompass critical application domains, such as healthcare and security analytics, that involve learning over highly sensitive data, where privacy risks limit entities to individually deducing local models using solely their own datasets. In this work, we present the first comprehensive study of privacy-preserving collaborative hierarchical clustering, featuring scalable cryptographic protocols that allow two parties to safely perform cluster analysis over their combined sensitive datasets. For the problem at hand, we introduce a formal security notion that achieves the required balance between intended accuracy and privacy, and we present a class of two-party hierarchical clustering protocols that guarantee strong privacy protection, provable in our new security model. Crucially, our solution employs modular design and judicious use of cryptography to achieve high degrees of efficiency and extensibility. Specifically, we extend our core protocol to obtain two secure variants that significantly improve performance: an optimized variant for single-linkage clustering and a scalable approximate variant. Finally, we provide a prototype implementation of our approach and experimentally evaluate its feasibility and efficiency on synthetic and real datasets, obtaining encouraging results.
For example, an end-to-end execution of our secure approximate protocol over 1M 10-dimensional records completes in 35 seconds, transferring only 896KB and achieving 97.09% accuracy.

1. INTRODUCTION

Big-data analytics is a ubiquitous practice with a noticeable impact on our lives. Our digital interactions produce massive amounts of data that are processed in order to discover unknown patterns or correlations, which, in turn, are used to draw safe conclusions or make informed decisions. At the core of this lies Machine Learning (ML), for devising complex data models and predictive algorithms that provide hidden insights and automated actions while optimizing certain objectives. Example applications successfully employing ML frameworks include, among others, market forecasting, service personalization, speech/face recognition, autonomous driving, health diagnostics, and security analytics.

Of course, data analysis is only as good as the analyzed data, but this goes beyond the need to properly inspect, cleanse, or transform high-fidelity data prior to modeling. In most learning domains, analyzing "big data" has a twofold meaning: volume and variety. First, the larger the dataset available to an ML algorithm, the better its learning accuracy, as irregularities due to outliers fade away faster. Indeed, scalability to large dataset sizes is very important, especially in unsupervised learning, where model inference uses unlabelled observations (evading the points of saturation encountered in supervised learning, where new training sets improve accuracy only marginally). Also, the more varied the collected data, the more elaborate its analysis, as degradation due to noise is reduced and domain coverage increases.
Indeed, for a given learning objective, say classification or anomaly detection, combining more datasets of similar type but different origin enables the discovery of more complex, or interesting, hidden structures and of richer association rules (correlation or causality) among data attributes. So, ML models improve their predictive power when they are globally built over multiple datasets owned and contributed by different entities, in what is termed collaborative learning, widely considered the gold standard [78].

Privacy-preserving hierarchical clustering. Several learning tasks of interest, across a variety of application domains such as healthcare or security analytics, demand deriving accurate ML models over highly sensitive data, e.g., personal, proprietary, customer, or other types of data that induce liability risks. Since collaborative learning inherently implies some form of data sharing, entities in possession of such confidential datasets are by default left with no option other than simply running their own local models, severely impacting the efficacy of the learning task at hand. Thus, in the context of data analytics, privacy risks are the main impediment to collaboratively learning richer models over large volumes of varied, individually contributed data.

The security and database communities have recently embraced the powerful concept of Privacy-preserving Collaborative Learning (PCL), the premise being that effective analytics over sensitive data is feasible by building global models in ways that protect privacy. This is typically achieved by applying Secure Multiparty Computation (MPC) or Differential Privacy (DP) to data analytics, so that learning occurs over encrypted or sanitized data.¹ One notable example is the recent framework for privacy-preserving federated learning [14], where model parameters are aggregated and shared by multiple clients to generate a global model.
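To make the underlying learning task concrete, the following is a minimal, non-private sketch of single-linkage agglomerative clustering, i.e., the plaintext computation that a secure two-party protocol for this task would emulate over the parties' combined data. The function name, stopping criterion, and example points are illustrative, not taken from the paper.

```python
# Illustrative (non-private) single-linkage agglomerative clustering.
# Each point starts as its own cluster; the two closest clusters are
# repeatedly merged until the desired number of clusters remains.
from math import dist

def single_linkage(points, target_clusters):
    """Greedily merge clusters under the single-linkage criterion."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(single_linkage(pts, 2))  # two well-separated groups of two points
```

This naive version is quadratic per merge; the challenge the paper addresses is performing such distance comparisons and merges when the points are split between two mutually distrusting parties.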
Existing work on PCL almost exclusively addresses supervised rather than unsupervised learning.

¹ Yet, neither approach is directly applicable to large-scale PCL: running crypto-heavy ML tasks over large datasets impairs scalability, and learning over sanitized data may allow leakage (e.g., [34, 45, 76]).

arXiv:1904.04475v2 [cs.CR] 28 Oct 2019