Privacy-Preserving Inference in Machine Learning Services Using Trusted Execution Environments

Krishna Giri Narra, Zhifeng Lin, Yongqin Wang, Keshav Balasubramaniam, Murali Annavaram
Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, USA
narra@usc.edu

Abstract—Preserving the privacy of user input sent to a cloud-based machine learning inference service is a critical need. One approach for private inference is to run the trained model within a secure hardware enclave. The user sends encrypted data into the enclave, where it is decrypted before the inference runs entirely within the enclave. Secure enclaves, like Intel SGX, however, impose several restrictions. First, enclaves can only access limited memory without relying on expensive paging, thereby limiting the size of the model that can be run efficiently. Second, the increasing use of accelerators like GPUs and TPUs for inference is curtailed in this mode of execution, as accelerators currently do not provide enclaves. To tackle these challenges, this work presents Origami, which provides privacy-preserving inference for large deep neural network (DNN) models through a combination of enclave execution and cryptographic blinding, interspersed with accelerator-based computation. Origami partitions the ML model into multiple partitions. The first partition receives the encrypted user input within an SGX enclave. The enclave decrypts the input and then applies cryptographic blinding to the input data and the model parameters. Cryptographic blinding is a technique that adds noise to obfuscate data.
Origami sends the obfuscated data for computation to an untrusted GPU/CPU. The blinding and de-blinding factors are kept private by the SGX enclave, thereby preventing any adversary from denoising the data when the computation is offloaded to a GPU/CPU. The computed output is returned to the enclave, which decodes the computation on the noisy data using the unblinding factors privately stored within SGX. This process may be repeated for each DNN layer, as has been done in the prior work Slalom. However, the overhead of blinding and unblinding the data is a limiting factor to scalability. Origami relies on the empirical observation that the feature maps after the first several layers cannot be used, even by a powerful conditional GAN adversary, to reconstruct the input. Hence, Origami dynamically switches to executing the rest of the DNN layers directly on an accelerator, without needing any further cryptographic blinding intervention to preserve privacy. We empirically demonstrate that using Origami, a conditional GAN adversary, even with an unlimited inference budget, cannot reconstruct the input. We implement and demonstrate the performance gains of Origami using the VGG-16 and VGG-19 models. Compared to running the entire VGG-19 model within SGX, Origami improves the performance of private inference from the 11x achieved by Slalom to 15.1x.

I. INTRODUCTION

Deep learning (DL) has enabled significant strides in computer vision, machine translation, robotics, healthcare, etc. Training on large volumes of data is necessary to make supervised deep learning models accurate. As a consequence, the development of DL models happens primarily at organizations that have access to large data sets. After training, the DL models may be deployed in the cloud to serve user requests.
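The blind-offload-unblind flow described above can be illustrated for a single linear layer. The sketch below is a simplified illustration, not the Origami or Slalom implementation: all variable names are hypothetical, only the input is blinded (the papers also blind or quantize model parameters), and the exactness of the unblinding relies on the linearity of the layer, i.e., W(x + r) - Wr = Wx.

```python
# Simplified sketch of additive cryptographic blinding for one linear
# layer; hypothetical names, not the actual Origami/Slalom code.
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((4, 3))   # layer weights
x = rng.standard_normal(3)        # decrypted user input (inside the enclave)

# --- inside the enclave: blind the input with a secret random vector ---
r = rng.standard_normal(3)        # blinding factor, never leaves the enclave
x_blinded = x + r                 # obfuscated input sent to the untrusted device
unblind = W @ r                   # unblinding factor, kept private in the enclave

# --- on the untrusted GPU/CPU: compute on the blinded data only ---
y_blinded = W @ x_blinded

# --- back inside the enclave: subtract the noise contribution ---
y = y_blinded - unblind           # equals W @ x by linearity

assert np.allclose(y, W @ x)
```

Because the accelerator only ever sees x + r, and r is drawn fresh inside the enclave, the untrusted side cannot recover x without the blinding factor; the enclave recovers the exact result at the cost of one extra matrix-vector product per layer, which is the blinding/unblinding overhead the abstract refers to.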
With the rise in MLaaS (Machine Learning as a Service) offerings by cloud vendors like Amazon AWS, Microsoft Azure and Google Cloud, organizations may deploy pre-trained models in the cloud, and users send their data, such as images and text, for inference. While running in the cloud, the DL models are exposed to a wide attack surface consisting of malicious users, compromised hypervisors, and physical snooping, which can lead to data leakage. Users expect the service providers to protect the confidentiality of their data. It is the responsibility of the service providers to meet this expectation and not compromise the privacy of user data, accidentally or otherwise. Regulations like GDPR are attempting to enforce this requirement on all organizations handling private user data.

One approach to protecting the confidentiality of data is using cryptographic deep learning models [11], [23]. These DL models can process encrypted user data and as a result cannot leak confidential user data. The main limitation of this approach is that it can take orders of magnitude more processing time than non-cryptographic DL models, which is a severe limitation in user-facing services.

Another technique to protect confidentiality is leveraging Trusted Execution Environments (TEEs) like Intel SGX [26], ARM TrustZone [1] or Sanctum [7]. Classic techniques like data encryption can protect the data during its storage and communication phases. TEEs complement these protections by protecting the data during the computation phase. TEEs achieve

arXiv:1912.03485v1 [cs.LG] 7 Dec 2019