IM Session Identification by Outlier Detection in Cross-correlation Functions

Saad Saleh ∗ , Muhammad U. Ilyas ∗ , Khawar Khurshid ∗ , Alex X. Liu ‡ and Hayder Radha §
∗ Dept. of Electrical Engineering, School of Electrical Engineering and Computer Science, National University of Sciences and Technology, H-12, Islamabad – 44000, Pakistan
‡ Dept. of Computer Science and Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA
§ Dept. of Electrical and Computer Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA
Email: {saad.saleh, usman.ilyas, khawar.khurshid}@seecs.edu.pk ∗ , alexliu@cse.msu.edu ‡ , radha@egr.msu.edu §

Abstract- The identification of encrypted Instant Messaging (IM) channels between users is made difficult by the presence of variable and high levels of uncorrelated background traffic. In this paper, we propose a novel Cross-correlation Outlier Detector (CCOD) to identify communicating end-users in a large group of users. Our technique uses traffic flow traces between individual users and the IM service provider's data center. We evaluate the CCOD on a data set of Yahoo! IM traffic traces with an average SNR of −6.11 dB (the data set includes ground truth). Results show that our technique achieves an 88% true positive (TP) rate, a 3% false positive (FP) rate and a 96% ROC area. The performance of previous correlation-based schemes on the same data set was limited to a 63% TP rate, a 4% FP rate and an 85% ROC area.

Keywords- Link de-anonymization; instant messaging; security; privacy

I. INTRODUCTION

A. Background and Motivation

Instant Messaging (IM) services are projected to reach 1.4 billion users worldwide by 2016 [1]. IM services provide mostly free, ubiquitous access, mobility and privacy. However, user privacy has come under unlawful attack in the past few years, e.g. the theft of the data of 35 million users in Korea in 2013 [2].
Government agencies, including the National Security Agency (NSA), have also breached the privacy of millions of users [3].

The aim of our research is to assess the vulnerability of IM sessions to de-anonymization attacks (identifying who is communicating with whom) using only transport layer session traces. Such link de-anonymization is challenging for the following reasons: (1) IM messages are now often encrypted (only the IP and TCP headers are visible, and these become infeasible to log on a large scale), and (2) the IM data center establishes separate TCP connections with the source and destination users (at no time does any packet contain the IP addresses of both end users). The complexity of de-anonymization increases further in the following practical scenarios: (1) a user holding multiple simultaneous message sessions, (2) thousands of users communicating through the IM data center at any time, (3) duplicate packets due to retransmissions and (4) out-of-order packet delivery.

B. Limitations of Prior Art

Several prior works have focused on link de-anonymization. Time Series Correlation (TSC), the baseline approach, has a TP rate of 63% at a signal-to-noise ratio (SNR) of −6.11 dB [4]. The major factors behind this performance deterioration include delay, jitter, buffering, reordering, duplicate messages and server messages. In the area of de-anonymization of mix-networks, several studies focused on the computation of mutual information between ingress and egress traffic flows. Their major limitations are high time complexity and reliance on data that must be collected from multiple points inside the network. In social-network de-anonymization, data sparsity and membership information are used to de-anonymize networks. Here, the requirement of detailed user information becomes a major limiting factor. Several de-anonymization attempts have been made over the Tor network, but their major emphasis was the identification of traffic using various fingerprints.
The use of correlation has been limited in such breaching attempts. In our previous work, we showed that the correlation of wavelet-decomposed time series of users' traffic traces can successfully breach IM session privacy [4]. In a recent study, we showed that the cause-effect relationship between packets appearing in the traffic traces of two communicating ("talking") users can be leveraged to de-anonymize user sessions [5]. Time complexity was a major limiting factor for these approaches.

C. Proposed Approach

In this paper, we propose a novel Cross-correlation Outlier Detector (CCOD) to de-anonymize users' IM sessions using only undirected transport layer traffic traces collected at the data center (or gateway) in the form of flow logs. Our idea leverages the limited delay between the appearance of a packet in a sending user's traffic trace and its appearance in the receiving user's traffic trace. We expect talking users to have a high cross-correlation statistic, while for non-talking users the appearance of packets in the same time slot is expected to be coincidental. To de-anonymize a user, we compute the cross-correlation function of the time series of her traffic trace, derived from her flow log, with the respective time series of all other users. Therefore, the time complexity of our approach is Θ(N), where N is the number of IM users. Next, we estimate the distribution of the cross-correlation function at all non-zero time shifts. Finally, we apply a binary classifier to the value of the cross-correlation function at zero time shift and determine