Email Alias Detection Using Social Network Analysis

Ralf Hölzer
Information Networking Institute
Carnegie Mellon University
Pittsburgh, PA 15213
rholzer@cmu.edu

Bradley Malin
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
malin@cs.cmu.edu

Latanya Sweeney
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
latanya@privacy.cs.cmu.edu

ABSTRACT
This research addresses the problem of correctly relating aliases that belong to the same entity. Previous approaches focused on natural language processing and structured data, whereas in this research we analyze the local association, or “social”, network in which aliases reside. The network is constructed from email data mined from the Internet. Links in the network represent web pages on which two email addresses are collocated. The problem is defined as follows: given a social network S, constructed from email address collocations, and an email address E, identify any aliases for E that also appear in S. The alias detection methods are evaluated on a data set of over 14,000 University X email addresses for which ground truth relations are known. The results are reported as partial lists of k choices for possible aliases, ranked by predicted relational strength within the network. Given a source email address, 2% of all email addresses are correctly linked, at the best rank, to another alias that corresponds to the same entity, which is significantly better than the random (0.007%) and geodesic distance (1%) baseline predictions. Correct linkages increase to 15% and 30% within top-10 (0.07% of all emails) and top-100 rank lists (0.7% of all emails), respectively.

1. INTRODUCTION
Individuals on the Internet use aliases for various communication purposes. Aliases can be tailored to specific scenarios, which allows individuals to assume different aliases depending on the context of interaction.
For example, many online users utilize aliases as pseudonyms in order to protect their true identity, such that one alias is used for web forum postings and another for email correspondence. Determining when multiple aliases correspond to the same entity, or alias detection, is useful to a variety of both legitimate and illegitimate applications. Regardless of the intent behind alias detection, it is important to understand the extent to which the process can be automated.

When aliases are listed on the same webpage, this can indicate that some form of relationship exists between them. In order to leverage this relationship, we analyze several methods for alias detection based on social network analysis [19]. Social network analysis has recently been integrated into the computer science community to model several problems, including record linkage in co-authorship networks [4] and name disambiguation [13]. We assume that the network in which aliases, extracted from webpages, are situated reveals certain aspects of the social network of the entity to whom the alias corresponds.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. LinkKDD 2005, August 21, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-135-X/05/0008 ...$5.00.

Since many people use several email addresses for related purposes, we attempt to determine which email addresses correspond to the same entity by analyzing the relational network of addresses extracted from webpages.
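The network construction described above can be illustrated with a minimal sketch. It assumes a hypothetical input `pages` that maps each web page to the email addresses found on it, links two addresses whenever they are collocated on a page, and ranks candidate aliases for a source address by geodesic (shortest-path) distance. Geodesic distance is used here only because it is simple to state; it is not a reconstruction of the ranking methods evaluated in this paper. The function names, the example addresses, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_collocation_graph(pages):
    """Link two addresses whenever they appear on the same page."""
    graph = defaultdict(set)
    for addresses in pages.values():
        unique = set(addresses)
        for address in unique:
            graph[address]  # register addresses that appear alone as nodes too
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def rank_by_geodesic_distance(graph, source, k=10):
    """Return up to k reachable addresses, nearest first (ties broken by name)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search yields shortest path lengths
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    candidates = [(d, addr) for addr, d in distance.items() if addr != source]
    return [addr for _, addr in sorted(candidates)[:k]]


# Hypothetical toy data: page URL -> addresses extracted from that page.
pages = {
    "http://example.edu/roster": ["alice@example.edu", "bob@example.edu"],
    "http://example.edu/paper": ["bob@example.edu", "alice@cs.example.edu"],
    "http://example.edu/lab": ["alice@cs.example.edu", "carol@example.edu"],
}

graph = build_collocation_graph(pages)
ranking = rank_by_geodesic_distance(graph, "alice@example.edu", k=10)
print(ranking)
```

In this toy network the two addresses assumed to belong to the same person never share a page, so the second address is reachable only through an intermediate node, which is the kind of indirect evidence a network-based ranking has to exploit.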
Email addresses, a type of alias, can be distilled from a large number of web pages, such as class rosters [18], research papers [5], resumes [12], discussion boards, or USENET message archives [7]. For this paper, networks are constructed from email addresses extracted from web pages within a specific university’s system. As a result, similarities in the local network surrounding each address can be exploited to determine which aliases correspond to the same entity. Furthermore, email addresses provide another useful property for determining relationships. In contrast to other identifiers, email addresses provide a unique mapping from address to a specific entity. Thus, no disambiguation is necessary when studying email addresses as identifiers for alias detection.

The remainder of this paper is organized as follows. Section 2 reviews earlier approaches to alias detection and to determining importance between nodes in social networks. Novel methods for alias detection are discussed in Section 3. In addition, the graph representation of the network and the ranking algorithms are introduced. In Section 4, the detection methods are evaluated on a dataset for which a large number of email aliases are known. Results and limitations of the approaches are discussed in Section 5.

2. RELATED RESEARCH
Alias detection is related to the problem of alias disambiguation. The latter attempts to determine if the same alias, such as “John Smith”, refers to one or multiple entities. There are certain similarities between disambiguation and detection, and as a result, some of the methods and insights garnered from one can be applied to the other. In this section we review several approaches which have been applied to the disambiguation and detection problems. The approach of choice depends primarily on the type of underlying data to be analyzed.
Natural language processing has been successfully applied to identify whether separate writings have been authored by the same individual. Computational and statistical models were first proposed by Mosteller and Wallace [14] to solve disputes regarding the authorship of free text documents. Their models were extended by Rao et al. [17] who applied techniques from linguistics and stylometry to identify pseudonyms in a textual context on the Internet. These methods were successful in identifying aliases used by the