Understanding Forgery Properties of Spam Delivery Paths

Fernando Sanchez
Florida State University
sanchez@cs.fsu.edu

Zhenhai Duan
Florida State University
duan@cs.fsu.edu

Yingfei Dong
University of Hawaii
yingfei@hawaii.edu

ABSTRACT

It is well known that spammers can forge the header of an email, in particular, the trace information carried in the Received: fields, as an attempt to hide the true origin of the email. Despite its critical importance for spam control and holding accountable the true originators of spam, there has been no systematic study on the forgery behavior of spammers. In this paper, we provide the first comprehensive study on the Received: header fields of spam emails to investigate, among others, to what degree spammers can and do forge the trace information of spam emails. Towards this goal, we perform empirical experiments based on two complementary real-world data sets: a 3-year spam archive with about 1.84 M spam emails, and the MX records of about 1.2 M network domains. In this paper, we report our findings and discuss the implications of the findings on various spam control efforts, including email sender authentication and spam filtering.

1. INTRODUCTION

Due to the weak security design of the Simple Mail Transfer Protocol (SMTP) [13], spammers have immense power and flexibility in forging email headers to mislead email recipients about the real sender of a spam email and to hide the true origin of the email. To ease exposition, in this paper we refer to all categories of unwanted emails as spam emails (including, for example, spam, phishing emails, and email-based extortion and threats) and senders of these emails as spammers. The ability of spammers to forge email headers often complicates the spam control efforts and makes it hard to hold accountable the true spam originators. This presents a great challenge for law enforcement to properly investigate and prosecute email-based criminals [1].

On the other hand, despite its critical importance for spam control and holding accountable the true originators of spam, there has been no systematic study on the forgery behavior of spammers, only anecdotal evidence of spam header forgery. In this paper we provide the first comprehensive study on the forgery behavior of spammers. Given the importance of the trace information carried in the Received: header fields in the investigation of the true origin of a spam email, we concentrate our efforts on the Received: header fields of spam emails to investigate, among others, to what degree spammers can and do forge the trace information of spam emails [16].

CEAS 2010 - Seventh annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, July 13-14, 2010, Redmond, Washington, US

Towards this goal, we perform empirical experiments based on two complementary real-world data sets. The first one is a 3-year spam archive from 2007 to 2009 [9], which contains about 1.84 M spam emails. We extract the Received: header fields of each spam email, and refer to the sequence of mail servers carried in these header fields as the spam delivery path of the email. The second data set is the mail exchanger (MX) records of about 1.2 M network domains. We use the study on the MX records of the network domains to interpret and confirm the main findings from the first data set.

Our main findings (regarding the degree to which spammers can and do forge spam delivery paths) are the following.

The number of nodes on spam delivery paths is small and decreased over the 3-year time span. For example, considering the portion of the path from the (claimed) origin to the first internal mail server of the recipient network, the average number of nodes on the paths is 2.57 in 2007, and decreases to 2.34 in 2008 and 2009. Moreover, for the same portion of the paths, about 45%, 68%, and 66% of spam emails have a path of only two hops in 2007, 2008, and 2009, respectively.
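
As an illustration of the delivery path defined above, the following is a minimal sketch of how the node sequence might be recovered from the Received: fields of one message. It is not the extraction procedure used in this study, which is not detailed in this section. The function name `delivery_path`, the two regular expressions, and the sample messages are all hypothetical, and the patterns assume the common "from HOST ... by HOST ..." layout, which many real Received: fields do not follow.

```python
import re
from email import message_from_string

# Assumed, simplified patterns: real Received: fields vary widely in format.
FROM_RE = re.compile(r"^from\s+([^\s;]+)", re.IGNORECASE)
BY_RE = re.compile(r"\bby\s+([^\s;]+)", re.IGNORECASE)


def delivery_path(raw_message):
    """Return the claimed node sequence, origin first, from Received: fields."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    path = []
    # Each server prepends its own Received: field, so the last field in the
    # header block is the earliest one; reverse to walk from the origin.
    for field in reversed(received):
        text = " ".join(field.split())  # unfold continuation lines
        m_from = FROM_RE.search(text)
        m_by = BY_RE.search(text)
        # The claimed origin is the "from" host of the earliest field.
        if not path and m_from:
            path.append(m_from.group(1))
        # Every field contributes the server that received the message.
        if m_by:
            path.append(m_by.group(1))
    return path


# Hypothetical messages using reserved example names and addresses.
RELAYED = (
    "Received: from relay.example.net (relay.example.net [192.0.2.10])\n"
    "\tby mx.recipient.example with ESMTP id abc; Tue, 1 Jan 2008 00:00:01 +0000\n"
    "Received: from origin.example.com (unknown [198.51.100.7])\n"
    "\tby relay.example.net with SMTP id xyz; Tue, 1 Jan 2008 00:00:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
DIRECT = (
    "Received: from origin.example.com (unknown [198.51.100.7])\n"
    "\tby mx.recipient.example with SMTP id xyz; Tue, 1 Jan 2008 00:00:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(delivery_path(RELAYED))
    print(len(delivery_path(DIRECT)))
```

Under these assumptions, a message carrying a single Received: field yields a two-node path (claimed origin and recipient-side server), which corresponds to the two-hop case counted above. The sketch takes every field at face value; it makes no attempt to decide which fields are forged.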
That is, these two-hop emails were delivered directly from the spam-originating machines to the recipient-side mail servers, without any attempt to fake the spam delivery paths. Such emails were likely sent from compromised machines or members of spamming botnets [22, 4]. It is tempting to argue that such spammers do not fake the trace information because they are not concerned about a spamming botnet member being identified. However, as we will discuss in Section 4, it is hard, if not impossible, for such spammers to hide the true origin even if they fake the trace information.

Our investigation of the MX records of the 1.2 M network domains shows that the majority (90%) of domains only have mail servers in one domain, which means that the majority of network domains on the Internet today do not need a third party to provide the backup relay service; emails destined to these domains should be directly delivered to their own mail servers. The trend of using mail servers in a single domain helps shorten the path that an email traverses from the sender domain to the recipient domain, and makes it hard for spammers to create forged but undetectable trace information.

Our findings have important implications for a broad range of spam control efforts, including email sender authentication schemes and spam filtering. They also help guide the efforts of law enforcement in investigating email-based crimes.
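
To make the MX-record observation above concrete, the following minimal sketch shows one way to measure the fraction of domains whose mail servers all fall in a single domain. It is an illustration under stated assumptions, not the analysis performed in this study. The helper names and the sample table are hypothetical, the MX host lists are assumed to have been collected already, and a mail server's domain is approximated by its last two DNS labels, which is wrong for suffixes such as co.uk (a public suffix list would be needed for a real measurement).

```python
def server_domain(hostname):
    """Approximate a host's domain by its last two labels (an assumption)."""
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def mx_server_domains(mx_hosts):
    """Set of distinct domains that the given MX hosts belong to."""
    return {server_domain(host) for host in mx_hosts}


def fraction_single_domain(mx_table):
    """Fraction of domains (with at least one MX host) served from one domain."""
    with_mx = [hosts for hosts in mx_table.values() if hosts]
    if not with_mx:
        return 0.0
    single = sum(1 for hosts in with_mx if len(mx_server_domains(hosts)) == 1)
    return single / len(with_mx)


# Hypothetical MX data: recipient domain -> MX host names.
SAMPLE_MX = {
    "example.com": ["mx1.example.com.", "mx2.example.com."],
    "example.net": ["mx.hosted-mail.example."],
    "example.org": ["mail.example.org.", "backup.relay-provider.example."],
}

if __name__ == "__main__":
    print(fraction_single_domain(SAMPLE_MX))
```

In this made-up table, two of the three domains have all their MX hosts in one domain, while `example.org` lists servers in two. The sketch only counts distinct server domains per recipient domain, so it does not distinguish a domain running its own servers from one that uses a single outside provider.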