Understanding Forgery Properties of Spam Delivery Paths

Fernando Sanchez, Florida State University, sanchez@cs.fsu.edu
Zhenhai Duan, Florida State University, duan@cs.fsu.edu
Yingfei Dong, University of Hawaii, yingfei@hawaii.edu

ABSTRACT

It is well known that spammers can forge the header of an email, in particular, the trace information carried in the Received: fields, as an attempt to hide the true origin of the email. Despite its critical importance for spam control and holding accountable the true originators of spam, there has been no systematic study on the forgery behavior of spammers. In this paper, we provide the first comprehensive study on the Received: header fields of spam emails to investigate, among others, to what degree spammers can and do forge the trace information of spam emails. Towards this goal, we perform empirical experiments based on two complementary real-world data sets: a 3-year spam archive with about 1.84 M spam emails, and the MX records of about 1.2 M network domains. In this paper, we report our findings and discuss the implications of the findings on various spam control efforts, including email sender authentication and spam filtering.

1. INTRODUCTION

Due to the weak security design of the Simple Mail Transfer Protocol (SMTP) [13], spammers have immense power and flexibility in forging email headers to mislead email recipients about the real sender of a spam email and to hide the true origin of the email. To ease exposition, in this paper we refer to all categories of unwanted emails as spam emails (including, for example, spam, phishing emails, and email-based extortion and threats) and senders of these emails as spammers. The ability of spammers to forge email headers often complicates the spam control efforts and makes it hard to hold accountable the true spam originators. This presents a great challenge for law enforcement to properly investigate and prosecute email-based criminals [1].
On the other hand, despite its critical importance for spam control and holding accountable the true originators of spam, there has been no systematic study on the forgery behavior of spammers, only anecdotal evidence of spam header forgery. In this paper we provide the first comprehensive study on the forgery behavior of spammers. Given the importance of the trace information carried in the Received: header fields in the investigation of the true origin of a spam email, we concentrate our efforts on the Received: header fields of spam emails to investigate, among others, to what degree spammers can and do forge the trace information of spam emails [16].

CEAS 2010 - Seventh annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, July 13-14, 2010, Redmond, Washington, US

Towards this goal, we perform empirical experiments based on two complementary real-world data sets. The first one is a 3-year spam archive from 2007 to 2009 [9], which contains about 1.84 M spam emails. We extract the Received: header fields of each spam email, and refer to the sequence of mail servers carried in these header fields as the spam delivery path of the email. The second data set is the mail exchanger (MX) records of about 1.2 M network domains. We use the study on the MX records of the network domains to interpret and confirm the main findings from the first data set. Our main findings (regarding the degree to which spammers can and do forge spam delivery paths) are the following.

The number of nodes on spam delivery paths is small and decreased over the 3-year time span. For example, considering the portion of the path from the (claimed) origin to the first internal mail server of the recipient network, the average number of nodes on the paths was 2.57 in 2007 and decreased to 2.34 in 2008 and 2009.
Moreover, considering the same portion of the paths, about 45%, 68%, and 66% of spam emails have a path of only two hops in 2007, 2008, and 2009, respectively. That is, such emails were directly delivered from the spam originating machines to the recipient-side mail servers, without any attempt to fake the spam delivery paths. Such emails were likely sent from compromised machines or members of spamming botnets [22, 4]. Although it is tempting to argue that such spammers do not fake the trace information because they are not concerned with a spamming botnet member being identified, as we will discuss in Section 4, it is hard, if not impossible, for such spammers to hide the true origin even if they fake the trace information.

Our investigation of the MX records of the 1.2 M network domains shows that the majority (90%) of domains only have mail servers in one domain, which means that the majority of network domains on the Internet today do not need a third party to provide the backup relay service; emails destined to these domains should be directly delivered to their own mail servers. The trend of using mail servers in a single domain helps shorten the path that an email traverses from the sender domain to the recipient domain, and makes it hard for spammers to create forged but undetectable trace information.

Our findings have important implications on a broad range of spam control efforts, including email sender authentication schemes and spam filtering. They also help guide the efforts of law enforcement in investigating email-based crimes.