International Journal of Computer Applications (0975 – 8887), Volume 41 – No. 7, March 2012

Detection of Fraudulent Emails by Authorship Extraction

A. Pandian
Department of MCA, SRM University, Chennai, India

(2) Department of Information Technology

ABSTRACT

Fraudulent emails can be detected by extracting authorship information from the contents of emails. This paper presents information extraction based on unique words taken from the emails. These unique words are used as representative features to train a radial basis function (RBF) network. The final weights are obtained and subsequently used for testing. The percentage of correctly identified email authors depends on the number of RBF centers and on the type of functional words used for training the RBF network. One hundred and fifty authors with over one hundred files in the sent folder of the Enron email dataset are considered. A total of 300 unique words, each three to seven characters long, are considered. Training and testing of the RBF network are done with words of different lengths. Our simulation shows the effectiveness of the proposed RBF network for email authorship identification. The accuracy of authorship identification ranges from 95% to 97%.

Keywords: email authorship identification, spam, word frequency, radial basis function

1. INTRODUCTION

As the volume of email on the Internet increases, spam and hoax mails have to be detected. The principal objective of author identification is to classify the emails belonging to an author [Koppel et al, 2002]. This approach is used in forensics for author identification in malicious emails. Certain software packages such as AntConc, Copy Catch Gold, Lexico3, Signature Stylometric System, T-lab, Yoshikoder, and WordSmith Tools use statistical methods to identify an author.
These systems use parameters such as the number of unique words, the number of content words used in the list, the total number of words in the text or vocabulary items used, vocabulary richness, mean sentence length, mean paragraph length, mean of 2-3 letter words, mean of words starting with vowels, the cumulative summation method, and bigrams. Users who intend to apply such software to email author identification need to choose the statistical analysis options that best identify the author of an email, and to obtain the characteristics that remain constant across a large number of emails written by that author. Each author follows a certain style, which is based on functional words. By using these functional words and their frequencies, identification of the author is possible [Madigan et al, 2005].

Mohamed Abdul Karim
College of Applied Sciences, Sohar, Ministry of Higher Education, Oman

2. RELATED WORK

By and large, research has focused on different aspects of text. Two properties of a text are used in classification: the content of the text and the style of its author. Stylometry [Goodman 2007] is the statistical analysis of literary style. It complements traditional literary scholarship since it offers a means of capturing the often elusive character of an author's style [Zheng 2006] by quantifying some of its features. Most stylometry studies [Pavelec et al. 2007] [Diederich and Chen 2008] employ items of language, and most of these are lexically based. The usefulness of function words in authorship attribution has been examined [Diederich et al. 2003]. Experiments were conducted with support vector machine classifiers on twenty novels, and success rates above 90% were obtained. The use of functional words is a valid and good approach to the attribution of authorship [Koppel 2006].
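
As an illustration of how functional words and their frequencies can serve as an author profile, the following minimal sketch computes the relative frequency of each word in a small list. The list FUNCTION_WORDS and the function name function_word_profile are hypothetical and chosen only for this example; they are not taken from any of the cited studies.

```python
import re
from collections import Counter

# Hypothetical, illustrative function-word list (an assumption for this
# sketch; the cited studies use their own, much larger lists).
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "that", "but", "with")


def function_word_profile(text, function_words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`.

    The result is a list aligned with `function_words`; each entry is the
    word's count divided by the total number of tokens in the text.
    """
    # Lowercase the text and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return [0.0] * len(function_words)
    counts = Counter(tokens)
    return [counts[word] / total for word in function_words]


profile = function_word_profile("The cat and the dog")
```

Profiles computed this way for two texts can be compared entry by entry; texts by the same author are expected to give similar vectors, which is the intuition behind the function-word approaches cited above.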
Success rates of 65% and 72% were measured in a study of authorship recognition that implemented multiple regression and discriminant analysis [Stamatatos et al, 2000]. Concurrently, experiments conducted with support vector classifiers [Diederich et al. 2003] detected authors with 60-80% success rates using different parameters. The effect of word sequences on authorship attribution [Abbasi 2005] has also been studied. The researchers aimed to consider both stylistic and topic features of texts. In that work, documents are identified by the set of word sequences that combine functional and content words. The experiments were conducted on a dataset consisting of poems using a naïve Bayes classifier [Peng et al, 2004].

Later authorship studies [Farkhund Iqbal 2010] use lexical, syntactic, structural and content-specific features. Lexical features are used to learn about an individual's preferred use of isolated characters and words. Word-based features, including word length distribution, words per sentence, and vocabulary richness, were very effective.

3. APPROACH OF INFORMATION EXTRACTION

Different types of words are used for filtering and as templates. Words indicating work and action, together with different categories of prepositions, pronouns, adjectives, adverbs, conjunctions and interjections, are listed in Table 1. When an email is analyzed for uniqueness, the extracted features are categorized according to this list of words. Hence, unnecessary words are eliminated and the number of unique words that represent an email is kept minimal.
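
The filtering step described above can be sketched as follows. This is a minimal sketch of one possible reading, not the authors' implementation: TEMPLATE_WORDS is a hypothetical stand-in for the word list of Table 1, the function name extract_unique_words is invented for this example, and the length bounds of three to seven characters are taken from the abstract.

```python
import re

# Hypothetical stand-in for the categorized word list of Table 1
# (an assumption for this sketch; the real list is not reproduced here).
TEMPLATE_WORDS = {
    "about", "after", "because", "please",
    "send", "should", "under", "would",
}


def extract_unique_words(email_body, template_words=TEMPLATE_WORDS,
                         min_len=3, max_len=7):
    """Return the sorted unique words of an email that pass both filters.

    A word is kept only if its length lies in [min_len, max_len] and it
    appears in the template list; everything else is discarded.
    """
    tokens = re.findall(r"[a-z]+", email_body.lower())
    kept = {
        token for token in tokens
        if min_len <= len(token) <= max_len and token in template_words
    }
    return sorted(kept)


words = extract_unique_words(
    "Please send the report after lunch. Please send it."
)
```

Because the result is a set of distinct words, repeated occurrences collapse to a single entry, so each email is represented by a small vocabulary rather than its full text.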