Abstract: Email has become a fast and cheap means of online communication. The main threat to email is Unsolicited Bulk Email (UBE), commonly called spam email. The current work aims to identify, in more than 2700 UBE that advertise body-enhancement drugs, those unigrams that are neither present in the dictionary nor slang terms. The motives of the paper are manifold. It is an attempt to analyze spamming behaviour and the use of the word-mutation technique. Alongside this, we have attempted to better understand spam, slang and their interplay. The problem has been addressed by employing the tokenization technique and the unigram Bag of Words (BOW) model. We found that non-lexicon words constitute nearly 66% of the total number of lexis in the corpus, whereas non-slang words constitute nearly 2.4% of the non-lexicon words. Further, non-lexicon non-slang unigrams composed of 2 lexicon words form more than 71% of the total number of such unigrams. To the best of our knowledge, this is the first attempt to analyze the usage of non-lexicon non-slang unigrams in any kind of UBE.

Keywords: Body Enhancement, Lexicon, Medicinal, Slang, Unigram, Unsolicited Bulk e-mail (UBE)

I. INTRODUCTION

With the increase in the usage and availability of the Internet, there has been a tremendous increase in the usage of e-mail. It has proved to be an important medium of cheap and fast electronic communication. But the same qualities that have increased its popularity as a communication medium have also made it a source of non-personal, non-time-critical, multiple, similar and unsolicited messages received in bulk. This type of message is called Unsolicited Bulk e-mail (UBE) and is known by various other names such as Spam e-mail, Junk e-mail and Unsolicited Commercial e-mail (UCE). The spread of UBE has posed not only technical problems but also major socio-economic threats. Also, the definition of spam e-mail is ‘relative’ [4, 10, 20].
This means that not all e-mails going to the spam folder are spam for a given person, just as not all e-mails going to the inbox are ham (i.e. non-spam) e-mails. Further, not all spam e-mail is harmful; some is just annoying [2, 6, 16].

J. R. Saini is with the Sankalchand Patel College of Engineering, Visnagar, Mehsana, Gujarat, India, as Associate Professor and Head of the Department of Computer Science. He holds a PhD from Veer Narmad South Gujarat University, Surat, Gujarat, India (phone: +91-9426861815; e-mail: saini_expert@yahoo.com).
A. A. Desai is with the Veer Narmad South Gujarat University, Surat, Gujarat, India, as Professor and Head of the Department of Computer Science. He holds a PhD from Veer Narmad South Gujarat University, Surat, Gujarat, India (e-mail: desai_apu@hotmail.com).

UBE incidents range from fake job offers and viruses to pornography. Another area of concern is spam e-mails that advertise body enhancement medicinal products. The target areas of these products range from enhancement of male and female organs to losing or gaining weight, improving hair growth, increasing height and reducing blood sugar. The danger of these e-mails is that they demand a handsome amount of money for delivery of a product which is never delivered or, in the worst case, is delivered as a fake. But owing to fear of society and a feeling of embarrassment, the victim rarely comes forward to declare how he/she was cheated, through non-delivery or delivery of a fake product, against a heavy payment for a so-called body enhancement medicine. Further, this kind of UBE mostly targets medicines or drugs like Viagra, Xanax and Phentrimine for the genitals, and many times the advertising pharmacies include pictures and textual statements in the e-mails which are largely pornographic.
Even though such medicinal products, as advertised and offered in UBE, have many target areas, this paper refers to this kind of UBE in general as body enhancement medicinal UBE.

In the past, researchers have worked towards understanding spam in order to combat it [9, 12, 26]. We also believe that the first step in combating spam is to understand it, and that the best way of understanding spam is to analyze it. Most importantly, spam can be differentiated by content [23], and in this paper we target content-based analysis of unstructured UBE documents which advertise fake medicines for body enhancement. This work aims at identification of a specific type of lexis occurring in such UBE. The basic structure of a spam e-mail message is the same as that of a ham e-mail, consisting of ‘header’ and ‘body’ parts. In this paper, we treat spam e-mail as unstructured because, in addition to considering the contents of the structured ‘header’ part, we propose content analysis of the ‘body’ part as well. The structure of the ‘body’ part is not fixed with respect to number of words, lines, format, etc., and hence we treat UBE as an unstructured document.

From a technical perspective, identification of non-lexicon non-slang unigrams in UBE documents is a Text Parsing and Tokenization task, and we propose to solve it using the Bag of Words (BOW) and Vector Space Document Model approach. The lexicon we use for identification of lexicon words is an English language dictionary. Further, we do not use dictionaries of technical terms such as legal terms, medicinal terms, etc. The present work treats all those

Identification of Non-Lexicon Non-Slang Unigrams in Body-enhancement Medicinal UBE
Jatinderkumar R. Saini, Apurva A. Desai

World Academy of Science, Engineering and Technology
International Journal of Computer and Information Engineering Vol:5, No:8, 2011
International Scholarly and Scientific Research & Innovation 5(8) 2011 scholar.waset.org/1307-6892/15887
International Science Index, Computer and Information Engineering Vol:5, No:8, 2011 waset.org/Publication/15887
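The identification task outlined in the Introduction (tokenize each UBE document, build a unigram bag of words, and keep only the unigrams found in neither the dictionary nor a slang list) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the tiny `LEXICON` and `SLANG` sets stand in for the English dictionary and slang resource, the regular-expression tokenizer is one possible choice, and the helper names are hypothetical rather than taken from the paper. The last helper checks whether a unigram is two lexicon words joined together, the case the abstract reports as most frequent.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the paper uses an English dictionary and a slang
# resource; these tiny sets exist only to make the sketch runnable.
LEXICON = {"buy", "cheap", "pills", "online", "now", "best", "price",
           "at", "med", "shop"}
SLANG = {"lol", "gonna"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric unigrams."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Build a unigram bag of words (token -> count) over all documents."""
    bow = Counter()
    for doc in documents:
        bow.update(tokenize(doc))
    return bow


def non_lexicon_non_slang(bow, lexicon, slang):
    """Keep unigrams that are neither dictionary words nor slang terms."""
    return {tok: n for tok, n in bow.items()
            if tok not in lexicon and tok not in slang}


def split_into_two_lexicon_words(token, lexicon):
    """Return (left, right) if the token is two lexicon words joined, else None."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None


# Made-up example messages, not drawn from the paper's corpus.
emails = ["Buy cheappills online now lol",
          "Best price at medshop, gonna buy v1agra"]
bow = bag_of_words(emails)
candidates = non_lexicon_non_slang(bow, LEXICON, SLANG)
compounds = {t: split_into_two_lexicon_words(t, LEXICON) for t in candidates}
```

With these made-up inputs, `candidates` holds `cheappills`, `medshop` and `v1agra`; the first two split into two lexicon words, while `v1agra` does not and would be a candidate for character-level word mutation. The sketch keeps only raw counts and does not model the Vector Space Document representation mentioned in the text.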