IACSIT International Journal of Engineering and Technology, Vol.3, No.2, April 2011

Abstract: The continuing explosive growth of textual content within the World Wide Web has given rise to the need for sophisticated Text Classification (TC) techniques that combine efficiency with high quality of results. E-mail filtering and organization is an application area with considerable potential to streamline the management of the vast amount of information that accumulates in the inbox. Even though a large body of research has delved into this area, there is a paucity of surveys that indicate trends and directions. This paper attempts to categorize the prevalent techniques for classifying email as spam or legitimate and suggests possible techniques to fill the gaps in the automatic management of emails. Our findings suggest that context-based email organization has the most potential for improving quality, by learning various contexts such as n-gram phrases, linguistic constructs or context based on a user’s profile in order to tailor that user’s filtering scheme.

Index Terms: Context Based TC, Context Interpretation, E-mail Management, Statistical TC

I. INTRODUCTION

Text Classification (TC) is the task of automatically sorting a set of documents into categories, such as topics, from a predefined set. The task falls at the crossroads of Information Retrieval (IR) and Machine Learning (ML). It has witnessed a booming interest in the last ten years from researchers and developers alike due to its ever-expanding horizon of applications such as document classification, text summarization, essay scoring and user-specific presentation of textual material [1].

E-mail affects every user of the Internet. However, emails also flood the inbox quickly, leading to a morass of unorganized information. Even though many email providers allow the creation of folders and sub-folders where emails can be routed based on sender’s address, date, subject, etc.,
the whole process is largely manual. There is an urgent need for automatically segregating emails based on their relevance to the user.

As a basic need, spam filtering classifies messages into two categories, viz. spam and non-spam. Besides being undesired, spam email consumes a lot of network bandwidth. Spam filtering is not a typical TC application. Over time, spammers resort to deceptive and deluging methods to get around antispam software, thereby leading to a gradual degeneration of the filter’s efficacy. To counter this, innovative TC approaches with good generalization, continuous adaptive learning and context sensitivity need to be applied. Extending this concept to the general case of filtering emails into several categories based on their relevance to the user, we can investigate TC approaches for personalized management of all emails.

Upasana Pandey (e-mail: upasana1978@gmail.com) and S. Chakraverty (e-mail: apmahs@rediffmail.com) are with the Division of Computer Engineering, Netaji Subhas Inst. of Technology, New Delhi-110078.

Predominantly, statistical approaches have been applied for text classification. These approaches are based on word occurrences, i.e. the frequency of one or more words in a given document. Several algorithms based on this method have been reported and have given good results in web applications [2-4, 6-15]. An alternative approach is context-based text classification, which takes into account how a word w1 influences the occurrence of another word w2 in the document. Thus, the presence or absence of w1 affects a classification based on w2. Even though some recent papers [16, 19] have reported techniques and algorithms for finding relevance among words, significant work has not yet been carried out in the field of context-based text classification for email applications. In this paper we present a survey of statistical as well as some recent context-based approaches for TC, with a focus on spam filtering and email applications.
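To make the distinction between the two families concrete, the following is a minimal sketch and is not taken from any of the surveyed papers. It assumes whitespace-tokenized messages; the function names `term_frequencies` and `cooccurrence_prob` and the sample messages are hypothetical. The first function computes the per-document word counts that statistical approaches rely on. The second gives one crude, document-level estimate of how the presence of w1 influences the occurrence of w2.

```python
from collections import Counter


def term_frequencies(document):
    """Statistical view: count how often each word occurs in one document."""
    return Counter(document.lower().split())


def cooccurrence_prob(documents, w1, w2):
    """Context view: fraction of documents containing w1 that also contain w2.

    This is a crude document-level estimate of P(w2 present | w1 present),
    used here only for illustration.
    """
    token_sets = [set(doc.lower().split()) for doc in documents]
    with_w1 = [tokens for tokens in token_sets if w1 in tokens]
    if not with_w1:
        return 0.0  # w1 never occurs, so there is no evidence either way
    return sum(w2 in tokens for tokens in with_w1) / len(with_w1)


# Hypothetical toy messages, invented for this example.
messages = [
    "free offer click now",
    "free lunch at the office",
    "click here for a free offer",
    "meeting notes from the office",
]

print(term_frequencies(messages[2]))
print(cooccurrence_prob(messages, "free", "offer"))
```

In this toy corpus, "offer" appears in two of the three messages that contain "free", so the estimate is 2/3. A purely frequency-based classifier would see only the individual word counts, whereas a context-based one could exploit such dependencies between words.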
Performance Measures: The following parameters are important performance indices for spam filtering. A false positive is a result that classifies a legitimate email as spam. A false negative is a result that classifies a spam email as legitimate. A false-positive error, which diverts a legitimate email as spam, is generally considered more serious than a false negative. Now, out of all the spam emails, let a of them be categorized correctly as spam (true positives) and the remaining b be categorized as legitimate (false negatives). Likewise, out of all legitimate emails, let c of them be erroneously categorized as spam (false positives) and the remaining d be categorized as legitimate (true negatives). Let N be the sum total of a, b, c and d. The following scores are defined:

1. Spam recall: $\mathrm{RE} = a/(a+b)$
2. Spam precision: $\mathrm{PR} = a/(a+c)$
3. F1-score: $F_1 = 2\,(1/\mathrm{RE} + 1/\mathrm{PR})^{-1}$, i.e. the harmonic mean of recall and precision
4. Accuracy: $\mathrm{AC} = (a+d)/N$
5. Error: $\mathrm{ER} = (b+c)/N$

Macroaveraged results use a simple average of the above scores across all categories, giving equal weight to each category. Microaveraged results, on the other hand, first aggregate the a, b, c, d and N values for all categories and then compute the above scores. Since all documents are given equal weight, this results in more frequently occurring categories being given greater weight.

A Review of Text Classification Approaches for E-mail Management
Upasana Pandey and S. Chakraverty

II. STATISTICAL APPROACHES

A. Naïve Bayes Classifier

A naïve Bayes classifier applies Bayesian statistics with strong independence assumptions on the features that drive