International Journal of Engineering Research and Development
ISSN: 2278-067X, Volume 1, Issue 9 (June 2012), PP. 09-14
www.ijerd.com

Multi Label Text Classification through Label Propagation

Shweta C. Dharmadhikari 1, Maya Ingle 2, Parag Kulkarni 3
1,3 Pune Institute of Computer Technology - EkLat Solutions, Pune, Maharashtra, India
2 Devi Ahilya Vishwa Vidyalaya, Indore, Madhya Pradesh, India

Abstract: Classifying text data has been an active area of research for a long time. A text document is a multifaceted object and is often inherently ambiguous. Multi-label learning deals with such ambiguous objects. Their ambiguity makes it difficult for a classifier to assign the relevant classes to an input document. Traditional single-label and multi-class text classification paradigms cannot efficiently classify such a multifaceted text corpus. In this paper we propose a novel label propagation approach, based on semi-supervised learning, for multi-label text classification. The proposed approach models the relationship between class labels and also represents the input text documents effectively. We use a semi-supervised learning technique so that both labeled and unlabeled data are utilized for classification. The proposed approach promises better classification accuracy and better handling of complexity, and is demonstrated on standard datasets such as Enron, Slashdot and Bibtex.

Keywords: Label propagation, semi-supervised learning, multi-label text classification.

I. INTRODUCTION

The amount of textual data being produced through the internet is growing faster than the ability of information consumers to search, digest and use it. Textual data is difficult to understand and categorize effectively because the relationship between its sequence of words and its content is less clear than it is for numerical data.
Such data includes technical articles, memos, manuals, electronic mail, books, online newspapers, journal articles and many other forms of text. Text classification has therefore become an active research topic. It assigns a document to a predefined category. Categories may be represented numerically, by a single word or phrase, by words with senses, and so on. In the traditional approach, text was classified manually by domain experts: a human expert had to read each input document and sort it into a predefined category or set of categories. This approach requires extensive human effort and is also error prone, which motivates automated text classification. Automated text document classification eases the storage, searching and retrieval of relevant text documents, or of their contents, for the applications that need them.

Three different paradigms exist under text classification: single label (binary), multi-class and multi-label. Under single label, a new text document belongs to exactly one of two given classes; in the multi-class case, a new text document belongs to just one class out of a set of m classes; and under the multi-label scheme, each document may belong to several classes simultaneously [3]. In practice, many approaches exist or have been proposed for the binary and multi-class cases, even though in many applications text documents are inherently multi-label in nature. E.g., in medical diagnosis a report describing a set of symptoms can belong to many probable disease categories. The multi-label text classification problem refers to the scenario in which a text document can be assigned to more than one class simultaneously during classification. E.g., when classifying online news articles, news stories about the scams surrounding the Commonwealth Games in India can belong to classes such as sports, politics and country-India.
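The difference between the three paradigms can be seen in how the target of a single document is encoded. The following is a minimal sketch in Python; the class list is taken from the news example above, while the variable names and the helper `to_indicator` are hypothetical and introduced only for illustration.

```python
# Classes from the news example; the ordering is an arbitrary assumption.
classes = ["sports", "politics", "country-india"]

# Single label (binary): the document is in exactly one of two classes,
# so one 0/1 value is enough (here: "sports" vs. "not sports").
binary_target = 1

# Multi-class: the document is in exactly one of m classes,
# so the target is a single class index.
multiclass_target = classes.index("sports")

def to_indicator(doc_labels, classes):
    """Multi-label: encode any subset of classes as a 0/1 indicator vector."""
    return [1 if c in doc_labels else 0 for c in classes]

# The news story from the example carries all three labels at once.
news_story_target = to_indicator({"sports", "politics", "country-india"}, classes)
print(binary_target, multiclass_target, news_story_target)
```

The indicator-vector form is the one that matches multi-label text classification, where a document may carry any subset of the classes.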
Multi-label text classification has attracted significant attention from researchers because it plays a crucial role in many applications such as web page classification, classification of news articles and information retrieval. Supervised machine learning methods are generally used to realize multi-label text classification. Because supervised methods always need labeled data, semi-supervised methods are now increasingly used in multi-label text classifiers. Many approaches are available for implementing a multi-label text classifier. In this paper we propose a label propagation approach for multi-label text classification; it uses existing label information to identify the labels of unlabeled documents. We represent the input text document corpus as a graph in order to exploit the ambiguity among different text documents. This ambiguity is expressed through similarity measures, which serve as the weights of the edges between text documents. Within the semi-supervised learning setting we focus not only on graph construction but also on sparsification and weighting of the graph to improve the classifier's accuracy. We apply the proposed framework to standard datasets such as Enron, Bibtex and Slashdot.

The rest of the paper is organized as follows. Section 2 describes the literature related to semi-supervised learning methods for multi-label text classification; Section 3 highlights the mathematical modeling of our approach.
Section 4 describes our proposed label propagation approach for building a multi-label text classifier, followed by experiments and results in Section 5 and a conclusion in the last section.
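The graph-based pipeline outlined in this introduction (a similarity graph over documents, sparsification, edge weighting, and propagation of known labels to unlabeled documents) can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation developed in the later sections: cosine similarity over term-count vectors, k-nearest-neighbour sparsification and iterative propagation with clamped labels are generic choices, and all function names, parameters and the toy data are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Edge weights: cosine similarity between document vectors (assumed measure)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty documents
    Xn = X / norms
    return Xn @ Xn.T

def knn_sparsify(W, k):
    """Sparsification: keep each document's k strongest edges, symmetrized."""
    n = W.shape[0]
    W = W.copy()
    np.fill_diagonal(W, 0.0)         # no self-loops
    idx = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.repeat(np.arange(n), k), idx.ravel()] = True
    mask = mask | mask.T             # keep an edge if either endpoint selected it
    return np.where(mask, W, 0.0)

def propagate_labels(W, Y, labeled, n_iter=50):
    """Spread label scores along edges; labeled rows are reset each iteration."""
    degree = W.sum(axis=1, keepdims=True)
    degree[degree == 0] = 1.0        # isolated documents keep zero scores
    P = W / degree                   # row-normalized transition matrix
    F = Y.astype(float).copy()
    F[~labeled] = 0.0                # unlabeled documents start with no label mass
    for _ in range(n_iter):
        F = P @ F
        F[labeled] = Y[labeled]      # clamp the known labels
    return F

# Toy corpus (hypothetical): term counts over two terms, labels (sports, politics).
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
Y = np.array([[1, 0], [0, 1], [0, 0]])
labeled = np.array([True, True, False])

W = knn_sparsify(cosine_similarity_matrix(X), k=2)
scores = propagate_labels(W, Y, labeled)
print(scores)                        # per-document score for each label
```

Each row of `scores` holds one score per label, so thresholding a row (for instance at a value chosen on held-out data) yields a set of labels rather than a single class, which is what makes the scheme multi-label.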