E-mail categorization using partially related training examples

Maya Sappelli, TNO and Radboud University Nijmegen, m.sappelli@cs.ru.nl
Suzan Verberne, Radboud University Nijmegen, s.verberne@cs.ru.nl
Wessel Kraaij, TNO and Radboud University Nijmegen, kraaijw@acm.org

ABSTRACT
Automatic e-mail categorization with traditional classification methods requires labeling of training data. In a real-life setting, this labeling disturbs the user's workflow. We argue that it might be helpful to use documents, which are generally well-structured in directories on the file system, as training data for supervised e-mail categorization, and thereby reduce the labeling effort required from users. Previous work demonstrated that the characteristics of documents and e-mail messages are too different to use organized documents as training examples for e-mail categorization with traditional supervised classification methods.

In this paper we present a novel network-based algorithm that is capable of taking these differences between documents and e-mails into account. With the network algorithm, it is possible to use documents as training material for e-mail categorization without user intervention. This way, the effort required from users to label training examples is reduced, while the organization of their information flow is still improved.

The accuracy of the algorithm on categorizing e-mail messages was evaluated using a set of e-mail correspondence related to the documents. The proposed network method was significantly better than traditional text classification algorithms in this setting.
Categories and Subject Descriptors
I.5 [Pattern Recognition]: Design Methodology; I.6.4 [Model validation and analysis]; H.1.2 [User/Machine Systems]: Human factors

General Terms
Design, Performance, Human Factors

Keywords
E-mail classification, categorization, transductive transfer learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IIiX '14, August 26-29, 2014, Regensburg, Germany.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
Copyright 2014 ACM 978-1-4503-2976-7/14/08 $15.00.
http://dx.doi.org/10.1145/2637002.2637014

1. INTRODUCTION
The life of knowledge workers is changing rapidly. With the arrival of mobile internet, smart phones and the corresponding "any place any time information", it becomes increasingly hard to balance work life and personal life. Additionally, knowledge workers need to be able to handle large amounts of data. Receiving more than 70 new corporate e-mail messages a day is not uncommon [17], so an effective personal information management system is required to organize and re-find these messages. For this purpose 'working in context' is deemed beneficial [8, 24]. Assisting knowledge workers with 'working in context' is one of the goals of the SWELL project¹, for which this research was carried out.

One application area of interest is the e-mail domain. Associating e-mail messages with their contexts has two benefits: 1) it can help knowledge workers re-find their messages more easily, and 2) reading messages context-wise, for example by project, is more efficient since the number of context switches is minimized.
This latter aspect is a suggestion from the 'getting things done' management method [2]. Many e-mail programs have an option to categorize or file messages, which makes it possible to associate messages with, for example, a 'work-project' context. This categorization option, however, is often not used optimally, as messages are left to linger in the inbox [25] and many users do not use category folders at all [12]. Manually categorizing the messages is too big an effort for busy knowledge workers, which diminishes the actual benefits of the categorization.

Automated approaches for e-mail message classification are plentiful. The early work in e-mail classification was mostly directed towards detecting spam [18]. This was followed by work on categorizing e-mails in order to support personal information management [21, 4]. Nowadays, work on classifying e-mails is often directed towards predicting the action required for the message [7, 1, 20]. Of these, only automatic spam classification has become a commodity in e-mail handling. Categorization functionality within e-mail clients often relies on hand-crafted rules.

¹ www.swell-project.net

The downside of the methods based on machine learning is that each of them still requires labeled training data. Although this training dataset only needs to be a limited but representative part of all messages, it still requires effort from the knowledge worker, who would need to label these examples. Especially the persons that receive the most