E-mail categorization using partially related training examples

Maya Sappelli, TNO and Radboud University Nijmegen, m.sappelli@cs.ru.nl
Suzan Verberne, Radboud University Nijmegen, s.verberne@cs.ru.nl
Wessel Kraaij, TNO and Radboud University Nijmegen, kraaijw@acm.org

ABSTRACT
Automatic e-mail categorization with traditional classification methods requires labeling of training data. In a real-life setting, this labeling disrupts the workflow of the user. We argue that it might be helpful to use documents, which are generally well organized in directories on the file system, as training data for supervised e-mail categorization, thereby reducing the labeling effort required from users. Previous work demonstrated that the characteristics of documents and e-mail messages are too different to use organized documents as training examples for e-mail categorization with traditional supervised classification methods. In this paper we present a novel network-based algorithm that is capable of taking these differences between documents and e-mails into account. With the network algorithm, it is possible to use documents as training material for e-mail categorization without user intervention. This way, the effort required from users to label training examples is reduced, while the organization of their information flow is still improved.

The accuracy of the algorithm on categorizing e-mail messages was evaluated using a set of e-mail correspondence related to the documents. The proposed network method performed significantly better than traditional text classification algorithms in this setting.
Categories and Subject Descriptors
I.5 [Pattern Recognition]: Design Methodology; I.6.4 [Model validation and analysis]; H.1.2 [User/Machine Systems]: Human factors

General Terms
Design, Performance, Human Factors

Keywords
E-mail classification, categorization, transductive transfer learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IIiX ’14, August 26-29, 2014, Regensburg, Germany.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
Copyright 2014 ACM 978-1-4503-2976-7/14/08 $15.00.
http://dx.doi.org/10.1145/2637002.2637014

1. INTRODUCTION
The life of knowledge workers is changing rapidly. With the arrival of mobile internet, smartphones and the corresponding “any place, any time information”, it becomes increasingly hard to balance work life and personal life. Additionally, knowledge workers need to be able to handle large amounts of data. Receiving more than 70 new corporate e-mail messages a day is not uncommon [17], so an effective personal information management system is required to organize and re-find these messages. For this purpose, ‘working in context’ is deemed beneficial [8, 24]. Assisting knowledge workers with ‘working in context’ is one of the goals of the SWELL project¹, for which this research was carried out.

One application area of interest is the e-mail domain. Associating e-mail messages with their contexts has two benefits: 1) it can help knowledge workers find their messages again more easily and 2) reading messages context-wise, for example by project, is more efficient since the number of context switches is minimized.
This latter aspect is a suggestion from the ‘getting things done’ management method [2].

Many e-mail programs have an option to categorize or file messages, which makes it possible to associate messages with, for example, a ‘work-project’ context. This categorization option, however, is often not used optimally, as messages are left to linger in the inbox [25] and many users do not use category folders at all [12]. Manually categorizing the messages takes too much effort for busy knowledge workers, which diminishes the actual benefits of the categorization.

Automated approaches for e-mail message classification are plentiful. The early work in e-mail classification was mostly directed towards detecting spam [18]. This was followed by work towards categorizing e-mails in order to support personal information management [21, 4]. Nowadays, work on classifying e-mails is often directed towards predicting the action required for the message [7, 1, 20]. Only automatic spam classification has become a commodity in e-mail handling. Categorization functionality within e-mail clients often relies on hand-crafted rules.

¹ www.swell-project.net

The downside of the methods based on machine learning is that each of them still requires labeled training data. Although this training dataset only needs to be a limited but representative part of all messages, it still requires effort from the knowledge worker, who would need to label these examples. Especially the persons that receive the most