User Models for Email Activity Management

Mark Dredze
Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
mdredze@cis.upenn.edu

Hanna M. Wallach
Department of Computer Science, University of Massachusetts, Amherst, Amherst, MA, USA
wallach@cs.umass.edu

INTRODUCTION

A single user activity, such as planning a conference trip, typically involves multiple actions. Although these actions may involve several applications, the central point of coordination for any particular activity is usually email. Previous work on email activity management has focused on clustering emails by activity. Dredze et al. [3] accomplished this by combining supervised classifiers based on document similarity, authors and recipients, and thread information. In this paper, we take a different approach and present an unsupervised framework for email activity clustering. We use the same information sources as Dredze et al.—namely, document similarity, message recipients and authors, and thread information—but combine them to form an unsupervised, non-parametric Bayesian user model. This approach enables email activities to be inferred without any user input. Inferring activities from a user's mailbox adapts the model to that user. We next describe the statistical machinery that forms the basis of our user model, and explain how several email properties may be incorporated into the model. We evaluate this approach using the same data as Dredze et al., showing that our model does well at clustering emails by activity.

DIRICHLET PROCESS CLUSTERING OF EMAILS

Clustering emails by activity involves assigning n email messages d_1, ..., d_n to k activities a_1, ..., a_k. Each document is represented by a sparse vector, indicating the number of times each word in the vocabulary appears in that document. One way of modeling these data is to assume that each document was generated by a single activity-specific distribution over words.
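As a concrete illustration of this representation (our own minimal sketch, not the authors' preprocessing, using a deliberately naive whitespace tokenizer), a document's sparse count vector can be built as follows:

```python
from collections import Counter

def count_vector(text):
    """Sparse bag-of-words representation: map each word to the
    number of times it appears in the document. Words that do not
    occur are simply absent, keeping the vector sparse."""
    return Counter(text.lower().split())

email = count_vector("book the flight then book the hotel")
# email["book"] == 2, email["flight"] == 1
```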
This model naturally captures the notion that emails about the same activity will use similar words, while emails about different activities will use different words. A Dirichlet process mixture model (DPMM) provides an elegant way of formalizing this idea. Each document d_i is assumed to have been generated by first selecting an activity a_i = j for that document and then drawing words from the activity-specific distribution over words θ^(j). This process may be inverted using statistical inference techniques, allowing the unknown activity assignments a_i and activity-specific distributions over words θ^(j) to be inferred from a set of unlabeled documents. Furthermore, advance specification of the number of activities is not required—this is automatically determined from the data. A user-specific DPMM may be constructed by clustering the emails in the user's inbox into activities—future emails will be assigned to activities on the basis of this user-specific clustering. The latent activity assignments may be inferred using Gibbs sampling [4] as follows. The probability of assigning document d_i to activity j is

P(a_i = j | a_{-i}, d_1, ..., d_n) ∝ P(a_i = j | a_{-i}) P(d_i | d_{-i}, a_i = j, a_{-i}),   (1)

where d_{-i} denotes all documents excluding d_i and a_{-i} denotes the activity assignments for these documents. The first term, P(a_i = j | a_{-i}), is the prior probability of choosing activity j and is given by

P(a_i = j | a_{-i}) = N_j / (α + N_·)   if activity j already exists,
P(a_i = j | a_{-i}) = α / (α + N_·)    if activity j is new,   (2)

where N_j is the number of documents assigned to activity j (excluding d_i) and N_· = Σ_j N_j. The parameter α determines the rate at which new activities are created. A priori, document d_i is more likely to be assigned to an activity that already has many documents associated with it. The second term, P(d_i | d_{-i}, a_i = j, a_{-i}), is the probability that document d_i was generated by activity j, given all other activity assignments.
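A single collapsed Gibbs sampling step of this kind can be sketched in code. This is our own illustrative implementation of Equation 1, combining the prior of Equation 2 with the marginal word likelihood of Equation 3; all function and variable names are our own, and all counts are assumed to already exclude the document being sampled:

```python
import math
import random

def log_likelihood(doc_counts, word_counts_j, total_j, beta, W):
    """Log of Equation 3: Dirichlet-multinomial predictive probability
    of document d_i under activity j, with a symmetric Dirichlet prior
    (scaling parameter beta) over a W-word vocabulary.
    doc_counts maps word -> M_w; word_counts_j maps word -> N_w|j
    (excluding d_i); total_j is N_.|j."""
    M_total = sum(doc_counts.values())
    lp = math.lgamma(beta + total_j) - math.lgamma(beta + total_j + M_total)
    for w, m in doc_counts.items():
        n = word_counts_j.get(w, 0)
        # Terms for words with M_w = 0 cancel, so only d_i's words matter.
        lp += math.lgamma(beta / W + n + m) - math.lgamma(beta / W + n)
    return lp

def sample_activity(doc_counts, activities, alpha, beta, W):
    """One Gibbs step (Equation 1): sample an activity for d_i given the
    other assignments. activities maps activity id -> (word_counts,
    n_docs, n_words); returns an existing id or "new"."""
    N_total = sum(n_docs for _, n_docs, _ in activities.values())
    candidates, log_weights = [], []
    for j, (word_counts, n_docs, n_words) in activities.items():
        candidates.append(j)
        log_weights.append(
            math.log(n_docs / (alpha + N_total))   # prior, Equation 2
            + log_likelihood(doc_counts, word_counts, n_words, beta, W))
    candidates.append("new")                       # brand-new, empty activity
    log_weights.append(
        math.log(alpha / (alpha + N_total))
        + log_likelihood(doc_counts, {}, 0, beta, W))
    # Normalize in log space and draw a sample.
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    r = random.random() * sum(weights)
    for c, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return c
    return candidates[-1]
```

Sweeping `sample_activity` over every document in the mailbox, updating the count tables after each draw, yields the clustering; because the "new" option is always available, the number of activities grows or shrinks as the data demand.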
This enforces the requirement that documents containing similar words be assigned to the same activity. P(d_i | d_{-i}, a_i = j, a_{-i}) may be computed by marginalizing over all possible values of θ^(j) under a symmetric Dirichlet prior with scaling parameter β:

P(d_i | d_{-i}, a_i = j, a_{-i}) = ∫ dθ^(j) P(d_i | θ^(j)) P(θ^(j) | d_{-i}, a_{-i})
  = [Γ(β + N_·|j) / Γ(β + N_·|j + M_·)] Π_w [Γ(β/W + N_w|j + M_w) / Γ(β/W + N_w|j)],   (3)

where W is the size of the vocabulary, N_w|j is the number of times word w has been used in all the documents assigned to activity j (excluding d_i), M_w is the number of times w has been used in d_i, N_·|j = Σ_w N_w|j and M_· = Σ_w M_w.

GENERALIZING THE CLUSTERING PRIOR

Equation 2 is the distribution over activities under a Dirichlet process prior; however, there are other clustering priors that