Structural Poisson Mixtures for Classification of Documents

Jiří Grim, Jana Novovičová, Petr Somol
Institute of Information Theory and Automation
P.O.BOX 18, CZ-18208 Prague 8, Czech Republic
grim@utia.cas.cz, novovic@utia.cas.cz, somol@utia.cas.cz

Abstract

Considering the statistical text classification problem, we approximate class-conditional probability distributions by structurally modified Poisson mixtures. By introducing the structural model we can use different subsets of input variables to evaluate the conditional probabilities of different classes in the Bayes formula. The method is applicable to document vectors of arbitrary dimension without any preprocessing. The structural optimization can be included in the EM algorithm in a statistically correct way.

1. Introduction

Text classification, the problem of automatically sorting documents into predefined classes, is important in many information retrieval tasks. Various statistical and machine learning techniques have been explored to build a classifier automatically by learning from previously labeled documents. For a discussion of different approaches see, e.g., Sebastiani [8].

We consider classification of text documents in a Bayesian learning framework with a bag-of-words document representation. There are two common models for the representation of text documents (see, e.g., [7], [6]). The multivariate Bernoulli model represents each document by a vector of binary feature variables indicating whether or not a certain word occurs in the document. Alternatively, in the multinomial model, the features are defined as the frequencies of the related vocabulary terms in the document. In both cases the dimension of the document vectors is very high because of the large number of vocabulary terms, and therefore different feature selection methods have to be used as a rule [1].
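The difference between the two representations can be made concrete with a short sketch; the function names and the toy vocabulary below are illustrative, not taken from the paper:

```python
from collections import Counter

def bernoulli_features(doc_terms, vocabulary):
    # Multivariate Bernoulli model: a binary indicator per vocabulary term,
    # recording only whether the term occurs in the document.
    present = set(doc_terms)
    return [1 if t in present else 0 for t in vocabulary]

def multinomial_features(doc_terms, vocabulary):
    # Multinomial model: the frequency of each vocabulary term in the document.
    counts = Counter(doc_terms)
    return [counts[t] for t in vocabulary]

vocab = ["poisson", "mixture", "bayes", "text"]
doc = ["text", "mixture", "text", "bayes"]
print(bernoulli_features(doc, vocab))    # [0, 1, 1, 1]
print(multinomial_features(doc, vocab))  # [0, 1, 1, 2]
```

Both vectors have the vocabulary size as their dimension, which is why feature selection is usually unavoidable in either model.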
Unfortunately, it is difficult to reduce the size of the vocabulary, since the many different classes have different subsets of characteristic terms. An informative subset of features common to all classes therefore often represents a difficult compromise, possibly connected with a loss of classification accuracy.

In this paper we propose the use of a structural mixture of multivariate Poisson distributions to learn a Bayesian text classifier. By introducing binary structural parameters we can restrict the evaluation of the Bayes formula to subsets of informative variables, which may be different for different classes and even for different mixture components. In this way we can reduce the number of parameters in the conditional distributions without reducing the number of vocabulary terms.

The paper is organized as follows. In Section 2 we describe the problem of statistical text classification, Section 3 introduces the structural Poisson mixture model, and Section 4 describes the corresponding modified EM algorithm. In Section 5 we describe the computational experiments, and finally we summarize the results in the Conclusion.

2. Statistical Document Classification

We assume that after standard preprocessing a text document d is reduced to a finite list of terms from a given vocabulary V:

d = (w_{i_1}, \ldots, w_{i_k}), \quad w_{i_l} \in V = \{t_1, \ldots, t_N\}.   (1)

The vocabulary is chosen to characterize the semantic meaning of documents by a limited number of highly informative specific terms. For the sake of classification we ignore common, short, or rare words and disregard the position of words in the original document. A document is treated as a "bag of words": only the frequency of vocabulary terms is considered. In this sense, denoting x_n the frequency of the term t_n \in V, we describe a document (1) by the N-dimensional vector of integers

x = x(d) = (x_1, x_2, \ldots, x_N) \in \mathcal{X} = \mathbb{N}_0^N.   (2)
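To make the role of the Bayes formula concrete, the following sketch classifies a frequency vector x under a deliberately simplified assumption: a single Poisson component per class with independent term frequencies. The structural mixtures proposed in the paper generalize this by combining several components and selecting informative term subsets via binary structural parameters. All names, priors, and Poisson means below are illustrative:

```python
import math

def poisson_log_pmf(x, lam):
    # log P(X = x) for a Poisson distribution with mean lam
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def classify(x, class_priors, class_means):
    # Bayes decision: pick the class c maximizing
    # log p(c) + sum_n log Poisson(x_n; lambda_{c,n}).
    best, best_score = None, -math.inf
    for c, prior in class_priors.items():
        score = math.log(prior) + sum(
            poisson_log_pmf(xn, lam) for xn, lam in zip(x, class_means[c]))
        if score > best_score:
            best, best_score = c, score
    return best

priors = {"sports": 0.5, "science": 0.5}
means = {"sports": [3.0, 0.5], "science": [0.5, 3.0]}  # per-term Poisson means
print(classify([4, 0], priors, means))  # sports
print(classify([0, 5], priors, means))  # science
```

In this simplified form every term contributes to every class score; the structural parameters of the proposed model would effectively drop uninformative terms from the sum, on a per-class and per-component basis.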
In the following we denote by |x| the length of the document x, which may correspond to the total number of words in