Word Clustering using PLSA enhanced with Long Distance Bigrams

Nikoletta Bassiou and Constantine Kotropoulos
Department of Informatics, Aristotle University of Thessaloniki
Box 451, Thessaloniki 541 24, GREECE
{nbassiou, costas}@aiia.csd.auth.gr

Abstract

Probabilistic latent semantic analysis is enhanced with long distance bigram models in order to improve word clustering. The long distance bigram probabilities and the interpolated long distance bigram probabilities at varying distances within a context capture different aspects of contextual information. In addition, the baseline bigram, which incorporates trigger pairs for various histories, is tested in the same framework. The experimental results collected on publicly available corpora (CISI, Cranfield, Medline, and NPL) demonstrate the superiority of the long distance bigrams over the baseline bigrams, as well as the superiority of the interpolated long distance bigrams over both the long distance bigrams and the baseline bigram with trigger pairs, in yielding more compact clusters containing fewer outliers.

1 Introduction

Word clustering is one of the most challenging tasks in natural language processing [5]. In this paper, word clustering based on Probabilistic Latent Semantic Analysis (PLSA) [3] is proposed that takes into consideration long distance bigram probabilities at varying distances within a context, as well as their interpolated variants and the probabilities of the baseline bigram with trigger pairs for varying histories. The partition entropy coefficient of the derived clusterings reveals the superiority of the interpolated long distance bigrams over the long distance bigrams and the bigrams with trigger pairs in producing crisper clusters.
In addition, the intra-cluster dispersion demonstrates that the use of interpolated long distance bigrams generates meaningful clusters, similar to those formed when the bigram model is interpolated with trigger word pairs for various histories, while eliminating the cluster outliers observed when long distance bigrams are used. However, clustering with trigger pairs assigns similar words to more than one cluster, and it requires appropriate trigger-pair selection, which is not an easy task.

2 Language Modeling and the PLSA

The n-gram model estimates the probability of a word given only the most recent $n-1$ preceding words [2]. Frequently, only the bigram or the trigram model is employed. For long distance bigrams [4], a word $w_i$ is predicted by the $d$-th preceding word $w_{i-d}$. Obviously, for $d = 1$ the long distance bigram degenerates to the baseline bigram. The efficiency of the long distance bigram model can be further enhanced by estimating the probability of long distance bigrams at $H$ different distances [7].

The PLSA performs a probabilistic mixture decomposition by defining a generative latent data model, the so-called aspect model, which associates an unobserved class variable $z_k \in Z = \{z_1, z_2, \ldots, z_R\}$ with each observation. Here, the observation is simply the occurrence of a word $w_j \in V = \{w_1, w_2, \ldots, w_Q\}$ in a text/document $t_i \in T = \{t_1, t_2, \ldots, t_M\}$, while the unobserved class variable $z_k$ models the topic a text was generated from. Summing over all possible realizations of $z_k$, the joint distribution of the observed data is obtained:

$$P(t_i, w_j) = P(t_i) \underbrace{\sum_{k=1}^{R} P(z_k \mid t_i)\, P(w_j \mid z_k)}_{P(w_j \mid t_i)}. \quad (1)$$

As can be seen in (1), the text-specific word distributions $P(w_j \mid t_i)$ are obtained by a convex combination of the $R$ aspects/factors $P(w_j \mid z_k)$. Representing each text $t_i$ as a sequence of words $\langle v_1\, v_2 \ldots v_{Q_i} \rangle$, where $Q_i$ is the number of words in text $t_i$, $P(t_i, w_j)$ can be decomposed as follows:

$$P(t_i, w_j) = P(v_{Q_i} \mid v_{Q_i - 1} \ldots v_1, w_j)\, P(v_{Q_i - 1} \mid v_{Q_i - 2} \ldots v_1, w_j) \cdots P(v_1 \mid w_j)\, P(w_j). \quad (2)$$

2010 International Conference on Pattern Recognition 1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.1027 4210
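The long distance bigram and its interpolated variant described above can be sketched as follows. This is an illustrative maximum-likelihood estimate only, not the paper's implementation: the function names are hypothetical, no smoothing is applied, and uniform interpolation weights over the $H$ distances are assumed for simplicity.

```python
from collections import Counter

def distance_bigram_counts(tokens, d):
    """Count pairs (w_{i-d}, w_i): the d-th preceding word predicts w_i."""
    pairs, history = Counter(), Counter()
    for i in range(d, len(tokens)):
        pairs[(tokens[i - d], tokens[i])] += 1
        history[tokens[i - d]] += 1
    return pairs, history

def distance_bigram_prob(tokens, d, h, w):
    """Maximum-likelihood estimate P_d(w | h) of the distance-d bigram."""
    pairs, history = distance_bigram_counts(tokens, d)
    if history[h] == 0:
        return 0.0
    return pairs[(h, w)] / history[h]

def interpolated_prob(tokens, H, i, lambdas=None):
    """Interpolate distance-d bigram probabilities for d = 1..H.

    `lambdas` are interpolation weights summing to one; uniform
    weights are assumed here for illustration.
    """
    if lambdas is None:
        lambdas = [1.0 / H] * H
    p = 0.0
    for d in range(1, H + 1):
        if i - d >= 0:
            p += lambdas[d - 1] * distance_bigram_prob(
                tokens, d, tokens[i - d], tokens[i])
    return p

text = "the cat sat on the mat the cat ate the mat".split()
p1 = distance_bigram_prob(text, 1, "the", "cat")  # baseline bigram (d = 1): 0.5
p2 = distance_bigram_prob(text, 2, "the", "sat")  # distance-2 bigram: 1/3
```

For $d = 1$ the counts reduce to ordinary adjacent-word bigrams, matching the degenerate case noted in the text.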
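The aspect model of (1) is conventionally fitted with the EM algorithm of Hofmann [3]; the following sketch illustrates that standard EM fit on a text-by-word count matrix. It is a minimal illustration, not the paper's code: plain EM without tempering, with a fixed random initialization.

```python
import numpy as np

def plsa(N, R, iters=50, seed=0):
    """Fit the PLSA aspect model of Eq. (1) by plain EM.

    N is an M x Q text-by-word count matrix; R is the number of latent
    aspects z_k.  Returns P(z|t) (M x R) and P(w|z) (R x Q).
    """
    rng = np.random.default_rng(seed)
    M, Q = N.shape
    Pz_t = rng.random((M, R)); Pz_t /= Pz_t.sum(1, keepdims=True)
    Pw_z = rng.random((R, Q)); Pw_z /= Pw_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: posterior P(z | t, w) proportional to P(z|t) P(w|z)
        joint = Pz_t[:, :, None] * Pw_z[None, :, :]        # M x R x Q
        post = joint / joint.sum(1, keepdims=True).clip(1e-12)
        # M-step: re-estimate from expected counts N(t, w) P(z | t, w)
        exp_counts = N[:, None, :] * post                  # M x R x Q
        Pw_z = exp_counts.sum(0)
        Pw_z /= Pw_z.sum(1, keepdims=True).clip(1e-12)
        Pz_t = exp_counts.sum(2)
        Pz_t /= Pz_t.sum(1, keepdims=True).clip(1e-12)
    return Pz_t, Pw_z
```

After fitting, the text-specific word distributions in (1) are recovered as the convex combination `Pz_t @ Pw_z`, whose rows each sum to one; word clusters can then be read off the factors $P(w_j \mid z_k)$.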