Word Clustering using PLSA enhanced with Long Distance Bigrams
Nikoletta Bassiou and Constantine Kotropoulos
Department of Informatics, Aristotle University of Thessaloniki
Box 451, Thessaloniki 541 24, GREECE
{nbassiou, costas}@aiia.csd.auth.gr
Abstract
Probabilistic latent semantic analysis is enhanced
with long distance bigram models in order to improve
word clustering. The long distance bigram probabilities
and the interpolated long distance bigram probabilities
at varying distances within a context capture different
aspects of contextual information. In addition, the base-
line bigram, which incorporates trigger-pairs for vari-
ous histories, is tested in the same framework. The ex-
perimental results collected on publicly available cor-
pora (CISI, Cranfield, Medline, and NPL) demonstrate
the superiority of the long distance bigrams over the
baseline bigrams as well as the superiority of the inter-
polated long distance bigrams against the long distance
bigrams and the baseline bigram with trigger-pairs in
yielding more compact clusters containing fewer outliers.
1 Introduction
Word clustering is one of the most challenging tasks
in natural language processing [5]. In this paper, word
clustering based on the Probabilistic Latent Semantic
Analysis (PLSA) [3] is proposed that takes into con-
sideration long distance bigram probabilities at vary-
ing distances within a context as well as their interpo-
lated variants and the probabilities of the baseline bi-
gram with trigger-pairs for varying histories. The par-
tition entropy coefficient of the derived clusterings re-
veals the superiority of the interpolated long distance
bigrams against the long distance bigrams and the bi-
grams with trigger-pairs in producing more crisp clus-
ters. In addition, the intra-cluster dispersion demon-
strates that the use of interpolated long distance bi-
grams generates meaningful clusters, similar to those
formed when the bigram model is interpolated with
trigger word pairs for various histories, eliminating the
cluster outliers, which are observed when long distance
bigrams are used. However, clustering with trigger pairs
assigns similar words to more than one cluster and requires appropriate trigger-pair selection, which is not an easy task.
2 Language Modeling and the PLSA
The n-gram model estimates the probability of a word given only the most recent $n-1$ preceding words [2]. In practice, frequently only the bigram or the trigram model is employed. For long distance bigrams [4], a word $w_i$ is predicted by the $d$-th preceding word $w_{i-d}$. Obviously, for $d = 1$, the long distance bigram degenerates to the baseline bigram. The efficiency of the long distance bigram model can be further enhanced by estimating the probabilities of long distance bigrams at $H$ different distances [7].
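The distance-$d$ bigram counts and their linear interpolation over $d = 1, \ldots, H$ can be sketched as follows. This is an illustrative maximum-likelihood estimate only, not the authors' implementation; a practical model would additionally require smoothing, and the interpolation weights are assumed given.

```python
from collections import Counter

def distance_bigram_probs(tokens, d):
    """ML estimate of P(w_i | w_{i-d}) from counts of word pairs
    exactly d positions apart (no smoothing; a sketch only)."""
    pair_counts = Counter(zip(tokens[:-d], tokens[d:]))
    history_counts = Counter(tokens[:-d])
    return {(h, w): c / history_counts[h] for (h, w), c in pair_counts.items()}

def interpolated_probs(tokens, H, weights):
    """Linear interpolation of the distance-d bigram models for
    d = 1..H; `weights` are assumed to sum to one."""
    models = [distance_bigram_probs(tokens, d) for d in range(1, H + 1)]
    vocab = set(tokens)
    probs = {}
    for h in vocab:
        for w in vocab:
            probs[(h, w)] = sum(lam * m.get((h, w), 0.0)
                                for lam, m in zip(weights, models))
    return probs

tokens = "the cat sat on the mat the cat ran".split()
probs = interpolated_probs(tokens, H=2, weights=[0.6, 0.4])
```

For a history word seen at every distance, the interpolated conditional distribution remains properly normalized, since it is a convex combination of proper distributions.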
The PLSA performs a probabilistic mixture decomposition by defining a generative latent data model, the so-called aspect model, which associates an unobserved class variable $z_k \in Z = \{z_1, z_2, \ldots, z_R\}$ with each observation. Here, the observation is simply the occurrence of a word $w_j \in V = \{w_1, w_2, \ldots, w_Q\}$ in a text/document $t_i \in T = \{t_1, t_2, \ldots, t_M\}$, while the unobserved class variable $z_k$ models the topic a text was generated from. Summing over all possible realizations of $z_k$, the joint distribution of the observed data is obtained:
$$P(t_i, w_j) = P(t_i) \underbrace{\sum_{k=1}^{R} P(z_k \mid t_i)\, P(w_j \mid z_k)}_{P(w_j \mid t_i)}. \qquad (1)$$
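The structure of (1) can be checked numerically: given toy, randomly initialized model parameters, the convex combination of aspects yields proper text-specific word distributions, and scaling by $P(t_i)$ yields a joint distribution that sums to one. A minimal NumPy sketch (the dimensions and parameters below are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
M, Q, R = 4, 6, 2   # texts, vocabulary size, latent aspects (toy sizes)

# Toy model parameters, normalized so each is a proper distribution.
P_t = rng.random(M);              P_t /= P_t.sum()                        # P(t_i)
P_z_given_t = rng.random((M, R)); P_z_given_t /= P_z_given_t.sum(1, keepdims=True)
P_w_given_z = rng.random((R, Q)); P_w_given_z /= P_w_given_z.sum(1, keepdims=True)

# Text-specific word distributions: convex combination of the R aspects.
P_w_given_t = P_z_given_t @ P_w_given_z          # shape (M, Q)

# Joint distribution of Eq. (1): P(t_i, w_j) = P(t_i) * P(w_j | t_i).
P_tw = P_t[:, None] * P_w_given_t
```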
As can be seen in (1), the text-specific word distributions $P(w_j \mid t_i)$ are obtained by a convex combination of the $R$ aspects/factors $P(w_j \mid z_k)$. Representing each text $t_i$ as a sequence of words $\langle v_1\, v_2 \ldots v_{Q_i} \rangle$, where $Q_i$ is the number of words in text $t_i$, $P(t_i, w_j)$ can be decomposed as follows:
$$P(t_i, w_j) = P(v_{Q_i} \mid v_{Q_i - 1} \ldots v_1, w_j) \cdot P(v_{Q_i - 1} \mid v_{Q_i - 2} \ldots v_1, w_j) \cdots P(v_1 \mid w_j)\, P(w_j). \qquad (2)$$
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.1027
4210