Pattern Recognition 42 (2009) 2950--2960
A new dual wing harmonium model for document retrieval
Haijun Zhang, Tommy W.S. Chow∗, M.K.M. Rahman
Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
ARTICLE INFO
Article history:
Received 14 October 2008
Received in revised form 16 February 2009
Accepted 15 March 2009
Keywords:
Dual wing harmonium
Term connection
Graph representation
Document retrieval
Multiple features
ABSTRACT

A new dual wing harmonium model is proposed for document retrieval; it integrates term frequency features and term connection features into a low-dimensional semantic space without increasing the computational load. Terms and vectorized graph connections are extracted from the graph representation of each document by employing a weighted feature extraction method. We then develop a new dual wing harmonium model that projects these multiple features into low-dimensional latent topics under different probability distribution assumptions. The contrastive divergence algorithm is used for efficient learning and inference. We perform extensive experimental verification, and the comparative results suggest that the proposed method is accurate and computationally efficient for document retrieval.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
The rapid development of the Internet has made massive amounts of
document data available and easily accessible, which leads to a
growing demand for higher accuracy and speed in document retrieval.
Document retrieval refers to finding documents similar to a given
user's query. A user's query can range from a full description of a
document to a few keywords. Most of the extensively used retrieval
approaches are keyword-based searching methods, e.g. www.google.com,
in which untrained users provide a few keywords and the search engine
finds the relevant documents in a returned list. Another type of
document retrieval uses a query document to search for similar ones.
Using an entire document as a query improves retrieval accuracy, but
it is more computationally demanding than the keyword-based method.
Most existing document retrieval systems use only term frequency as
feature units to build statistical models and develop natural language
processing (NLP) approaches for document retrieval [1]. Usually the
connections among terms are overlooked, which results in the loss of
important semantic information. To exploit the rich information in
documents and enhance the performance of relevant data mining, it is
often necessary to model more features extracted from documents in a
lower-dimensional semantic space.
∗ Corresponding author. Tel.: +852 27887756; fax: +852 27887791.
E-mail address: eetchow@cityu.edu.hk (T.W.S. Chow).
0031-3203/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2009.03.021

The vector space model (VSM) [2], the most popular and widely used
term frequency (tf)–inverse document frequency (idf) scheme, uses
a basic vocabulary of “words” or “terms” for feature description. The
term frequency is the number of occurrences of each term in a document,
and the inverse document frequency is a function of the number of
documents in which a term occurs. A term-weighted vector is constructed
for each document using tf and idf. Similarity between two documents
is then measured using the “cosine” distance or other distance
functions [3]. Thus, the VSM scheme reduces the arbitrary-length term
vector of each document to a fixed length. However, a lengthy vector is
still required to describe the frequency information of terms, because
the number of words involved is usually huge. This causes a significant
increase in computational burden, making the VSM impractical for large
corpora. In addition, the VSM scheme reveals little statistical
structure in a document because it uses only low-level document
features (i.e., term frequency).
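The tf–idf weighting and cosine similarity described above can be sketched as follows. This is a minimal illustration only, not the implementation used in this paper; the toy corpus, the whitespace tokenizer, and the unsmoothed idf variant log(N/df) are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf weighted term vectors for a small corpus.

    tf  = raw count of a term in a document
    idf = log(N / df), where df is the number of documents
          containing the term (a common unsmoothed variant).
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))          # document frequency per term
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "harmonium models for document retrieval",
    "document retrieval with term frequency features",
    "graph representation of documents",
]
vecs = tfidf_vectors(docs)
# Rank documents 1 and 2 against document 0 used as the query.
sims = [cosine(vecs[0], v) for v in vecs[1:]]
```

Note that every document, regardless of length, is mapped to weights over the same vocabulary, which is the fixed-length reduction discussed above; the sparse-dict representation also makes plain why the vocabulary size drives the cost for a large corpus.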
To overcome the shortcomings of the VSM, researchers have proposed
several dimensionality reduction methods that use low-dimensional
latent representations to capture document semantics. Latent semantic
indexing (LSI) [4], an extension of the VSM, maps documents and terms
to a latent space representation by performing a linear projection that
compresses the feature vector of the VSM into a low dimension. Singular
value decomposition (SVD) is employed to find the hidden semantic
associations between terms and documents for conceptual indexing. In
addition to feature compression, the LSI model is useful for encoding
semantics [5]. A step forward in probabilistic models is probabilistic
latent semantic indexing (PLSI) [6], which defines a proper generative
model of the data, modeling each word in a document as a sample from a
mixture distribution and developing factor representations for the
mixture components. Chien and Wu [7] further developed an adaptive
Bayesian PLSI for incremental learning and corrective training that was
designed to retrieve relevant documents in the presence of changing domain or