Pattern Recognition 42 (2009) 2950–2960

A new dual wing harmonium model for document retrieval

Haijun Zhang, Tommy W.S. Chow, M.K.M. Rahman
Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Article history: Received 14 October 2008; received in revised form 16 February 2009; accepted 15 March 2009

Keywords: Dual wing harmonium; Term connection; Graph representation; Document retrieval; Multiple features

Abstract

A new dual wing harmonium model is proposed for document retrieval; it integrates term frequency features and term connection features into a low dimensional semantic space without increasing the computational load. Terms and vectorized graph connectionists are extracted from the graph representation of a document by a weighted feature extraction method. We then develop a new dual wing harmonium model that projects these multiple features into low dimensional latent topics under different probability distribution assumptions. A contrastive divergence algorithm is used for efficient learning and inference. We perform extensive experimental verification, and the comparative results suggest that the proposed method is accurate and computationally efficient for document retrieval.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

The rapid development of the Internet has made massive amounts of document data available and easily accessible, which leads to a growing demand for higher accuracy and speed in document retrieval. Document retrieval refers to finding documents similar to a given user's query. A user's query can range from a full description of a document to a few keywords. Most of the extensively used retrieval approaches are keyword-based searching methods, e.g.
www.google.com, in which untrained users provide a few keywords to the search engine, which finds the relevant documents and presents them in a returned list. Another type of document retrieval uses a query document to search for similar documents. Using an entire document as the query improves retrieval accuracy, but it is more computationally demanding than the keyword-based methods. Most existing document retrieval systems use only term frequency as the feature units to build statistical models and develop natural language processing (NLP) approaches for document retrieval [1]. The connections among terms are usually overlooked, which results in losing important semantic information in documents. To exploit the rich information in documents and enhance the performance of relevant data mining, it is often necessary to model more features extracted from documents in a lower dimensional semantic space.

(Corresponding author: T.W.S. Chow. Tel.: +852 27887756; fax: +852 27887791. E-mail address: eetchow@cityu.edu.hk. doi:10.1016/j.patcog.2009.03.021)

The vector space model (VSM) [2], the most popular and widely used term frequency (tf)–inverse document frequency (idf) scheme, uses a basic vocabulary of "words" or "terms" for feature description. The term frequency is the number of occurrences of each term, and the inverse document frequency is a function of the number of documents in which a term occurs. A term-weighted vector is constructed for each document using tf and idf. The similarity between two documents is then measured using the "cosine" distance or any other distance function [3]. Thus, the VSM scheme reduces the arbitrary-length term vector of each document to a fixed length. However, a lengthy vector is required to describe the frequency information of terms, because the number of words involved is usually huge.
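The tf–idf weighting and cosine similarity described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper; the particular tf and idf variants (raw counts, idf = log(N/df)) are one common choice among several, and the toy corpus is invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf weighted term vectors for a small corpus.

    tf  = raw count of the term in the document;
    idf = log(N / df), where df is the number of documents
          containing the term (one classic variant).
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))        # document frequency of each term
    vocab = sorted(df)                # fixed-length vocabulary ordering
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical three-document corpus, for illustration only.
docs = [
    "latent semantic indexing maps documents to a latent space",
    "term frequency vectors describe documents",
    "harmonium models learn latent topics",
]
vocab, vecs = tfidf_vectors(docs)
```

Note that every document is represented by a vector of length `len(vocab)`, regardless of its own length; this fixed-length, vocabulary-sized representation is exactly the lengthy vector the text criticizes for large corpora.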
This causes a significant increase in computational burden, making the VSM model impractical for large corpora. In addition, the VSM scheme reveals little statistical structure in a document, because it uses only low-level document features (i.e., term frequency).

To overcome the shortcomings of VSM, researchers have proposed several dimensionality reduction methods that use low dimensional latent representations to capture document semantics. Latent semantic indexing (LSI) [4], an extension of the VSM model, maps documents and terms to a latent space representation by performing a linear projection that compresses the feature vector of the VSM model into a low dimension. Singular value decomposition (SVD) is employed to find the hidden semantic associations between terms and documents for conceptual indexing. In addition to feature compression, the LSI model is useful for encoding semantics [5]. A step forward in probabilistic models is probabilistic latent semantic indexing (PLSI) [6], which defines a proper generative model of the data, modeling each word in a document as a sample from a mixture distribution and developing factor representations for the mixture components. Chien and Wu [7] further developed an adaptive Bayesian PLSI for incremental learning and corrective training, designed to retrieve relevant documents in the presence of changing domain or