Pseudo-Supervised Clustering for Text Documents

M. Maggini, L. Rigutini, and M. Turchi
Dipartimento di Ingegneria dell'Informazione
Università di Siena
Via Roma, 56 - Siena, Italy
{maggini,rigutini,turchi}@dii.unisi.it

Abstract

Effective solutions for Web search engines can take advantage of algorithms for the automatic organization of documents into homogeneous clusters. Unfortunately, document clustering is not an easy task especially when the documents share a common set of topics, like in vertical search engines. In this paper we propose two clustering algorithms which can be tuned by the feedback of an expert. The feedback is used to choose an appropriate basis for the representation of documents, while the clustering is performed in the projected space. The algorithms are evaluated on a dataset containing papers from computer science conferences. The results show that an appropriate choice of the representation basis can yield better performance with respect to the original vector space model.

1 Introduction

Search engines are the most used service to access the resources available on the Web. One of the main issues in the design of the search interface is to properly organize the query results in order to ease the selection of the most appropriate result. Thus, proper ranking schemes, like the PageRank used by the Google search engine, have been proposed to order the result list according to an absolute and user independent criterion. However, some other approaches have been used to organize the results in groups in order to direct the user choice to the most interesting result subset (e.g. vivisimo.com).

Another interesting feature of a search engine is to access a list of documents similar to a given one. This approach can be particularly useful for focused search engines, where the documents belong to a restricted set of topics and the formulation of a precise keyword-based query might be difficult.
For example, the Citeseer (www.citeseer.com) search engine is a widely used service for retrieving computer science papers gathered from the Web, and its current version offers navigation through the document corpus based on a hierarchical directory of topics. Since it is infeasible to manually organize all the documents in a Web search engine, the application of automatic text processing techniques, like classification and clustering, is increasing.

In this paper we propose two clustering methods that can exploit the feedback of an expert. Both methods consist of two steps. In the first step a set of example documents is organized in groups, either automatically or manually by an expert. This set is used to compute a basis for the representation of the documents. Two schemes proposed in the literature have been adopted: the Singular Value Decomposition (SVD) [4] and a variation of the Concept Matrix Decomposition (CMD) [5]. In the second step, the entire document corpus is represented using the chosen vector basis and is partitioned using a clustering algorithm. Thus the supervision of the expert can be used to bias the document representation to reflect the human clustering criteria.

The paper is organized as follows. In the next section we introduce the vector model representation used for documents and the dimensionality reduction techniques proposed for Information Retrieval. In Section 3 we describe the proposed pseudo-supervised clustering algorithms. Section 4 defines some indexes that can be used to evaluate the clustering results. Finally, in Section 5 the results on a dataset containing about 1000 full papers from computer science conferences are reported, and in Section 6 the conclusions are drawn.

2 Document representation

In Automatic Text Processing, a widely used representation for text documents is the Vector Space Model [12].
In this model each document is represented by a vector in a |V|-dimensional space, where V is the term vocabulary. The value of the i-th component of the j-th vector is the weight of the i-th word in the j-th document. The most widely used word-weighting schemes are tf and tf-idf [13]. In the first scheme the weight is the term frequency in the document,