An online document clustering technique for short web contents

Moreno Carullo *, Elisabetta Binaghi, Ignazio Gallo

Università degli Studi dell'Insubria, Dipartimento di Informatica e Comunicazione, 21100 Varese, Italy

Article info

Article history:
Received 19 June 2008
Received in revised form 30 January 2009
Available online 10 April 2009
Communicated by M.-J. Li

Keywords:
Online clustering
Short documents analysis
Similarity measures

Abstract

Document clustering techniques have been applied in several areas, with the web as one of the most recent and influential. Both general-purpose and text-oriented techniques exist and can be used to cluster a collection of documents in many ways. This work proposes a novel heuristic online document clustering model that can be specialized with a variety of text-oriented similarity measures. An experimental evaluation of the proposed model was conducted in the e-commerce domain. Performance was measured using a clustering-oriented metric based on the F-Measure and compared with that obtained by other well-known approaches. The results confirm the validity of the proposed method both for batch scenarios and for online scenarios where document collections can grow over time.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Vast amounts of unstructured documents are available in many fields, and on the web in particular documents can be retrieved from virtually anywhere. However, to exploit the potential of such information, for example to discover new facts or to understand the popularity of some concept or product, one should group similar documents into clusters. In the web scenario, where a huge number of documents is available and grows every day, document clustering approaches with polynomial or exponential time complexity may prove unsatisfactory.
The document clustering process (Frakes and Baeza-Yates, 1992), an instance of the cluster analysis paradigm, addresses the problem of dividing a collection of documents $D = \{d_1, \ldots, d_n\}$ into subsets $\{D_1, \ldots, D_c\}$ such that, for each $i = 1, \ldots, c$, all the documents $d_j \in D_i$ are more similar to each other, with respect to a given similarity measure $S$, than to documents outside the cluster.

Document clustering techniques can be divided into flat and hierarchical approaches (Willett, 1988), both of which can in turn be subdivided into soft and hard clustering, with soft clustering providing the added ability to assign a single document to multiple clusters (with a degree of membership). Soft and hierarchical techniques provide additional information at an additional computational cost, which should be incurred only if needed. Within the flat and hard paradigm, the general-purpose K-Means algorithm (MacQueen, 1967; Hartigan and Wang, 1979) has been successfully applied in the document analysis domain. Steinbach et al. (2000) show that K-Means is competitive with the well-known text-oriented Hierarchical Agglomerative Clustering Method (HACM) (Frakes and Baeza-Yates, 1992). A soft version of K-Means, the Expectation Maximization algorithm (Dempster et al., 1977), is also widely used within the document analysis domain.

Unsupervised neural models have been investigated in depth to solve clustering problems. In particular, following a flat and soft clustering approach, Self-Organizing Maps (Kohonen, 1995) have been employed for document clustering tasks (Lagus et al., 2004) and are particularly suitable when a meaningful and browsable 2D map of the considered document collection is required.

Online algorithms (Duda et al., 2000) are designed to process data incrementally, as they become available.
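As a hedged illustration of incremental processing, the sketch below assigns each incoming document either to its most similar existing cluster or to a new one, using only the cluster summaries seen so far. It assumes whitespace-tokenized term counts, cosine similarity, and a fixed similarity threshold; the class name `OnlineClusterer`, its `threshold` parameter, and the sample product titles are hypothetical. It is a generic threshold-based single-pass scheme, not the model proposed in this paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class OnlineClusterer:
    """Hypothetical single-pass clusterer: each document is seen exactly once."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed minimum similarity to join a cluster
        self.centroids = []         # one Counter of summed term counts per cluster
        self.members = []           # document ids per cluster

    def add(self, doc_id, text):
        """Assign one document as it arrives and return its cluster index."""
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Local update: only the winning cluster summary changes.
            self.centroids[best].update(vec)
            self.members[best].append(doc_id)
            return best
        # No cluster is similar enough: the document starts a new one.
        self.centroids.append(vec)
        self.members.append([doc_id])
        return len(self.centroids) - 1


# Documents arrive one at a time; earlier assignments are never revisited.
clusterer = OnlineClusterer(threshold=0.5)
stream = [
    ("d1", "red cotton shirt"),
    ("d2", "red cotton shirt large"),
    ("d3", "usb flash drive"),
]
labels = [clusterer.add(doc_id, text) for doc_id, text in stream]
```

Because each arriving document is compared only against the current cluster summaries, the cost of one assignment grows with the number of clusters rather than with the number of documents already processed. The price is that the result depends on the arrival order and on the chosen threshold.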
In contrast, standard batch approaches have all the data available from the start of the processing phase, so a global optimization algorithm can be applied. Consequently, online algorithms tend to perform local optimization on new data, with an overall running time usually shorter than that of their batch counterparts.

The flat and hard Single Pass algorithm (Frakes and Baeza-Yates, 1992) has its roots in the early work on online document clustering (Salton, 1971), where strict time and space constraints played a major role. The Leader–Follower algorithm (Duda et al., 2000), developed within the Adaptive Resonance Theory (ART), has a substantially similar structure. More complex ART models dealing with the plasticity/stability dilemma have been developed for both supervised and unsupervised tasks, and a vigilance parameter is included as a means to either permit or prevent the update of the cluster state.

* Corresponding author. Tel.: +39 0332 218941; fax: +39 0332 218909. E-mail address: moreno.carullo@uninsubria.it (M. Carullo).
Pattern Recognition Letters 30 (2009) 870–876. doi:10.1016/j.patrec.2009.04.001

Zamir and Etzioni (1998) present an incremental, online algorithm for web document clustering. The approach uses common sub-phrases to build document clusters and is designed