Improving the Dynamic Hierarchical Compact Clustering Algorithm by Using Feature Selection

Reynaldo Gil-García and Aurora Pons-Porrata
Center for Pattern Recognition and Data Mining
Universidad de Oriente, Santiago de Cuba, Cuba
{gil,aurora}@cerpamid.co.cu

Abstract. Feature selection has improved the performance of text clustering. In this paper, a local feature selection technique is incorporated in the dynamic hierarchical compact clustering algorithm to speed up the computation of similarities. We also present a quality measure to evaluate hierarchical clustering that considers the cost of finding the optimal cluster from the root. The experimental results on several benchmark text collections show that the proposed method is faster than the original algorithm while achieving approximately the same clustering quality.

1 Introduction

Managing, accessing, searching and browsing large repositories of text documents require efficient organization of the information. In dynamic information environments, such as the World Wide Web or the stream of newspaper articles, it is usually desirable to apply adaptive methods for document organization such as clustering. Dynamic algorithms can update the clustering when data are added to or removed from the collection. These algorithms allow us to dynamically track the ever-changing, large-scale information that is added to or removed from the web every day, without having to perform a complete re-clustering.

Hierarchical clustering algorithms are of additional interest because they provide views of the data at different levels of abstraction, making them ideal for people to visualize and interactively explore large document collections. In the context of hierarchical document clustering, the high dimensionality of the data and the large size of collections are the major challenges facing researchers today.

In [1], a hierarchical clustering algorithm, namely dynamic hierarchical compact (DHC), was presented.
It not only deals with dynamic data while achieving a clustering quality similar to that of static state-of-the-art hierarchical algorithms, but also has a linear computational complexity with respect to the number of dimensions. It uses a multi-layered clustering to update the hierarchy when new documents arrive (or are removed). The process in each layer involves two steps: updating the similarity-based graphs and obtaining the connected components of these graphs. The graph updating requires computing the similarities between clusters, which is the most time-consuming operation.
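
The two steps performed in each layer can be illustrated with a minimal Python sketch. It is not the DHC implementation: it assumes a simple threshold graph (an edge joins two items whose cosine similarity is at least a hypothetical parameter `beta`), handles insertions only, and covers a single layer. The names `cosine`, `add_vertex` and `connected_components` are introduced here for illustration. The sketch shows why the similarity computation dominates the cost: every insertion compares the new item with all existing ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def add_vertex(graph, vectors, new_id, new_vec, beta):
    """Step 1 (assumed threshold graph): insert an item and link it to
    every existing item whose similarity to it is at least beta."""
    graph[new_id] = set()
    for other_id, other_vec in vectors.items():
        if cosine(new_vec, other_vec) >= beta:
            graph[new_id].add(other_id)
            graph[other_id].add(new_id)
    vectors[new_id] = new_vec


def connected_components(graph):
    """Step 2: return the connected components of an undirected graph."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


# Toy documents (made-up term weights, for illustration only).
docs = {
    "d1": {"cluster": 1.0, "text": 1.0},
    "d2": {"cluster": 1.0, "text": 0.5},
    "d3": {"web": 1.0, "news": 1.0},
}
graph, vectors = {}, {}
for doc_id, vec in docs.items():
    add_vertex(graph, vectors, doc_id, vec, beta=0.5)
clusters = connected_components(graph)
```

In this toy run, `d1` and `d2` share terms and end up in one component, while `d3` forms its own. In a multi-layered scheme, the components obtained at one layer would serve as the items to be clustered at the next layer.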