A Text Clustering Framework for Information Retrieval

Sergio Decherchi, Paolo Gastaldo, Judith Redi and Rodolfo Zunino
Dept. Biophysical and Electronic Engineering, University of Genoa, 16145 Genoa, Italy
{sergio.decherchi, paolo.gastaldo, judith.redi, rodolfo.zunino}@unige.it

Abstract: Text-mining methods have become a key feature of homeland-security technologies, as they help explore ever-increasing masses of digital documents effectively in the search for relevant information. This research presents a model for document clustering that arranges unstructured documents into content-based homogeneous groups. The overall paradigm is hybrid because it combines pattern-recognition grouping algorithms with semantic-driven processing. First, a semantic-based metric measures distances between documents by combining content-based and behavioral analysis. This metric takes into account the lexical properties, the structure and the style of the processed documents. In a second step, the model relies on a Radial Basis Function (RBF) kernel-based mapping for clustering documents. The major novelty of the proposed approach is that it exploits the implicit mapping of RBF kernel functions to tackle the crucial task of normalizing similarities, while embedding semantic information in the whole mechanism. Experimental results on the Reuters and Newsgroup 20 databases validate the proposed approach.

Keywords: document clustering, homeland security, kernel k-means, document similarity, text mining, unsupervised learning

1. Introduction

The automated surveillance of information sources is of strategic importance to effective homeland security [1],[2].
The increased availability of data-intensive heterogeneous sources provides a valuable asset for intelligence tasks; data-mining methods have therefore become a key feature of security-related technologies [2],[3], as they help explore increasing masses of digital data effectively when searching for relevant information.

Text-mining techniques provide a powerful tool to deal with large amounts of unstructured text data gathered from heterogeneous multimedia sources (e.g. Optical Character Recognition, audio via speech transcription, web-crawling agents; see Fig. 1) [4],[5]. Text-mining methods can be applied successfully in the network security domain, following several approaches [5]: detection/tracking tools can continuously monitor specific topics over time; document classifiers label individual files and build up models for possible subjects of interest; relations among the selected subjects can then be detected with the help of clustering tools. As a result, text mining can profitably support intelligence and security activities in identifying, tracking, extracting, classifying and discovering patterns, so that the outcomes can generate alert notifications accordingly [6],[7].

This work addresses document clustering and presents a dynamic, adaptive clustering model that arranges unstructured documents into content-based homogeneous groups. The framework implements a hybrid paradigm, which combines content-driven similarity processing with pattern-recognition grouping algorithms. Distances between documents are computed by a semantic-based hypermetric: the approach integrates a content-based analysis with a user-behavioral analysis, as it takes into account both lexical and style-related features of the documents at hand. The core clustering strategy exploits a kernel-based version of the conventional k-means algorithm [8]; the present implementation relies on a Radial Basis Function (RBF) kernel-based mapping [9].
The advantage of using such a kernel is that it supports normalization implicitly; normalization is a critical issue in most text-mining applications, since it prevents extensive properties of documents (such as length, lexicon, etc.) from distorting the representation and affecting performance.

Standard benchmarks for content-based document management, the Reuters database [10] and the Newsgroup 20 database [11], provided the experimental domain for the proposed methodology. The research shows that the document clustering framework based on kernel k-means can generate consistent structures for information access and retrieval.

Figure 1. Different sources of data compete to produce knowledge when processed by a common clustering engine
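
The kernel-based clustering strategy outlined in this introduction can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian form of the RBF kernel, the parameter `sigma`, the round-robin initialisation, the helper names `rbf_kernel_matrix` and `kernel_kmeans`, and the toy one-dimensional distances (standing in for the semantic-based document distance) are all assumptions made for illustration. The sketch shows two things: every item has kernel value 1 with itself, so all mapped documents have unit norm in feature space (the implicit normalization), and distances to cluster centroids can be computed from kernel values alone.

```python
import math


def rbf_kernel_matrix(dist, sigma=1.0):
    """Map pairwise distances to RBF kernel values.

    Assumed form: K[i][j] = exp(-dist[i][j]**2 / (2 * sigma**2)).
    """
    return [[math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in row]
            for row in dist]


def kernel_kmeans(K, n_clusters, max_iter=50):
    """Cluster items using only the kernel matrix K (no explicit vectors)."""
    n = len(K)
    # Assumption: round-robin initialisation, chosen only for determinism.
    labels = [i % n_clusters for i in range(n)]
    for _ in range(max_iter):
        clusters = [[i for i in range(n) if labels[i] == c]
                    for c in range(n_clusters)]
        # Squared norm of each centroid in feature space.
        centroid_sq = [
            sum(K[a][b] for a in m for b in m) / (len(m) ** 2) if m else 0.0
            for m in clusters
        ]
        new_labels = []
        for i in range(n):
            best, best_d = labels[i], float("inf")
            for c, m in enumerate(clusters):
                if not m:
                    continue  # an empty cluster cannot attract points
                cross = sum(K[i][j] for j in m) / len(m)
                # ||phi(x_i) - centroid_c||^2 written with kernel values only
                d = K[i][i] - 2.0 * cross + centroid_sq[c]
                if d < best_d:
                    best, best_d = c, d
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Toy stand-in for document distances: six points on a line, two groups.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
D = [[abs(a - b) for b in points] for a in points]
K = rbf_kernel_matrix(D, sigma=1.0)

# Implicit normalization: every item has unit norm in feature space.
print(all(abs(K[i][i] - 1.0) < 1e-12 for i in range(len(points))))  # True
print(kernel_kmeans(K, n_clusters=2))  # [0, 0, 0, 1, 1, 1]
```

Because the diagonal of the kernel matrix is constant, the term `K[i][i]` does not influence which cluster is chosen; in this sketch an item's raw magnitude affects the assignment only through its distances to other items.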