TEXT CATEGORIZATION OF COMMERCIAL WEB PAGES

E. Binaghi, M. Carullo, I. Gallo and M. Madaio
Università degli Studi dell’Insubria
Via Mazzini 5, 21100 Varese, Italy
email: elisabetta.binaghi@uninsubria.it

ABSTRACT
In this paper we describe a new on-line document categorization strategy that can be integrated within Web applications. A salient aspect is the use of neural learning in both the representation and the classification tasks. Text documents are conceived as images, and the regions of interest (RoI) containing information meaningful for categorization are identified with the support of a supervised neural network. Text within the RoI is represented according to a simple solution that considers the first K words in the text and codes them properly. A Kohonen Self-Organizing Map (SOM) is applied to cluster documents, which are subsequently labelled by applying a simple majority voting mechanism. The solutions adopted were evaluated by conducting experiments within the context of on-line price comparison services. The results obtained demonstrate that the overall classification strategy is able to categorize documents satisfactorily, taking into account the high variability of Web pages.

KEY WORDS
Text categorization, Kohonen Self-Organizing Map, neural network, multilayer perceptron

1 Introduction

The World Wide Web is a great resource for all types of information and offers high potential for efficient on-line services in several application domains. The on-line availability of an ever larger amount of commercial information, for example, creates the premise for profitable price comparison services that allow individuals to see lists of prices for specific products. However, the rapid growth of heterogeneous information, usually coded in a relatively free text format, poses a challenge to information management solutions, which become more and more expensive and frustrating.
The problem can be addressed by conceptually organizing the huge amount of data, forming content-based categories within which search and/or mining tools can be efficiently applied. Automated Text Categorization (ATC) is a long-standing research topic dealing with the task of building software capable of classifying text (or hypertext) documents under predefined categories. ATC techniques are the premise for improving Web search engines in finding relevant documents and for Web mining applications [3].

Several Machine Learning (ML) algorithms have been successfully applied to text categorization. They include Neural Networks, Naïve Bayes, Support Vector Machines and k-Nearest Neighbors. Each of these methods has its advantages and limitations; the choice of the categorization algorithm depends upon many factors such as scale and dimensionality [6]. Considerable interest has been devoted to Self-Organizing Maps [4], which approximate an unlimited number of input data items by a finite set of models. This property makes the SOM useful for organizing large collections of data in general, including document collections.

The representation of documents is a central issue in all ATC approaches, having a strong impact on the generalization accuracy of a learning system. The representation should be suitable for the classification task and for the specific learning algorithm adopted. For most classification techniques, the documents, which typically are strings of characters, have to be transformed into quantitative, attribute-value patterns. This requirement agrees with the traditional document encoding method in Information Retrieval, Salton’s vector space model [5], which is based on computing the frequency of occurrence of each word in a document and collecting these frequencies into a vector. This method has been widely and successfully used in several categorization tasks based on different learning strategies.
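As an illustration of the word-frequency encoding described above, the following minimal Python sketch builds one term-frequency vector per document. It is not taken from the paper: the tokenizer, the function names (`tokenize`, `build_vocabulary`, `term_frequency_vector`) and the two toy documents are assumptions made only for this example, and raw counts are used without any weighting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Assign a vector index to every distinct word in the collection."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def term_frequency_vector(text, vocabulary):
    """Return the raw occurrence count of each vocabulary word in the text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


# Hypothetical toy collection, for illustration only.
docs = [
    "cheap digital camera, digital zoom",
    "laptop with cheap shipping",
]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(doc, vocab) for doc in docs]
```

Each vector has one component per distinct word in the collection, so its length grows with the vocabulary; with only two short documents the vectors already have seven components.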
However, it is impracticable to encode the documents in a large collection using the vector space model as such. Other techniques, alternative or complementary to the original vector space model, have been proposed in the literature. Heuristic, problem-driven preprocessing and/or feature selection strategies are included in an attempt to reduce the size of the vector space [3, 2].

The objective of our work is to design and implement a new on-line document categorization strategy named Tc-system. It makes use of neural learning in both the representation and the classification tasks. Web pages are conceived as images: within the overall image, the region of interest (RoI) containing information meaningful for categorization is identified with the support of a supervised neural network. Text within the RoI is represented according to a simple solution that considers the first k words in the text and codes them properly. A second task based on a Kohonen Self-Organizing Map is applied to cluster documents; clusters