Comparing Representative Selection Strategies for Dissimilarity Representations

Zane Reynolds, Horst Bunke, Mark Last, and Abraham Kandel

Abstract—Many of the computational intelligence techniques currently in use do not scale well with data type or computational performance, so selecting the right dimensionality reduction technique for the data is essential. By employing a dimensionality reduction technique called representative dissimilarity to create an embedded space, large spaces of complex patterns can be simplified to a fixed-dimensional Euclidean space of points. The only current suggestions for how the representatives should be selected are principal component analysis, projection pursuit, and factor analysis. Several alternative representative selection strategies are proposed and empirically evaluated on a set of term vectors constructed from HTML documents. The results indicate that a representative dissimilarity representation with at least 50 representatives can achieve a significant increase in classification speed with a minimal sacrifice in accuracy, and that selecting the representatives randomly significantly reduces the time required to create the embedded space, again with only a small penalty in accuracy.

Index Terms—Dimensionality reduction, dissimilarity representation, document classification, representative selection, prototype selection

This work was supported in part by the National Institute for Systems Test and Productivity at the University of South Florida under US Space and Naval Warfare Systems Command Contract No. N00039-02-C-3244 and by the Fulbright Foundation, which granted Prof. Kandel the Fulbright Research Award at Tel-Aviv University, College of Engineering, during the academic year 2003-2004.
Z. Reynolds is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620 USA (phone: 813-974-4432; e-mail: zreynold@csee.usf.edu).
H. Bunke is with the Institut für Informatik und angewandte Mathematik, University of Bern, CH-3012 Bern, Switzerland (e-mail: bunke@iam.unibe.ch).
M. Last is with the Department of Information Systems and Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel (e-mail: mlast@bgu.ac.il).
A. Kandel is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620 USA (e-mail: kandel@csee.usf.edu).

I. INTRODUCTION

Most of the computational intelligence (CI) techniques in use have strong mathematical roots, assuming that the data are highly structured or organized and that their relationships are either known or can be measured. Web mining data such as HTML documents tend to be loosely structured; that is, the exact relationships between documents are not fully understood, which makes the application of computational intelligence techniques more difficult. Content such as audio or video clips, images, and complex commodities, which are highly variable and more sophisticated than documents, can make the application of computational intelligence techniques awkward at best. These data types are high-dimensional; that is, many aspects contribute to defining them. For CI methods to handle these high-dimensional data types, their dimensionality must be reduced to a manageable level while preserving as much information as possible.

The traditional method for managing complex objects is a set of domain-specific descriptive or characteristic features, called a feature vector. Usually, the features that comprise the feature vector are selected by an expert in the particular object domain and can be quantitative, qualitative, or a mixture of both [6]. This representation of objects is simple and allows the use of computational intelligence methods that operate in a Euclidean feature space. However, the features must be defined by a domain expert, and even if the expert is able to define relevant and measurable features, hidden dependencies between the features can exist. Sometimes the features are too inefficient to compute, features that seem relevant may have poor discriminatory power, or relevant features are left out altogether. Selecting too many features can lead to the 'dimensionality curse' problem, in which the performance of the applied computational intelligence technique degrades exponentially as a function of the data dimensionality [12].

In order to overcome these problems, a dissimilarity representation can be employed to create an embedded Euclidean space [7]. The dissimilarity representation describes objects relatively; that is, objects are defined by their distance or dissimilarity to the other objects in the set rather than by absolute feature values. The objects are mapped into the embedded space by measuring their distances to every other object in the set. Given that n is the number of objects in the set, this mapping requires n^2 distances to be measured, which can be reduced to n(n-1)/2 if the distance function is symmetric. Computational intelligence techniques can then be applied directly to the embedded space. The main advantages of this method are that no domain-specific knowledge is required beyond an object distance metric, and that the application of a computational intelligence technique to the resulting embedded space is more straightforward. The main disadvantages of this method are that a distance metric for the objects is required (preferably
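The embedding described above can be made concrete with a small sketch. The code below is an illustration, not the authors' implementation: it maps toy term-count vectors into a k-dimensional embedded space by measuring each object's distance to k randomly chosen representatives (one of the selection strategies the paper evaluates); the function names, the Euclidean metric, and the toy data are all assumptions made for illustration.

```python
import math
import random

def euclidean(a, b):
    # Symmetric distance metric, so d(x, y) == d(y, x); any metric
    # with this property allows the n(n-1)/2 reduction noted above.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def embed(objects, k, dist=euclidean, seed=0):
    """Map each object to a k-dimensional point whose coordinates are
    its distances to k randomly selected representatives. This costs
    n * k distance computations rather than the n(n-1)/2 needed for
    the full pairwise dissimilarity matrix."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    reps = rng.sample(objects, k)      # random representative selection
    return [[dist(obj, r) for r in reps] for obj in objects]

# Toy "term vectors": each object is a fixed-length term-count vector.
docs = [[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 1, 1], [0, 0, 5]]
points = embed(docs, k=2)
# Every document is now a point in a 2-dimensional Euclidean space,
# regardless of the original vocabulary size, so any vector-space
# classifier can be applied to the embedded points.
```

Note that a representative's embedded point always has a zero in its own coordinate, and that the quality of the embedding depends on which representatives are chosen, which is precisely the question the selection strategies address.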