Computational Statistics & Data Analysis 41 (2002) 91–122
www.elsevier.com/locate/csda

Dimension reduction techniques and the classification of bent double galaxies

Imola K. Fodor*, Chandrika Kamath

Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, P.O. Box 808, L-560, Livermore, CA 94551, USA

* Corresponding author. Tel.: +1-925-424-5420; fax: +1-925-422-6287. E-mail address: fodor1@llnl.gov (I.K. Fodor).

0167-9473/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-9473(02)00061-0

Abstract

As data mining gains acceptance in the analysis of massive data sets, it is becoming clear that there is a need for algorithms that can handle not only the massive size, but also the high dimensionality of the data. Certain pattern recognition algorithms can become computationally intractable when the number of features reaches hundreds or even thousands, while others can break down if there are large correlations among the features. A common solution to these problems is to reduce the dimension, either in conjunction with the pattern recognition algorithm or independent of it.

We describe how dimension reduction techniques can be applied in the context of a specific data mining application, namely, the classification of radio-galaxies with a bent double morphology. We discuss certain statistical and exploratory data analysis methods to reduce the number of features, and the subsequent improvements in the performance of decision tree and generalized linear model classifiers. We show that a careful extraction and selection of features is necessary for the successful application of data mining techniques. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Data mining; Exploratory data analysis; Feature selection; Dimension reduction; Classification; Decision trees; Generalized linear models

1. Introduction

As commercial and scientific datasets approach the terabyte and even petabyte range, it is no longer possible to manually find useful information in such data. To address this problem, semi-automated techniques from data mining are increasingly being used as a viable means of analyzing these massive data sets. Data mining is an iterative