DOI: 10.1007/s10910-006-9118-5
Journal of Mathematical Chemistry, Vol. 40, No. 1, July 2006 (© 2006)

A method for clustering and screening of long-dimensional chemical data based on fingerprints and similarity measurements

Manuel Urbano Cuadrado, Gonzalo Cerruela García, Irene Luque Ruiz, and Miguel Ángel Gómez-Nieto

Department of Computing and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
E-mail: ma1lurui@uco.es

A method for the treatment of long-dimensional chemical data arrays is presented in this work with the aim of maximising the performance of classification models. The method is based on the construction of fingerprints and the subsequent generation of a similarity matrix. The similarity calculation has been modified through a scaling process to take into account the different significance of the variables. The method was applied to spectral measurements of wines, and several aspects were studied, namely the threshold used in the construction of fingerprints and patterns, the weighting factor used for scaling, the normalisation method, etc. The application of both Principal Components Analysis and Soft-Independent Modelling of Class Analogies to the similarity matrices gave better classifications of the information than those obtained using the original data.

KEY WORDS: data preparation, similarity calculation, fingerprints, clustering, screening

MSC 2000: 68T10, 62H30, 93C35

1. Introduction

Data employed for modelling of natural or artificial processes can be obtained from scientific and engineering experiments. Modern instrumental techniques provide scientists with Long-Dimensional Data Arrays (LDDA) in short intervals of time. The information that can be extracted from these LDDAs depends considerably on the applicability of mathematical and statistical methods to these data sets.
Corresponding author. 0259-9791/06/0700-0015/0 © 2006 Springer Science+Business Media, Inc.

Multivariate analysis is the statistical discipline that encompasses methods dealing with the study of phenomena or objects characterised by n observations or properties, respectively [1,2]. Supervised Pattern Recognition Techniques (SPRT). The assignment of an object O_i into a given class C_j can be expressed as a function of a set of