Hindawi Publishing Corporation
Comparative and Functional Genomics
Volume 2012, Article ID 289694, 11 pages
doi:10.1155/2012/289694

Research Article

Multidimensional Scaling Applied to Histogram-Based DNA Analysis

António C. Costa, 1 J. A. Tenreiro Machado, 2 and Maria Dulce Quelhas 3

1 Department of Informatics Engineering, Institute of Engineering, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida 431, 4200-072 Porto, Portugal
2 Department of Electrical Engineering, Institute of Engineering, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida 431, 4200-072 Porto, Portugal
3 National Health Institute and Biochemical Genetics Unit, Institute of Medical Genetics Center Jacinto de Magalhães, Praça Pedro Nunes 88, 4099-028 Porto, Portugal

Correspondence should be addressed to J. A. Tenreiro Machado, jtm@isep.ipp.pt

Received 8 December 2011; Revised 19 April 2012; Accepted 21 May 2012

Academic Editor: John Parkinson

Copyright © 2012 António C. Costa et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper aims to study the relationships between chromosomal DNA sequences of twenty species. We propose a methodology combining DNA-based word frequency histograms, correlation methods, and an MDS technique to visualize structural information underlying chromosomes (CRs) and species. Four statistical measures are tested (Minkowski, Cosine, Pearson product-moment, and Kendall τ rank correlations) to analyze the information content of 421 nuclear CRs from twenty species.
The proposed methodology is built on mathematical tools and allows the analysis and visualization of very large amounts of stream data, like DNA sequences, with almost no assumptions other than the predefined DNA "word length." This methodology is able to produce comprehensible three-dimensional visualizations of CR clustering and related spatial and structural patterns. The results of the four test correlation scenarios show that the high-level information clusterings produced by the MDS tool are qualitatively similar, with small variations due to the characteristics of each correlation method, and that the clusterings are a consequence of the input data and not artifacts of the method.

1. Introduction

DNA-related information can be analyzed in many different ways, including by methods based on "word frequency" histograms derived from DNA sequences [1]. Histograms are a condensed representation of the original information and allow further processing methods, like correlation, which are not viable on the original data. The correlation between histograms can be computed, producing a correlation matrix that can serve as input to other methods for high-level information extraction and tabular/graphical analysis, such as the multidimensional scaling (MDS) technique, which is able to create low-dimensional representations of complex data while preserving similarities between data points. In [2], the authors describe how the Kendall τ rank correlation method [3] is used to generate the correlation matrix and how an MDS tool [4] is able to generate three-dimensional representations of spatial and structural relationships of the chromosomes and species. In that paper, only one correlation method is applied to the generation of correlation matrices, but many other correlation methods exist and can be used for studying chromosomal/species relationships.
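The pipeline outlined above (word-frequency histograms, a correlation matrix, then MDS) can be illustrated with a minimal sketch. It is not the implementation used in this paper or in [2], and it rests on several assumptions: words are counted with an overlapping sliding window over the alphabet A, C, G, T; the Pearson correlation (via `numpy.corrcoef`) stands in for the correlation step; the dissimilarity is taken as 1 − r; and classical (Torgerson) MDS replaces the MDS tool of [4]. The function names and the toy sequences are hypothetical.

```python
import itertools
import numpy as np

ALPHABET = "ACGT"  # assumption: words containing any other symbol are skipped


def word_histogram(seq, k):
    """Relative frequencies of overlapping length-k words in seq."""
    index = {"".join(w): i
             for i, w in enumerate(itertools.product(ALPHABET, repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def correlation_matrix(histograms):
    """Pearson correlation between every pair of histograms (one per row)."""
    return np.corrcoef(np.vstack(histograms))


def classical_mds(dissimilarity, dims=3):
    """Classical (Torgerson) MDS: embed items in `dims` dimensions."""
    d2 = np.asarray(dissimilarity, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering      # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]   # keep the largest `dims`
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Hypothetical toy sequences standing in for chromosome data.
sequences = {
    "toy_1": "ACGTACGTTGCAAGCTTAGCCGATAGCTA",
    "toy_2": "ACGTACGATGCAAGCTTAGGCGATAGCTT",
    "toy_3": "TTTTAAAATTTAAATTATATTTAAATATA",
    "toy_4": "GGCCGGCCGCGCGGCCCGGGCGCGGCCGC",
}
histograms = [word_histogram(s, k=2) for s in sequences.values()]
corr = correlation_matrix(histograms)
coords = classical_mds(1.0 - corr, dims=3)     # assumption: distance = 1 - r
print(coords.shape)
```

Each row of `coords` is a point in three dimensions for one sequence; sequences with similar word histograms land close together. A rank correlation such as Kendall τ could be substituted in `correlation_matrix` without changing the rest of the sketch.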
As such, we compare and evaluate a set of correlation methods in order to determine whether those relationships appear under all methods and are similar. Our main goals are to find out whether, for each of several correlation methods and word lengths used in the processing of DNA sequences,
(a) the MDS tool generates three-dimensional representations featuring spatial and structural patterns;