Hindawi Publishing Corporation
Comparative and Functional Genomics
Volume 2012, Article ID 289694, 11 pages
doi:10.1155/2012/289694
Research Article
Multidimensional Scaling Applied to Histogram-Based
DNA Analysis
António C. Costa,1 J. A. Tenreiro Machado,2 and Maria Dulce Quelhas3
1 Department of Informatics Engineering, Institute of Engineering, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida 431,
4200-072 Porto, Portugal
2 Department of Electrical Engineering, Institute of Engineering, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida 431,
4200-072 Porto, Portugal
3 National Health Institute and Biochemical Genetics Unit, Institute of Medical Genetics Center Jacinto de Magalhães,
Praça Pedro Nunes 88, 4099-028 Porto, Portugal
Correspondence should be addressed to J. A. Tenreiro Machado, jtm@isep.ipp.pt
Received 8 December 2011; Revised 19 April 2012; Accepted 21 May 2012
Academic Editor: John Parkinson
Copyright © 2012 António C. Costa et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
This paper aims to study the relationships between chromosomal DNA sequences of twenty species. We propose a methodology
combining DNA-based word frequency histograms, correlation methods, and an MDS technique to visualize structural
information underlying chromosomes (CRs) and species. Four statistical measures are tested (Minkowski, Cosine, Pearson
product-moment, and Kendall τ rank correlations) to analyze the information content of 421 nuclear CRs from twenty species.
The proposed methodology is built on mathematical tools and allows the analysis and visualization of very large amounts of stream
data, like DNA sequences, with almost no assumptions other than the predefined DNA “word length.” This methodology is able to
produce comprehensible three-dimensional visualizations of CR clustering and related spatial and structural patterns. The results
of the four correlation scenarios tested show that the high-level information clusterings produced by the MDS tool are qualitatively
similar, with small variations due to the characteristics of each correlation method, and that the clusterings are a consequence of the
input data and not artifacts of the method.
1. Introduction
DNA-related information can be analyzed in many different
ways, including by methods based on “word frequency”
histograms derived from DNA sequences [1]. Histograms are
a condensed representation of the original information and
allow further processing by methods, such as correlation, that
are not viable on the original data. The correlation between
histograms can be computed, producing a correlation matrix
that can serve as input to other methods for high-level
information extraction and tabular/graphical analysis like
the multidimensional scaling (MDS) technique, which is
able to create low-dimensional representations of complex
data while preserving similarities between data points. In
[2], the authors describe how the Kendall τ rank correlation
method [3] is used to generate the correlation matrix and
how an MDS tool [4] is able to
generate three-dimensional representations of spatial and
structural relationships of the chromosomes and species.
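
The histogram and correlation steps described above can be illustrated with a minimal sketch. It is not the implementation used in [2]: the overlapping sliding window, the restriction to the A, C, G, T alphabet, the normalization to relative frequencies, and the toy sequences are all assumptions made for illustration, and the function names are hypothetical. Only two of the possible measures (cosine and Pearson product-moment) are shown.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: words containing other symbols (e.g. N) are skipped


def word_histogram(sequence, word_length):
    """Relative frequency of each DNA word, counted with an overlapping window."""
    words = ["".join(p) for p in product(ALPHABET, repeat=word_length)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(sequence) - word_length + 1):
        word = sequence[i:i + word_length]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    # Fixed lexicographic word order, so histograms are directly comparable.
    return [counts[w] / total for w in words] if total else [0.0] * len(words)


def cosine(x, y):
    """Cosine similarity; undefined (division by zero) for all-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def pearson(x, y):
    """Pearson product-moment correlation, i.e. cosine of the centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


def similarity_matrix(histograms, measure):
    """Square matrix of pairwise similarities between histograms."""
    return [[measure(h1, h2) for h2 in histograms] for h1 in histograms]


# Hypothetical toy sequences standing in for chromosomes.
toy = {
    "s1": "ACGTACGTACGTAC",
    "s2": "ACGTACGAACGTAC",
    "s3": "GGGGCCCCGGGGCC",
}
hists = [word_histogram(seq, 2) for seq in toy.values()]
sim = similarity_matrix(hists, cosine)
sim_pearson = similarity_matrix(hists, pearson)
```

In this toy case the two similar sequences (s1 and s2) score much closer to each other than either does to s3. A matrix of this kind, possibly converted to dissimilarities, is the sort of input an MDS tool takes to produce a low-dimensional map.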
In that paper, only one correlation method is applied to
the generation of correlation matrices, but many other
correlation methods exist and can be used for studying
chromosomal/species relationships. We therefore compare
and evaluate a set of correlation methods in order to
determine whether those relationships appear under every method and
whether they are similar. Our main goals are to find out if, for each of
several correlation methods and word lengths used in the
processing of DNA sequences,
(a) the MDS tool generates three-dimensional representations
featuring spatial and structural patterns;