A Physicochemical Model for Analyzing DNA Sequences
Samrat Dutta, Poonam Singhal, Praveen Agrawal, Raju Tomer, Kritee, Ekta Khurana, and
B. Jayaram*
Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology,
Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
Received April 11, 2005
In search of an ab initio model to characterize DNA sequences as genes and nongenes, we examined
physicochemical properties of each trinucleotide (codon) that could accomplish this task. We constructed
three-dimensional vectors for each double-helical trinucleotide sequence considering hydrogen-bonding energy,
stacking energy, and a third parameter, which we provisionally identified with DNA-protein interactions.
As this three-dimensional vector moves along any genome, the net orientation of the resultant vector should
differ significantly for gene and nongene regions to make a distinction feasible, if the underlying model has
some merits. An analysis of 331 prokaryotic genomes comprising a total of 294 786 experimentally verified
genes (nonoverlapping) and an equal number of nongenes presents a proof of concept of the model without
the need for further parametrization. Also, initial analyses on Saccharomyces cerevisiae and Arabidopsis
thaliana suggest that the methodology is extendable to eukaryotes. The physicochemical model
(ChemGenome1.0) introduced has the potential to be developed into a gene-finding algorithm and, more pressingly,
could be employed for an independent assessment of the annotation of DNA sequences.
I. INTRODUCTION
The regulation of gene expression is a matter of chemistry
between DNA and proteins at the molecular level. While
remarkable advances have been made over the past two
decades in the analysis of DNA sequences and in gene
prediction in particular, via statistical and mathematical
models and artificial intelligence techniques based on genome,
gene, cDNA, and protein sequence databases and the
clever design of computational protocols (refs 1-28), an expeditious
in silico gene-finding model which directly captures the
physicochemical properties intrinsic to DNA sequences and
the chemistry of protein-DNA interactions remains a goal
yet to be realized. Proceeding along these lines, we sought
simplifying universal principles that decide
“what can be a gene” in any species.
Working with the hypothesis that both the structure of the
DNA and its interactions with regulatory proteins and
polymerases decide the function of a DNA sequence, we
developed a simple three-parameter model based on
Watson-Crick hydrogen-bonding energy, base-pair stacking
energy, and a third parameter which we provisionally
identified with DNA-protein interactions. Each of these
parameters acts as a dimension for a three-dimensional unit
vector, whose orientation differs for each trinucleotide. The
premise that the cumulative vectors for gene and nongene
regions should differ in orientation (Figure 1) stands verified
on 331 prokaryotic genomes and 21 eukaryotic genomes.
We introduce, here, the physicochemical model for analyzing
DNA sequences, present a series of validation tests on a large
number of genomes, and examine its merits and limitations
and its potential utility in genome analyses.
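
The premise above can be illustrated with a short Python sketch. It is not the ChemGenome implementation. The coordinates in `toy_codon_table` are placeholders: only the hydrogen-bond count (3 per G-C pair, 2 per A-T pair) is a standard fact, while the stacking and interaction values are arbitrary numbers chosen so that each trinucleotide gets a distinct direction. The function names, the non-overlapping codon stepping, and the use of the angle between resultant vectors as a measure of orientation difference are likewise assumptions made for illustration.

```python
import itertools
import math


def unit(v):
    """Scale a 3-D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def toy_codon_table():
    """Map each of the 64 trinucleotides to a 3-D unit vector.

    PLACEHOLDER VALUES: these are not the parameters of the model.
    """
    table = {}
    codons = ("".join(p) for p in itertools.product("ACGT", repeat=3))
    for i, codon in enumerate(codons):
        # Watson-Crick hydrogen bonds: 3 for G-C pairs, 2 for A-T pairs.
        hbond = sum(3 if base in "GC" else 2 for base in codon)
        stacking = (i % 8) - 3.5      # arbitrary placeholder
        interaction = (i // 8) - 3.5  # arbitrary placeholder
        table[codon] = unit((hbond, stacking, interaction))
    return table


def resultant_vector(sequence, table, frame=0):
    """Sum the unit vectors of successive non-overlapping trinucleotides."""
    seq = sequence.upper()
    total = [0.0, 0.0, 0.0]
    for start in range(frame, len(seq) - 2, 3):
        vec = table.get(seq[start:start + 3])
        if vec is None:
            continue  # skip trinucleotides containing ambiguous bases such as N
        for k in range(3):
            total[k] += vec[k]
    return tuple(total)


def angle_between(u, v):
    """Angle in degrees between two non-zero 3-D vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


if __name__ == "__main__":
    table = toy_codon_table()
    # Two arbitrary strings standing in for a gene and a nongene region.
    region_a = "ATGGCCGCGGGCTAA"
    region_b = "TTATATAAATTTATA"
    angle = angle_between(resultant_vector(region_a, table),
                          resultant_vector(region_b, table))
    print(f"angle between resultant vectors: {angle:.1f} degrees")
```

With real parameters in place of the placeholder table, the same summation would give the cumulative vectors whose orientations the hypothesis compares for gene and nongene regions.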
II. METHODS
The physicochemical model proposed involves developing
a three-dimensional (3-D) vector for double-helical deoxyribonucleic
acid (DNA) base sequences, with each dimension
representing one facet of DNA recognition (ref 29) by proteins.
Each of the 64 trinucleotides is assigned three coordinates,
* Author to whom correspondence should be addressed. Tel.: +91-11-2659 1505,
+91-11-2659 6786. Fax: +91-11-2658 2037. E-mail:
bjayaram@chemistry.iitd.ac.in.
Figure 1. Physicochemical model for analyzing DNA sequences
and the hypothesis for genome characterization as genes and
nongenes.
J. Chem. Inf. Model. 2006, 46, 78-85
DOI: 10.1021/ci050119x. © 2006 American Chemical Society
Published on Web 10/12/2005