A Physicochemical Model for Analyzing DNA Sequences

Samrat Dutta, Poonam Singhal, Praveen Agrawal, Raju Tomer, Kritee, Ekta Khurana, and B. Jayaram*

Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India

Received April 11, 2005

In search of an ab initio model to characterize DNA sequences as genes and nongenes, we examined physicochemical properties of each trinucleotide (codon) that could accomplish this task. We constructed three-dimensional vectors for each double-helical trinucleotide sequence considering hydrogen-bonding energy, stacking energy, and a third parameter, which we provisionally identified with DNA-protein interactions. If the underlying model has merit, then as this three-dimensional vector moves along any genome, the net orientation of the resultant vector should differ significantly between gene and nongene regions, making a distinction feasible. An analysis of 331 prokaryotic genomes comprising a total of 294 786 experimentally verified genes (nonoverlapping) and an equal number of nongenes presents a proof of concept of the model without the need for further parametrization. Also, initial analyses on Saccharomyces cerevisiae and Arabidopsis thaliana suggest that the methodology is extendable to eukaryotes. The physicochemical model introduced (ChemGenome1.0) has the potential to be developed into a gene-finding algorithm and, more pressingly, could be employed for an independent assessment of the annotation of DNA sequences.

I. INTRODUCTION

The regulation of gene expression is a matter of chemistry between DNA and proteins at the molecular level.
While remarkable advances have been made over the past two decades in the analysis of DNA sequences, and in gene prediction in particular, via statistical and mathematical models and artificial intelligence techniques based on genome, gene, cDNA, and protein sequence databases and the clever design of computational protocols (refs 1-28), an expeditious in silico gene-finding model that directly captures the physicochemical properties intrinsic to DNA sequences and the chemistry of protein-DNA interactions remains a goal yet to be realized. Proceeding along these lines, we sought simplifying universal principles that decide “what can be a gene” in any species. Working with the hypothesis that both the structure of the DNA and its interactions with regulatory proteins and polymerases decide the function of a DNA sequence, we developed a simple three-parameter model based on Watson-Crick hydrogen-bonding energy, base-pair stacking energy, and a third parameter, which we provisionally identified with DNA-protein interactions. Each of these parameters acts as a dimension for a three-dimensional unit vector, whose orientation differs for each trinucleotide. The premise that the cumulative vectors for gene and nongene regions should differ in orientation (Figure 1) stands verified on 331 prokaryotic genomes and 21 eukaryotic genomes. We introduce here the physicochemical model for analyzing DNA sequences, present a series of validation tests on a large number of genomes, and examine its merits, its limitations, and its potential utility in genome analyses.

II. METHODS

The physicochemical model proposed involves developing a three-dimensional (3-D) vector for double-helical deoxyribonucleic acid (DNA) base sequences, with each dimension representing one facet of DNA recognition (ref 29) by proteins. Each of the 64 trinucleotides is assigned three coordinates,

* Author to whom correspondence should be addressed.
Tel.: +91-11-2659 1505, +91-11-2659 6786. Fax: +91-11-2658 2037. E-mail: bjayaram@chemistry.iitd.ac.in.

Figure 1. Physicochemical model for analyzing DNA sequences and the hypothesis for genome characterization as genes and nongenes.

J. Chem. Inf. Model. 2006, 46, 78-85. 10.1021/ci050119x CCC: $33.50 © 2006 American Chemical Society. Published on Web 10/12/2005
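As a rough illustration of the idea outlined so far, the following Python sketch accumulates per-trinucleotide unit vectors along a sequence and compares the orientation of the resultant vectors for two regions. The coordinates returned by `toy_codon_vectors` are arbitrary placeholders, not the hydrogen-bonding, stacking, or DNA-protein interaction parameters of the model. The function names, the non-overlapping reading-frame walk, and the use of the angle between resultants as the comparison are likewise assumptions made only for this example.

```python
import itertools
import math

BASES = "ACGT"


def toy_codon_vectors():
    """Return a placeholder unit vector for each of the 64 trinucleotides.

    ASSUMPTION: these coordinates are arbitrary and deterministic. They are
    NOT the physicochemical parameters of the model described in the text.
    """
    table = {}
    codons = ("".join(p) for p in itertools.product(BASES, repeat=3))
    for i, codon in enumerate(codons):
        # Crude GC-count stand-in for the first axis; never zero, so norm > 0.
        x = codon.count("G") + codon.count("C") - 1.5
        y = math.sin(i)        # arbitrary placeholder
        z = math.cos(2 * i)    # arbitrary placeholder
        norm = math.sqrt(x * x + y * y + z * z)
        table[codon] = (x / norm, y / norm, z / norm)
    return table


def cumulative_vector(sequence, table, frame=0):
    """Sum the vectors of successive non-overlapping triplets in one frame.

    Triplets missing from the table (for example, those containing N) are
    skipped. The non-overlapping walk is an assumption of this sketch.
    """
    total = [0.0, 0.0, 0.0]
    seq = sequence.upper()
    for start in range(frame, len(seq) - 2, 3):
        vec = table.get(seq[start:start + 3])
        if vec is None:
            continue
        for axis in range(3):
            total[axis] += vec[axis]
    return tuple(total)


def angle_between(u, v):
    """Angle in degrees between two 3-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("orientation is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    table = toy_codon_vectors()
    # Two made-up sequences standing in for a "gene" and a "nongene" region.
    region_a = cumulative_vector("ATGGCGGCCGGC", table)
    region_b = cumulative_vector("ATATTTAATTAA", table)
    print(angle_between(region_a, region_b))
```

With real parameters in place of the placeholder table, a large and consistent angle between the resultants for annotated gene and nongene regions would be the kind of separation the hypothesis in Figure 1 anticipates. The toy values here carry no such meaning.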