GENOMICS 46, 37–45 (1997)
ARTICLE NO. GE974984

A Tool for Analyzing and Annotating Genomic Sequences

Xiaoqiu Huang,1 Mark D. Adams,* Hao Zhou, and Anthony R. Kerlavage*

Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931; and *The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850

Received December 16, 1996; accepted August 15, 1997

1 To whom correspondence should be addressed at Department of Computer Science, Michigan Technological University, 1400 Townsend Drive, Houghton, MI 49931. Telephone: (906) 487-2123. Fax: (906) 487-2283. E-mail: huang@cs.mtu.edu.

0888-7543/97 $25.00. Copyright © 1997 by Academic Press. All rights of reproduction in any form reserved.

We describe a tool for analyzing and annotating large genomic sequences containing introns. The analysis and annotation tool (AAT) includes two sets of programs, one for comparing the query sequence with a protein database and the other for comparing the query with a cDNA database. Each set contains a fast database search program and a rigorous alignment program. The database search program quickly identifies regions of the query sequence that are similar to a database sequence. Then the alignment program constructs an optimal alignment for each region and the database sequence. The alignment program also reports the coordinates of exons in the query sequence. Pairwise alignments of the query sequence with protein and cDNA database sequences are combined into multiple sequence alignments, which provide a view of all protein and cDNA sequences matching a query region. On a data set of 570 DNA sequences, AAT identified 94% of coding nucleotides correctly and 74% of exons exactly. Results of analyzing a human BAC sequence with the AAT tool are also presented. The AAT tool reduces the labor-intensive work of locating the exons of the query sequence and improves the process of defining intron–exon boundaries by using the wealth of available protein and cDNA data. © 1997 Academic Press

INTRODUCTION

Analysis and annotation of a newly determined DNA sequence involve identifying the protein-coding regions of the sequence. The problem of identifying a coding region with multiple exons amounts to determining the exact boundaries of each exon of the coding region.

There are two complementary approaches to gene identification. One approach is based on use of sequence statistics to predict exons in the DNA sequence (see Fickett, 1996 for reviews). Burset and Guigo (1996) recently evaluated seven gene prediction programs on a data set of 570 sequences and found that the average percentage of exons exactly identified was less than 50% for most of the programs. The other approach is based on use of similarities between the DNA sequence and known protein sequences to identify exons in the DNA sequence (Gish and States, 1993; Gelfand et al., 1996; Guan and Uberbacher, 1996; Huang and Zhang, 1996; Zhang, 1996; Zhang et al., 1997). The Procrustes program (Gelfand et al., 1996) was specially designed for recognizing genes using DNA–protein matches. All the other programs are general-purpose DNA–protein comparison programs. Alignments produced by the programs show exon–intron boundaries of the DNA sequence at various levels of resolution, respectively.

The DNA–protein comparison programs differ significantly in capability and execution time. The BLASTX program (Gish and States, 1993) is perhaps the fastest program of all. The program compares a DNA sequence with a protein sequence database. It is suitable for quickly identifying protein database sequences that are similar to regions of the DNA sequence. In contrast, the NAP program (Huang and Zhang, 1996) is perhaps the slowest program of all. The program produces a high-resolution alignment between a DNA sequence and a protein sequence. It is suitable for displaying the similarity correlation between the DNA and the protein sequences and in particular for showing exon–intron boundaries of the DNA sequence at the highest level of resolution. Because of its high execution time requirement, it is not possible to use NAP on an ordinary computer to compare the DNA sequence with each protein sequence in the database. The remaining programs fall between the two programs in capability and execution time.

In this paper, we expand our work on the NAP program by developing an analysis and annotation tool (AAT) for identifying the coding regions of the DNA sequence that are similar to protein or cDNA sequences in the databases. The AAT tool makes it possible to produce accurate results at an affordable speed on an ordinary computer. The tool includes two sets of programs, one for comparing the query sequence with a protein database and the other for comparing the query with a cDNA database. Each set contains a fast database search program and a rigorous alignment program. The database search program quickly identifies