Dana Haley-Vicente, Velin Spassov, Tina Yeh, Ken Butenhof, Christoph Schneider, Azat Badretdinov, Lisa Yan
Accelrys Inc., 9685 Scranton Road, San Diego, CA 92121

Structural Prediction and Functional Annotation of Proteomic Sequences using GeneAtlas™

We have used GeneAtlas™ to provide functional annotation of proteomic sequence data, including structural prediction. GeneAtlas is an automated, high-throughput pipeline for the prediction of protein structure and function using sequence similarity detection, homology modeling, and fold recognition methods. Through template searching, GeneAtlas looks for relationships between query sequences and known protein structures, motifs, and folds. Subsequent inference and assignment of the target protein's function are based on its homology to the experimentally derived template protein and on the models generated as part of the pipeline.

Using CASP5 targets as query sequences, we demonstrate that GeneAtlas, via its high-throughput modeling component, detects relationships that the sequence searching method PSI-BLAST alone does not. Furthermore, functionally related proteins with sequence identity below the twilight zone can be recognized correctly.

In addition, some targets were selected to test two new methods that we have developed, ChiRotor and Looper, for side-chain and loop prediction. ChiRotor is a fast algorithm that predicts the conformation of all or part of the amino-acid side chains with an average RMSD of about 1 Å for the core residues. The loop-modeling program, Looper, produces a number of energy-minimized loop backbone conformations ranked according to force-field energy terms. Both algorithms combine a discrete search in dihedral angle space with CHARMm energy minimization.

GeneAtlas™ is an automated protein annotation pipeline for analyzing protein sequences and identifying their biochemical function.
GeneAtlas: High Throughput Functional Annotation Pipeline

The GeneAtlas pipeline automates and integrates several steps into one seamless operation, condensing the explosion of genomic data into information and knowledge. As shown in Figure 1 below, the protein sequences are run through a series of methods [1].

Domain Analysis: For sequence domain analysis we use the Hidden Markov Model (HMMer) algorithm to identify domains by comparison to PFAM.

Similarity Search: Before the search for similar sequences, a number of filters are applied, including the masking of regions of low sequence complexity. The sequence similarity searching component consists of a modified version of PSI-BLAST that includes a forward and a reverse search method. This component has been optimized in a variety of ways to minimize the rate of false positives [1].

High Throughput Modeling: This method comprises several steps, which are based on the work of Dr. Andrej Šali and his lab at Rockefeller University [2]. A putative homology relationship between a query sequence and a template from the PDB is confirmed on the basis of the quality of the resulting homology model, rather than solely on the level of sequence identity between the sequence and the template. Accelrys' PSI-BLAST protocol is used to search for relationships between query sequences and known protein structures stored in the RCSB (PDB) database. Protein models are then generated using MODELER with the PSI-BLAST alignment. The last step is validation of the models: GeneAtlas currently employs both the patented Profiles 3D/Verify technology [3] and an algorithm developed by Andrej Šali to test whether the protein is reasonably folded and is a valid model.

Figure 1: Schematic representation of the GeneAtlas™ pipeline, a high throughput pipeline for functional annotation of protein sequences. The resulting annotations are stored in Discovery Studio AtlasStore™.
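The low-complexity masking applied before the similarity search can be illustrated with a minimal sketch. The text does not describe the filter GeneAtlas actually uses, so the version below swaps in a simple sliding-window Shannon entropy test; the function names, window length, entropy threshold, and mask character are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2, mask_char: str = "X") -> str:
    """Mask every residue that lies in a window whose entropy is below threshold.

    Hypothetical stand-in for a low-complexity filter; the window length and
    threshold are illustrative values, not parameters taken from GeneAtlas.
    """
    seq = seq.upper()
    masked = [False] * len(seq)
    # Slide a fixed-length window along the sequence, one residue at a time.
    for start in range(max(len(seq) - window + 1, 0)):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(mask_char if m else r for r, m in zip(seq, masked))
```

A sequence with a long single-residue run, such as a poly-glutamine stretch, has its run (and part of the flanking residues) replaced by `X`, while a compositionally diverse sequence passes through unchanged. Sequences shorter than the window are returned unmasked in this sketch.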
[Figure 1 labels: Protein Sequence Data; Domain Analysis (HMM); Similarity Search; High-Throughput Modeling; SeqFold; 3D Annotations; DS AtlasStore™]

SeqFold: SeqFold is a fold recognition method from Accelrys, originally developed in the laboratory of Dr. David Eisenberg [4]. As with the similarity search method, this component has been optimized in a variety of ways to minimize the rate of false positives [1].

3D Annotations: In addition to including the active site annotation from the PDB SITE record, Accelrys has developed algorithms that locate potential binding sites using a structural template method, which identifies three-dimensional features known to confer function (e.g. serine protease catalytic triad, metal binding site, ATP binding site). The extracted structural patterns form a library of 3D pharmacophore-style templates that can in turn be used to characterize new protein structures, much as Prosite is used to characterize sequences.

DS AtlasStore™: The results are initially output in flat file format and then loaded into DS AtlasStore™, which is designed to store protein sequences, 3-D structures, and related functional annotations derived using the methods contained in the GeneAtlas pipeline.
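The structural template idea described under 3D Annotations can be sketched as a search for residue combinations whose pairwise distances match a stored pattern. This is only an illustration under stated assumptions: the template format, the `find_matches` function, the use of one representative atom per residue, and the distance and tolerance values are hypothetical placeholders, not the Accelrys template library or its matching algorithm.

```python
import math
from itertools import product

# Hypothetical template: ordered residue types plus target distances (in Å)
# between one representative atom per residue. The numbers are placeholders
# for illustration, not measured catalytic-triad geometry.
TRIAD_TEMPLATE = {
    "types": ("SER", "HIS", "ASP"),
    "distances": {(0, 1): 3.0, (1, 2): 3.0, (0, 2): 6.0},
    "tolerance": 0.5,
}


def find_matches(residues, template):
    """Return residue-id tuples whose geometry matches the template.

    residues: list of (residue_name, residue_id, (x, y, z)) entries, with one
    representative atom coordinate per residue (an assumed input format).
    """
    # Candidate residues for each template position, selected by residue type.
    candidates = [[r for r in residues if r[0] == t] for t in template["types"]]
    hits = []
    for combo in product(*candidates):
        # Every templated pair must lie within tolerance of its target distance.
        if all(
            abs(math.dist(combo[i][2], combo[j][2]) - d) <= template["tolerance"]
            for (i, j), d in template["distances"].items()
        ):
            hits.append(tuple(r[1] for r in combo))
    return hits
```

Each hit could then be recorded as a functional annotation on the structure. The exhaustive product over candidates is adequate for small templates like this one; a larger template library would likely need spatial indexing, which this sketch omits.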