Appl Bioinformatics 2004; 3 (2-3): 149-158
ORIGINAL RESEARCH
1175-5636/04/0002-0149/$31.00/0
© 2004 Adis Data Information BV. All rights reserved.

MSAT
A Multiple Sequence Alignment Tool Based on TOPS

Te Ren, Mallika Veeramalai, Aik Choon Tan and David Gilbert
Department of Computer Science, Bioinformatics Research Centre, University of Glasgow, Glasgow, UK

Abstract

This article describes the development of a new method for multiple sequence alignment based on fold-level protein structure alignments, which provides an improvement in accuracy compared with the most commonly used sequence-only-based techniques. This method integrates the widely used, progressive multiple sequence alignment approach ClustalW with the Topology of Protein Structure (TOPS) topology-based alignment algorithm. The TOPS approach produces a structural alignment for the input protein set by using a topology-based pattern discovery program, providing a set of matched sequence regions that can be used to guide a sequence alignment using ClustalW. The resulting alignments are more reliable than a sequence-only alignment, as determined by 20-fold cross-validation with a set of 106 protein examples from the CATH database, distributed in seven superfold families. The method is particularly effective for sets of proteins that have similar structures at the fold level but low sequence identity. The aim of this research is to contribute towards bridging the gap between protein sequence and structure analysis, in the hope that this can be used to assist the understanding of the relationship between sequence, structure and function. The tool is available at http://balabio.dcs.gla.ac.uk/msat/.

The number of known structures in the Protein Data Bank (PDB) is increasing rapidly, due in particular to the aim of the structural genomics [1,2] consortium to populate protein fold space using high-throughput experimental technologies. Most of these projects focus on proteins whose fold cannot be easily recognised by simple sequence comparison with proteins of known structure. [3] Recent research in structural biology has contributed to our understanding of the relationships between amino acid sequences and protein structures and between different protein structures. [1] One well accepted observation is that the structure of a protein is more conserved than its underlying amino acid sequence. Hence, learning the similarities (or differences) between protein structures is very important in understanding the relationship between protein sequence, structure and function, and for the analysis of possible evolutionary relationships.

Several protein structure databases such as SCOP, [4] CATH [5] and DALI, [6] ranging from being manually curated to fully automated, have been created to further our understanding about the relationships between protein sequence and structure. We employed the CATH classification scheme and domain assignments as our primary data source, but our approach can be adapted to use other structural classification schemes.

Most computational tools developed for protein fold prediction are primarily based on sequence identity. If a new protein sequence with an unknown structure has high sequence identity to a protein of known structure, then the new protein may share a similar fold with this structure. Closely related proteins can be detected by comparing their sequences using standard tools from bioinformatics such as BLAST, [7] PSI-BLAST [8] and FASTA. [9]

To obtain the maximum benefit from the wealth of known protein structures, fast and sensitive methods should be used to classify the PDB into fold families. Unfortunately, structure comparison based on 3-dimensional (3-D) coordinates is expensive in terms of computational power and time. In order to reduce the computational time, several heuristic methods have been proposed in the literature. TOPS is one such approach that employs machine learning and heuristic algorithms to discover common structural patterns (or motifs) and enables these patterns to be matched to a set of TOPS descriptions. [10-12] The advantage of this system is the simplicity of the representation of the protein structure in the topological model, where, at this level of abstraction, only the