REFERENCE MATTERS: AN EFFICIENT AND SCALABLE ALGORITHM FOR LARGE MULTIPLE STRUCTURE ALIGNMENT

Jose S. Hleap, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, B3H 4R2, Canada, jshleap@dal.ca
Khanh N. Nguyen, Department of Computer Science, Dalhousie University, Halifax, NS, B3H 4R2, Canada, knguyen@cs.dal.ca
Alex Safatli, Department of Computer Science, Dalhousie University, Halifax, NS, B3H 4R2, Canada, asafatli@dal.ca
Christian Blouin, Department of Computer Science, Dalhousie University, Halifax, NS, B3H 4R2, Canada, cblouin@cs.dal.ca

Abstract
A central strategy in structural biology is to find the optimal alignment for a set of many homologous protein structures. This task is not trivial and is still not particularly scalable to large datasets. Several structural alignment programs have been developed in the last few years. They differ mainly in the definition of homology or in the optimization of the fit among structures. We propose here a practical and scalable strategy to align large datasets of protein structures. This strategy is based on aligning n-1 structures against a single reference structure. Here we show that 1) selecting the best reference from a dataset significantly affects the overall RMSD of the alignment; 2) although searching for the best reference in big datasets is an O(An^2) problem (A being the complexity of a pairwise alignment), it is possible to define a heuristic to select a suitable reference structure from a large dataset. We test these claims using a large dataset, the GP120 family, which is rich in disordered regions. We also tested our strategy on a large number of alignments from the SABmark benchmark. Both experiments showed that our method performs as well as or better than traditional Multiple Structure Alignment, while being faster and capable of efficiently aligning a much larger number of homologs.
1 BACKGROUND
To understand the evolution of protein structure and function, the alignment of homologous structures is a vital step [9]. Protein structure comparison is also important because it allows the exploration of more distant relationships than is possible with sequence comparison alone [10]. To date, the gold standards for protein structure comparisons are manually curated databases [5] like SCOP [1] or CATH [7], and a more recently developed benchmark that combines the two (SCOPCath) [6]. However, manual curation is a very time-consuming process in which discrepancies in methodological assumptions may lead to errors in interpretation [6]. It is increasingly difficult to manually curate the corpus of protein structures. As of November 20, 2012, 86344 structures are deposited in the PDB [http://www.rcsb.org/], while SCOP considers only 38221 structures [official release, http://scop.mrc-lmb.cam.ac.uk/scop/] and CATH 51334 structures. Structure alignment and assignment of structural similarities are not trivial tasks [5], and more work is needed to scale the process of multiple protein structure alignment.
Many approaches to protein structure alignment have been proposed [17]. General strategies such as rigid body superimposition, flexible alignment and elastic aligners are the most widely used [5]. Each software implementation deals with different aspects of the protein alignment problem, but most of them use heuristics to find a suitable alignment, since evaluating similarities across all possible interactions among pairs is an NP-hard problem [4, 9]. Another problem is that most software is highly sensitive to the reference structure [9] seeding the alignment. Therefore, if the "wrong" reference is selected, the overall quality of the multiple structure alignment (MStA) is affected. An issue that arises from this problem is how to find the "correct" or "best" structure.
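To make the cost of an exhaustive reference search concrete, the following minimal sketch scores every structure as a candidate reference by its mean pairwise cost against all others. It is an illustration only, not the method proposed in this paper: the function names best_reference and toy_rmsd are hypothetical, the pairwise cost is assumed to be symmetric, and toy_rmsd is a stand-in that computes an RMSD over equal-length coordinate lists without any superposition or residue matching. With n structures the loop makes n(n-1)/2 pairwise calls, so the total cost grows as O(An^2) when each call costs A.

```python
import math
from itertools import combinations


def toy_rmsd(a, b):
    """Hypothetical stand-in for a real pairwise aligner.

    Computes the RMSD between two equal-length lists of (x, y, z)
    coordinates WITHOUT superposition; a real aligner would first
    establish residue correspondences and superimpose the structures.
    """
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def best_reference(structures, pairwise=toy_rmsd):
    """Exhaustively pick the structure with the lowest mean pairwise cost.

    Assumes `pairwise` is symmetric, so each unordered pair is evaluated
    once. Returns (index of best reference, its mean cost, number of
    pairwise calls made).
    """
    n = len(structures)
    totals = [0.0] * n
    calls = 0
    for i, j in combinations(range(n), 2):
        d = pairwise(structures[i], structures[j])
        calls += 1
        totals[i] += d
        totals[j] += d
    means = [t / (n - 1) for t in totals]
    best = min(range(n), key=means.__getitem__)
    return best, means[best], calls


if __name__ == "__main__":
    # Five one-atom toy "structures" placed along the x axis.
    structs = [[(x, 0.0, 0.0)] for x in (0.0, 1.0, 2.0, 4.0, 10.0)]
    print(best_reference(structs))
```

In this toy input the third structure (index 2) is the most central one, and the search needs 10 pairwise evaluations for 5 structures; the quadratic growth of that count is what makes a cheaper selection heuristic attractive for large datasets.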
Here we present an algorithm to produce alignment on large sets of homologous structures by finding a rea-