Efficient Parameterized Algorithms for Biopolymer Structure-Sequence Alignment

Yinglei Song, Chunmei Liu, Xiuzhen Huang, Russell L. Malmberg, Ying Xu, and Liming Cai

Abstract—Computational alignment of a biopolymer sequence (e.g., an RNA or a protein) to a structure is an effective approach to predict and search for the structure of new sequences. To identify the structure of remote homologs, the structure-sequence alignment has to consider not only sequence similarity, but also spatially conserved conformations caused by residue interactions and, consequently, is computationally intractable. It is difficult to cope with the inefficiency without compromising alignment accuracy, especially for structure search in genomes or large databases. This paper introduces a novel method and a parameterized algorithm for structure-sequence alignment. Both the structure and the sequence are represented as graphs, where, in general, the graph for a biopolymer structure has a naturally small tree width. The algorithm constructs an optimal alignment by finding in the sequence graph the maximum valued subgraph isomorphic to the structure graph. It has the computational time complexity $O(k^t N^2)$ for the structure of $N$ residues and its tree decomposition of width $t$. Parameter $k$, small in nature, is determined by a statistical cutoff for the correspondence between the structure and the sequence. This paper demonstrates a successful application of the algorithm to RNA structure search used for noncoding RNA identification. An application to protein threading is also discussed.

Index Terms—Structure-sequence alignment, tree decomposition, parameterized algorithm, dynamic programming, RNA structure homology search, protein threading.

1 INTRODUCTION

Structure-sequence alignment plays a central role in a number of important computational biology methods.
For instance, protein threading, an effective method to predict protein tertiary structure, is based on an alignment between the target sequence and structure templates in a template database [3], [5], [42], [20], [40]. Structure-sequence alignment is also essential to RNA structural homology search, a viable approach to annotating (and identifying new) noncoding RNAs [10], [13], [31], [24]. It also finds applications in other bioinformatics tasks where structure plays an instrumental role, such as identifying the structure of intermolecular interactions [27], [29] and discovering the structure of biological pathways through comparative genomics [8].

Structure-sequence alignment seeks an optimal way to “fit” the residues of a target sequence into the spatial positions of a structure template. To identify the structure of remote homologs, the alignment has to consider not only sequence similarity but also spatially conserved conformations caused by sophisticated interactions between residues and, consequently, is computationally intractable. For example, the problem is NP-hard both for protein threading with amino acid interactions [19] and for thermodynamic determination of RNA secondary structure that includes pseudoknots [25].

The alignment problem has often been formulated as an integer program that characterizes residue spatial interactions with (a large number of) linear inequality constraints [40], [22]. Commercial software packages for linear programming are usually used to approximate the integer program and to reduce the computation time. More sophisticated techniques, such as branch-and-cut, can be used to dynamically include only the needed linear constraints [22], [30]. Moreover, a divide-and-conquer method based on the notion of “open-links” has also been devised to address the residue-residue interaction issue [42].
For RNA structure-sequence alignment, dynamic programming has been extended to include crossing patterns of RNA nucleotide interactions [35], [7]. The above algorithmic techniques cope with the intractability of the structure-sequence alignment problem; however, most of them still require computation time that is a high-degree polynomial.

In this paper, we introduce an efficient structure-sequence alignment algorithm. Both structure and sequence are represented as mixed graphs (containing both directed and undirected edges); the optimal alignment corresponds to finding the maximum valued (subgraph) isomorphism between the structure graph and a subgraph of the sequence graph. In addition, we introduce an integer parameter $k$ to constrain the correspondence between the graphs. A dynamic programming algorithm is then developed over a tree decomposition of the structure graph. For each value of $k$, the optimal alignment can be found in time $O(k^t N^2)$ for each structure template containing $N$ residues, given a tree decomposition of tree width $t$ for the structure graph.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 4, OCTOBER-DECEMBER 2006

• Y. Song, C. Liu, and L. Cai are with the Department of Computer Science, 415 Boyd GSRC, University of Georgia, Athens, GA 30602. E-mail: {song, chunmei, cai}@cs.uga.edu.
• X. Huang is with the Department of Computer Science, Arkansas State University, State University, AR 72467. E-mail: xzhuang@csm.astate.edu.
• R.L. Malmberg is with the Department of Plant Biology, University of Georgia, Athens, GA 30602-7271. E-mail: russell@plantbio.uga.edu.
• Y. Xu is with the Department of Biochemistry and Molecular Biology, A110 Life Sciences Building, University of Georgia, 120 Green Street, Athens, GA 30602. E-mail: xyn@bmb.uga.edu.

Manuscript received 14 Feb. 2006; revised 31 May 2006; accepted 15 June 2006; published online 31 Oct. 2006.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBBSI-0015-0206.
1545-5963/06/$20.00 © 2006 IEEE
Published by the IEEE CS, CI, and EMB Societies & the ACM
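
To make the formulation in the introduction concrete, the following is a minimal brute-force sketch of the alignment objective, not the tree-decomposition algorithm of this paper. It assumes that each structure unit has at most k candidate positions in the sequence, that the directed (backbone) edges can be modeled by requiring mapped positions to increase, and that the undirected edges are a list of interacting pairs. The function name `best_alignment`, the scoring callbacks, and all numeric scores are hypothetical and chosen only for illustration.

```python
from itertools import product


def best_alignment(order, interactions, candidates, vertex_score, pair_score):
    """Exhaustively search candidate maps for the maximum-valued alignment.

    order        -- structure units in backbone order (assumed stand-in
                    for the directed edges of the structure graph)
    interactions -- list of (u, v) pairs (assumed stand-in for the
                    undirected edges of the structure graph)
    candidates   -- dict: unit -> list of at most k sequence positions
    vertex_score -- callable (unit, position) -> float
    pair_score   -- callable (u, v, pos_u, pos_v) -> float
    """
    best_value, best_map = float("-inf"), None
    # Enumerate every way to pick one candidate position per structure unit.
    for choice in product(*(candidates[u] for u in order)):
        # Respect the backbone: mapped positions must strictly increase.
        if any(a >= b for a, b in zip(choice, choice[1:])):
            continue
        mapping = dict(zip(order, choice))
        value = sum(vertex_score(u, p) for u, p in mapping.items())
        value += sum(
            pair_score(u, v, mapping[u], mapping[v]) for u, v in interactions
        )
        if value > best_value:
            best_value, best_map = value, mapping
    return best_value, best_map


# Toy instance with three structure units and k = 2 candidates each.
order = ["a", "b", "c"]
interactions = [("a", "c")]  # one interacting pair
candidates = {"a": [0, 2], "b": [3, 5], "c": [6, 8]}

# Hypothetical per-unit scores for each candidate position.
table = {
    ("a", 0): 1.0, ("a", 2): 2.0,
    ("b", 3): 1.0, ("b", 5): 0.5,
    ("c", 6): 1.0, ("c", 8): 1.5,
}


def vertex_score(unit, pos):
    return table.get((unit, pos), 0.0)


def pair_score(u, v, pos_u, pos_v):
    # Hypothetical rule: reward the pair when its positions are 6 apart.
    return 3.0 if pos_v - pos_u == 6 else 0.0


value, mapping = best_alignment(
    order, interactions, candidates, vertex_score, pair_score
)
print(value, mapping)  # 7.5 {'a': 2, 'b': 3, 'c': 8}
```

This exhaustive search examines up to k^n candidate maps for n structure units, which is why it is useful only as a statement of the objective. The dynamic program over a tree decomposition described above is what reduces the cost to $O(k^t N^2)$ when the tree width $t$ is small.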