INFORMS Journal on Computing
Vol. 16, No. 4, Fall 2004, pp. 393–405
ISSN 0899-1499, EISSN 1526-5528
DOI 10.1287/ijoc.1040.0092
© 2004 INFORMS

Protein Threading: From Mathematical Models to Parallel Implementations

Rumen Andonov
LAMIH/ROI, Université de Valenciennes, Le Mont Houy, 59313 Valenciennes Cedex 9, France, Rumen.Andonov@univ-valenciennes.fr

Stefan Balev
Laboratoire d’Informatique du Havre EA 3219, Université du Havre, 25 rue Philippe Lebon, BP 540, 76058 Le Havre cedex, France, stefan.balev@univ-lehavre.fr

Nicola Yanev
Faculty of Mathematics and Computer Science, University of Sofia, 5, J. Bouchier str., 1126 Sofia, Bulgaria, choby@math.bas.bg

This paper presents a new network-flow formulation for the problem of predicting 3D protein structures using threading. Several integer-programming models based on this formulation are proposed and compared. These models allow for an efficient decomposition and for the application of a parallel branch-and-cut algorithm, significantly reducing the running time. The efficiency of our approach has been confirmed by extensive computational experiments.

Key words: protein-threading problem; network optimization; integer programming; linear programming; large-scale problems; parallel algorithms
History: Accepted by Harvey J. Greenberg, Guest Editor; received August 2003; revised November 2003; accepted February 2004.

1. Introduction

The protein-threading problem (PTP) is widely recognized as one of the most important challenges in computational biology (Lathrop et al. 1998, Xu et al. 1998, Head-Gordon and Wooley 2001, Setubal and Meidanis 1997, Lengauer 2001). The problem consists of testing whether a target sequence (the query) is likely to fold into a 3D template structure (the core) by searching for an alignment that minimizes a suitable score function.
Proposed and accepted as an alternative to the costly and time-consuming direct methods for 3D structure recognition (such as x-ray crystallography or nuclear magnetic resonance spectroscopy), protein threading can be used for both structure and folding prediction. However, until recently, its use was seriously limited by the main component of the approach, an NP-hard optimization problem.

The following is a more formal presentation of the problem, which simultaneously introduces some existing terminology. A core is an ordered set of $m$ segments. The length of segment $i$ is $l_i$ characters. The core must be aligned to a query sequence $Q$ of length $N$ characters over the 20-character amino-acid alphabet. Let $t_i$ be the starting position of segment $i$ in $Q$. An alignment is called a feasible threading if:
• $1 \le t_1$, $t_{i-1} + l_{i-1} \le t_i$, $i = 2, \dots, m$, $t_m \le N + 1 - l_m$ (the segments preserve their order and do not overlap), and
• the lengths of uncovered character areas (called gaps or loops) $g_1 = t_1 - 1$, $g_i = t_i - t_{i-1} - l_{i-1}$, $i = 2, \dots, m$, $g_{m+1} = N + 1 - t_m - l_m$, are bounded, say $g_i^{\min} \le g_i \le g_i^{\max}$, $i = 1, \dots, m + 1$.

For the sake of simplicity, but without loss of generality, suppose that $g_i^{\min} = 0$, $g_i^{\max} = \infty$, for $i = 1, \dots, m + 1$. Using the relative positions $r_i = t_i - \sum_{j=1}^{i-1} l_j$, $i = 1, \dots, m$, the possible positions of each segment are between 1 and $n = N + 1 - \sum_{i=1}^{m} l_i$. The feasibility of a threading is provided by $1 \le r_1 \le \cdots \le r_m \le n$.

The quality (sometimes called energy) of a threading is measured by an additive function depending on the position of each segment and on the interactions between the pairs of segments belonging to a given set $L \subseteq \{(i, j) \mid 1 \le i < j \le m\}$ (the latter makes the problem intractable). The score of an alignment can be expressed by two groups of coefficients:
• $c_{ik}$, $i = 1, \dots, m$, $k = 1, \dots, n$, the score for placing segment $i$ on position $k$;