INFORMS Journal on Computing
Vol. 16, No. 4, Fall 2004, pp. 393–405
issn 0899-1499 eissn 1526-5528 04 1604 0393
doi 10.1287/ijoc.1040.0092
©2004 INFORMS
Protein Threading: From Mathematical Models to
Parallel Implementations
Rumen Andonov
LAMIH/ROI, Université de Valenciennes, Le Mont Houy, 59313 Valenciennes Cedex 9, France,
Rumen.Andonov@univ-valenciennes.fr
Stefan Balev
Laboratoire d’Informatique du Havre EA 3219, Université du Havre, 25 rue Philippe Lebon, BP 540,
76058 Le Havre cedex, France, stefan.balev@univ-lehavre.fr
Nicola Yanev
Faculty of Mathematics and Computer Science, University of Sofia, 5, J. Bouchier str.,
1126 Sofia, Bulgaria, choby@math.bas.bg
This paper presents a new network-flow formulation for the problem of predicting 3D protein structures using
threading. Several integer-programming models based on this formulation are proposed and compared.
These models allow for an efficient decomposition and for the application of a parallel branch-and-cut
algorithm, significantly reducing the running time. The efficiency of our approach has been confirmed by extensive
computational experiments.
Key words : protein-threading problem; network optimization; integer programming; linear programming;
large-scale problems; parallel algorithms
History : Accepted by Harvey J. Greenberg, Guest Editor; received August 2003; revised November 2003;
accepted February 2004.
1. Introduction
The protein-threading problem (PTP) is widely recognized
as one of the most important challenges in
computational biology (Lathrop et al. 1998, Xu et al.
1998, Head-Gordon and Wooley 2001, Setubal and
Meidanis 1997, Lengauer 2001). The problem consists
of testing whether a target sequence query is likely to
fold into a 3D template structure core by searching for
an alignment that minimizes a suitable score function.
Proposed and accepted as an alternative to the costly
and time-consuming direct methods for 3D structure
recognition (such as x-ray crystallography or nuclear
magnetic resonance spectroscopy), protein threading
can be used for both structure and folding prediction.
Until recently, however, its use was seriously limited
because the main component of the approach is an NP-hard
optimization problem.
The following is a more formal presentation of
the problem, which simultaneously introduces some
existing terminology. A core is an ordered set of
m segments. The length of segment i is $l_i$ characters.
The core must be aligned to a query sequence Q of
length N characters over the 20-character amino-acid
alphabet. Let $t_i$ be the starting position of segment i
in Q. An alignment is called a feasible threading if:
• $1 \le t_1$, $t_{i-1} + l_{i-1} \le t_i$ for $i = 2, \dots, m$, and $t_m \le N + 1 - l_m$
(the segments preserve their order and do not overlap), and
• the lengths of uncovered character areas (called
gaps or loops), $g_1 = t_1 - 1$, $g_i = t_i - t_{i-1} - l_{i-1}$ for $i = 2, \dots, m$, and
$g_{m+1} = N + 1 - t_m - l_m$, are bounded, say
$g_i^{\min} \le g_i \le g_i^{\max}$ for $i = 1, \dots, m + 1$.
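The two conditions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `gaps` and `is_feasible_threading` are hypothetical, positions are 1-based as in the text, and missing gap bounds default to 0 and N.

```python
from typing import List, Optional, Sequence


def gaps(t: Sequence[int], l: Sequence[int], N: int) -> List[int]:
    """Return the gap lengths g_1, ..., g_{m+1} for 1-based start positions t."""
    m = len(t)
    g = [t[0] - 1]                          # g_1 = t_1 - 1
    for i in range(1, m):
        g.append(t[i] - t[i - 1] - l[i - 1])  # g_i = t_i - t_{i-1} - l_{i-1}
    g.append(N + 1 - t[m - 1] - l[m - 1])   # g_{m+1} = N + 1 - t_m - l_m
    return g


def is_feasible_threading(
    t: Sequence[int],
    l: Sequence[int],
    N: int,
    g_min: Optional[Sequence[int]] = None,
    g_max: Optional[Sequence[int]] = None,
) -> bool:
    """Check order, non-overlap, and gap bounds for a candidate threading."""
    m = len(t)
    lo = g_min if g_min is not None else [0] * (m + 1)  # assumed default bound
    hi = g_max if g_max is not None else [N] * (m + 1)  # assumed default bound
    g = gaps(t, l, N)
    # Non-negative gaps are equivalent to the order/non-overlap condition.
    return all(gi >= 0 and lo[i] <= gi <= hi[i] for i, gi in enumerate(g))
```

For example, with segment lengths `l = [3, 2]` and `N = 10`, the start positions `t = [1, 4]` give gaps `[0, 0, 5]` and are feasible, whereas `t = [1, 3]` makes the two segments overlap (a gap of -1) and is rejected.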
For the sake of simplicity, but without loss of generality,
suppose that $g_i^{\min} = 0$ and $g_i^{\max} = \infty$ for
$i = 1, \dots, m + 1$. Using the relative positions
$r_i = t_i - \sum_{j=1}^{i-1} l_j$, $i = 1, \dots, m$,
the possible positions of each segment
are between 1 and $n = N + 1 - \sum_{i=1}^{m} l_i$. The
feasibility of a threading is then expressed by
$1 \le r_1 \le \cdots \le r_m \le n$.
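The change of variables to relative positions can be illustrated with a short Python sketch. The helper names below are hypothetical. The count in `count_threadings` is the standard number of nondecreasing sequences of length m over {1, ..., n}, which applies only under the simplifying assumption of unrestricted gaps.

```python
from itertools import combinations_with_replacement
from math import comb
from typing import List, Sequence, Tuple


def to_relative(t: Sequence[int], l: Sequence[int]) -> List[int]:
    """r_i = t_i - (l_1 + ... + l_{i-1}), with 1-based positions."""
    r, offset = [], 0
    for ti, li in zip(t, l):
        r.append(ti - offset)
        offset += li
    return r


def to_absolute(r: Sequence[int], l: Sequence[int]) -> List[int]:
    """Inverse of to_relative: t_i = r_i + (l_1 + ... + l_{i-1})."""
    t, offset = [], 0
    for ri, li in zip(r, l):
        t.append(ri + offset)
        offset += li
    return t


def is_feasible_relative(r: Sequence[int], n: int) -> bool:
    """Check 1 <= r_1 <= ... <= r_m <= n."""
    return r[0] >= 1 and r[-1] <= n and all(a <= b for a, b in zip(r, r[1:]))


def count_threadings(m: int, n: int) -> int:
    """Number of nondecreasing sequences of length m over {1, ..., n}."""
    return comb(n + m - 1, m)


def all_relative_threadings(m: int, n: int) -> List[Tuple[int, ...]]:
    """Enumerate every feasible relative threading (small instances only)."""
    return list(combinations_with_replacement(range(1, n + 1), m))
```

With `l = [3, 2]` and `N = 10` we have `n = 6`; the absolute positions `t = [2, 9]` map to `r = [2, 6]`, and `count_threadings(2, 6)` agrees with the length of the explicit enumeration. The binomial growth of this count is one way to see why exhaustive enumeration is impractical for realistic m and n.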
The quality (sometimes called energy) of a threading
is measured by an additive function depending
on the position of each segment and on the interactions
between the pairs of segments belonging to a
given set $L \subseteq \{(i, j) \mid 1 \le i < j \le m\}$ (the latter makes
the problem intractable). The score of an alignment
can be expressed by two groups of coefficients:
• $c_{ik}$, $i = 1, \dots, m$, $k = 1, \dots, n$, the score for placing
segment i on position k;