MODELING AND SOLVING STRING SELECTION PROBLEMS

C.N. MENESES, P.M. PARDALOS, M.G.C. RESENDE, AND A. VAZACOPOULOS

ABSTRACT. We consider four important combinatorial problems that arise in computational biology and show how they can be modeled as integer programming problems. These models are then solved using branch-and-bound algorithms. Computational experiments using real and simulated data are performed and the effectiveness of the algorithms is analyzed.

1. INTRODUCTION

In this paper we study four string selection problems that arise in computational biology applications. We show how to model these problems using Integer Programming (IP) and we carry out computational experiments using these models. The experimental results show the effectiveness of IP techniques for solving problems in computational biology. All tested instances were solved to optimality within a few minutes on a personal computer.

In Section 2 we define the problems studied and in Section 3 we introduce notation. In Section 4 we formulate each problem described in Section 2 using integer programming. In Section 5 we present computational experiments on real and simulated instances. Finally, concluding remarks are given in Section 6.

2. DEFINITION OF THE PROBLEMS

In this section we define the problems studied in this paper. For any two strings $s$ and $t$ of the same length (i.e., $|s| = |t|$), we denote by $d_H(s, t)$ the Hamming distance between them, defined as the number of positions at which the two strings differ. For example, if $s$ = "ACT" and $t$ = "CCA", then $d_H(s, t) = 2$, since the strings differ in the first and third positions.

Closest Substring Problem (CSSP)
Instance: A finite set $S_c = \{s_1, s_2, \ldots, s_n\}$ of strings, each of length at least $m$, over an alphabet $A$.
Objective: Find a string $x$ of length $m$ over $A$ minimizing $d_c$ such that, for every string $s_i$ in $S_c$, $d_H(x, y) \le d_c$ holds for some length-$m$ substring $y$ of $s_i$.
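The Hamming distance and the CSSP objective can be illustrated with a minimal Python sketch. The function names (`hamming`, `cssp_radius`, `cssp_brute_force`) are hypothetical and do not come from the paper. The exhaustive enumeration is an assumption made for illustration only: it is not the integer programming approach studied in this paper and is practical only for very small $m$ and alphabets.

```python
from itertools import product


def hamming(s, t):
    """Number of mismatched positions between two equal-length strings."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def cssp_radius(x, strings):
    """Smallest d_c for which the candidate x is feasible for CSSP.

    For each s_i we take its closest length-m substring to x;
    d_c is the largest of these per-string distances.
    """
    m = len(x)
    return max(
        min(hamming(x, s[j:j + m]) for j in range(len(s) - m + 1))
        for s in strings
    )


def cssp_brute_force(strings, m, alphabet):
    """Illustrative only: try all |A|^m candidates and keep the best one."""
    candidates = ("".join(c) for c in product(alphabet, repeat=m))
    best = min(candidates, key=lambda x: cssp_radius(x, strings))
    return best, cssp_radius(best, strings)
```

For instance, `hamming("ACT", "CCA")` reproduces the example above, and on the toy instance `["ACTG", "CCTA", "GCTT"]` with $m = 2$ the only substring shared by all three strings is "CT", so the optimal $d_c$ is 0.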
Date: September 20, 2005.
Key words and phrases. Computational biology, integer programming, optimal solution.
AT&T Labs Research Technical Report TD-6GEQ9K. This research has been partially supported by NSF, NIH and CRDF grants. Cláudio N. Meneses was supported in part by the Brazilian Federal Agency for Higher Education (CAPES) – Grant No. 1797-99-9. To appear in BIOMAT 2005.

Farthest Substring Problem (FSSP)
Instance: A finite set $S_f = \{s_1, s_2, \ldots, s_n\}$ of strings, each of length at least $m$, over an alphabet $A$.