Improving Mathematical Programming Approaches for Motif Finding

Carl Kingsford, Elena Zaslavsky, and Mona Singh

Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544
{carlk,elenaz,msingh}@cs.princeton.edu

Abstract. The motif finding problem is to locate a collection of mutually similar subsequences within a given set of DNA sequences. This is an important problem, as such shared motifs often correspond to regulatory elements. We study a combinatorial framework for the motif finding problem, where the goal is to find a minimum (or maximum) weighted clique in a k-partite graph. Previous approaches to find these cliques have relied on graph pruning and divide-and-conquer techniques. Recently, it has been shown that mathematical programming is a promising approach for motif finding. Here, we describe a novel integer linear programming formulation for the problem. A key observation driving our formulation is that the weights on the edges in the graph come from a small set of possibilities. We show that our new formulation leads to a method that is highly effective in practice on instances arising from biological sequence data. We are able to solve these problems to optimality often many times faster than the existing mathematical programming approach.

1 Introduction

A central challenge in post-genomic biology is to reconstruct the regulatory network of an organism. A key step in this process is the discovery of regulatory elements. A commonly studied paradigm starts with a set of DNA sequences that contain binding sites for a common regulatory protein, and then finds shared (or similar) subsequences in each. These subsequences, or motifs, are putative binding sites for the same factor.
The effectiveness of identifying regulatory elements in this manner has been demonstrated when considering sets of sequences identified via shared co-expression [29], orthology [6, 12], and genome-wide location analysis [16].

From a computational point of view, the motif finding problem can be formulated in different ways, and while many methods work reasonably well, a recent comprehensive study [31] shows that no single motif finding method exhibits a high absolute measure of correctness. Broadly speaking, the methods are either probabilistic or combinatorial. Probabilistic approaches estimate the parameters of a motif model using maximum likelihood or maximum a posteriori estimation [15, 4, 14, 17, 9]. Combinatorial approaches either enumerate through all allowed motifs (e.g., [30, 27, 18, 32,