Improving Mathematical Programming Approaches for Motif Finding

Carl Kingsford, Elena Zaslavsky, and Mona Singh

Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544
{carlk,elenaz,msingh}@cs.princeton.edu

Abstract. The motif finding problem is to locate a collection of mutually similar subsequences within a given set of DNA sequences. This is an important problem, as such shared motifs often correspond to regulatory elements. We study a combinatorial framework for the motif finding problem, where the goal is to find a minimum (or maximum) weighted clique in a k-partite graph. Previous approaches to find these cliques have relied on graph pruning and divide-and-conquer techniques. Recently, it has been shown that mathematical programming is a promising approach for motif finding. Here, we describe a novel integer linear programming formulation for the problem. A key observation driving our formulation is that the weights on the edges in the graph come from a small set of possibilities. We show that our new formulation leads to a method that is highly effective in practice on instances arising from biological sequence data. We are able to solve these problems to optimality often many times faster than the existing mathematical programming approach.

1 Introduction

A central challenge in post-genomic biology is to reconstruct the regulatory network of an organism. A key step in this process is the discovery of regulatory elements. A commonly studied paradigm starts with a set of DNA sequences that contain binding sites for a common regulatory protein, and then finds shared (or similar) subsequences in each. These subsequences, or motifs, are putative binding sites for the same factor.
The effectiveness of identifying regulatory elements in this manner has been demonstrated when considering sets of sequences identified via shared co-expression [29], orthology [6, 12], and genome-wide location analysis [16].

From a computational point of view, the motif finding problem can be formulated in different ways, and while many methods work reasonably well, a recent comprehensive study [31] shows that no single motif finding method exhibits a high absolute measure of correctness. Broadly speaking, the methods are either probabilistic or combinatorial. Probabilistic approaches estimate the parameters of a motif model using maximum likelihood or maximum a posteriori estimation [15, 4, 14, 17, 9]. Combinatorial approaches either enumerate through all allowed motifs (e.g., [30, 27, 18, 32,