Search for a Star: Approximate Gene Cluster Discovery Problem (AGCDP) as Minimization Problem on Graph

Jeffrey A. Aborot, Henry Adorna, Jhoirene B. Clemente, Brian Kenneth de Jesus and Geoffrey Solano
Algorithms and Complexity Laboratory
Department of Computer Science
College of Engineering
University of the Philippines Diliman
jeffrey.aborot@up.edu.ph, ha@dcs.upd.edu.ph, jbclemente@up.edu.ph, badejesus@up.edu.ph, gasolano@up.edu.ph

ABSTRACT
Finding gene clusters in genomes is an essential step in establishing relationships among organisms. Gene clusters may express functional dependencies among genes and may give insight into the expression of specific traits. The problem of finding gene clusters among several genomes is referred to as Gene Cluster Discovery, and several models have already been formulated to define it. One formulation of this problem is the Approximate Gene Cluster Discovery Problem (AGCDP), which is modelled as a combinatorial optimization problem in some works. In this paper we propose an approach that transforms AGCDP into the problem of finding a minimum-weight star in a graph. Detailed examples are presented to clarify the transformation. A proof is also presented to show the equivalence between the input parameters of AGCDP and the graph constructed to represent them.

Keywords
Gene, genome, gene cluster, gene content, linear interval, minimization, minimum-weight star, combinatorial optimization

1. INTRODUCTION
Gene clusters are sets of genes that are closely related to each other. Genes belonging to a cluster may share functional dependencies and may be involved in the expression of a specific trait. Identifying gene clusters is also an essential step in establishing relationships between organisms, as well as in the discovery of drugs and treatments for diseases. The problem of identifying these sets of genes is called Gene Cluster Discovery.
This problem has been modelled several times, for example in [3], [4], and [5], where genes are modelled as integers and genomes are either permutations or sequences defined over the set of all genes. The models in [3] also take into account gene clusters with gaps (max-gap clusters) and without gaps (exact clusters).

The focus of this work is the model presented in [5], which defines the Approximate Gene Cluster Discovery Problem (AGCDP) as a combinatorial problem that identifies the set of genes that are kept “more or less” together across genome sequences. An Integer Linear Programming (ILP) formulation is also presented in [5]. Several modifications of the model, specifically of the objective function, are also presented to take into account characteristics of real biological data. These include the absence of gene cluster occurrences in some of the input genomes, the identification of valid gene clusters, and the use of a reference genome.

In this paper we represent AGCDP as a graph problem. We define how the set of inputs is transformed into a specific graph called G_AGCDP. We then discuss how the problem reduces to finding a minimum-weight star(u) in that graph. Two cases are modelled in this paper: scenarios with and without a given reference genome.

This paper is organized as follows. Section 2 presents a brief discussion of AGCDP as well as the naive and ILP formulations of the problem. Section 3 discusses in detail how AGCDP is represented as a graph problem. The proof of equivalence of the two representations is given in Section 4. Finally, Section 5 concludes the paper.

2. APPROXIMATE GENE CLUSTER DISCOVERY PROBLEM
The following definitions are necessary for understanding the problem.

1. Gene. A gene is represented by a non-negative integer g ∈ ℤ≥0. The integer 0 represents special genes that have no existing homologs; these genes are not of interest in this problem.

2. Gene Universe. The set of all unique genes is called the gene universe and is denoted by U = {0, 1, 2, ..., N}.
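As a minimal illustration of these two definitions, the following Python sketch builds a gene universe and treats a genome as a sequence of integers over it, as described in the introduction. The function names (gene_universe, is_over_universe, genes_of_interest) and the sample genome are hypothetical and chosen only for this example; they are not part of the model in [5].

```python
from typing import List, Set


def gene_universe(n: int) -> Set[int]:
    """Return U = {0, 1, ..., n}; 0 stands for genes without homologs."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return set(range(n + 1))


def is_over_universe(genome: List[int], universe: Set[int]) -> bool:
    """Check that every gene in the genome belongs to the universe."""
    return all(g in universe for g in genome)


def genes_of_interest(genome: List[int]) -> Set[int]:
    """Collect the distinct genes of a genome, dropping the special gene 0."""
    return {g for g in genome if g != 0}


# Hypothetical example: a universe with N = 5 and one genome over it.
U = gene_universe(5)
sample_genome = [1, 0, 3, 3, 5]
valid = is_over_universe(sample_genome, U)
content = genes_of_interest(sample_genome)
```

Representing the universe as a set keeps the membership check simple, while the genome stays a list so that gene order and repeated genes are preserved.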