Search for a Star: Approximate Gene Cluster Discovery Problem (AGCDP) as Minimization Problem on Graph

Jeffrey A. Aborot, Henry Adorna, Jhoirene B. Clemente, Brian Kenneth de Jesus and Geoffrey Solano
Algorithms and Complexity Laboratory
Department of Computer Science
College of Engineering
University of the Philippines Diliman
jeffrey.aborot@up.edu.ph, ha@dcs.upd.edu.ph, jbclemente@up.edu.ph, badejesus@up.edu.ph, gasolano@up.edu.ph

ABSTRACT
Finding gene clusters in genomes is an essential process in establishing relationships among organisms. Gene clusters may express functional dependencies among genes and may give insight into the expression of specific traits. The problem of finding gene clusters among several genomes is referred to as Gene Cluster Discovery, and several models have already been formulated to define it. One formulation of this problem is the Approximate Gene Cluster Discovery Problem (AGCDP), which is modelled as a combinatorial optimization problem in some works. In this paper we propose an approach that transforms AGCDP into the problem of finding a minimum-weight star in a graph. Detailed examples are presented to further clarify the transformation. A proof is also presented to show the equivalence between the input parameters of AGCDP and the graph constructed to represent them.

Keywords
Gene, genome, gene cluster, gene content, linear interval, minimization, minimum-weight star, combinatorial optimization

1. INTRODUCTION
Gene clusters are sets of genes that are closely related to each other. Genes belonging to a cluster may share functional dependencies and may be involved in the expression of a specific trait. Identifying gene clusters is also an essential step in establishing relationships between organisms, as well as in the discovery of drugs and treatments for diseases. The problem of identifying these sets of genes is called Gene Cluster Discovery.
This problem has been modelled several times. Examples are presented in [3], [4], and [5], where genes are modelled as integers and genomes are either permutations or sequences defined over the set of all genes. The models in [3] also take into account gene clusters with gaps (max-gap clusters) and without gaps (exact clusters). The focus of this work is the model presented in [5], which defines the Approximate Gene Cluster Discovery Problem (AGCDP) as a combinatorial problem that identifies the set of genes that are kept "more or less" together across genome sequences. An Integer Linear Programming (ILP) formulation is also presented in [5]. Several modifications of the model, specifically of the objective function, are also presented there to take into account characteristics of real biological data. These include the absence of gene cluster occurrences in some of the input genomes, the identification of valid gene clusters, and the use of a reference genome.

In this paper we represent AGCDP as a graph problem. We define how the set of inputs is transformed into a specific graph called G_ACGDP. We then discuss how the problem is reduced to finding a minimum-weight star(u) in a graph. Two cases are modelled in this paper: scenarios with and without a given reference genome.

This paper is organized as follows. Section 2 presents a brief discussion of AGCDP as well as the naive and ILP formulations of the problem. Section 3 contains the detailed discussion of how AGCDP is represented as a graph problem. The proof of equivalence of the two representations is discussed in Section 4. Finally, Section 5 concludes the paper.

2. APPROXIMATE GENE CLUSTER DISCOVERY PROBLEM
The following definitions are necessary for understanding the problem.

1. Gene
   A gene is represented by a non-negative integer g ∈ Z≥0. The integer 0 represents special genes that have no existing homologs; these genes are not of interest in this problem.

2. Gene Universe
   The set of all unique genes is called the gene universe and is denoted by U = {0, 1, 2, ..., N}.