A Comparison of Community Identification Algorithms for Regulatory Network Motifs

Douglas Oliveira and Marco Carvalho

Abstract: In recent years, high-throughput data about biological processes has become available, opening a wide range of research possibilities in multidisciplinary areas such as network science. A widely accepted idea is that no life can exist without complex systems formed by interacting macromolecules. Rather than a single gene being responsible for a single phenotype (central dogma), it has been shown that the interaction among several genes is responsible for a given phenotype, a concept called systems biology. Identifying patterns of interactions (motifs) in these complex networks has attracted the attention of the scientific community, given that these networks are often very dense and dynamic. In this work we focus on a particular kind of biological network, a regulatory network in which each node is a transcription factor and two nodes are connected if one of them encodes a transcription factor that regulates the other. We focus on a specific kind of motif, the dense overlapping regulon (DOR), in which the sets of genes regulated by different transcription factors overlap more than expected in a random network. We apply several community identification algorithms to determine which is best suited to identifying this particular motif.

I. INTRODUCTION

According to [1], most of the interesting accomplishments in biological research have been in genomics. One example is the genome sequencing of many species, including the human genome, which has created many possibilities for a better understanding of the function of many genes from large-scale sequencing processes.
We currently have a good understanding of life at the molecular level, and recognize that we need to see gene structures not only in isolation but also as sets, and to understand how they interact with one another [2].

By accepting the concept of systems biology, we are not denying the importance of reductionist approaches. Reductionist approaches are simply limited in their ability to present a comprehensive picture of life [1]. One fact that supports the idea of systems biology is that individual cells, when separated from their neighbors, lose many of their functional and structural attributes [3].

The notion of systems biology dates back hundreds of years, to when the word organism was first used to describe living animals and plants as organizations in which each part is reciprocally end and means. This approach has brought many advantages; for example, evolutionary mechanisms can be better understood in light of complex molecular systems [4].

D. Oliveira and M. Carvalho are with Florida Institute of Technology, 150 W. University Blvd, Melbourne, FL, USA. doliveira2011@my.fit.edu, mcarvalho@cs.fit.edu

With the current availability of terabytes of data in many domains, including biological processes, communications, and social interactions, a variety of research activities have started to focus on modeling and identifying global network properties and characteristics. These include the small-world property [5] and scale-free networks [6]. One of the first network structures analyzed with this approach was the network representing scientific collaborations and co-publications [7].

While important, such global metrics must be augmented with an understanding of the basic structural elements, the building blocks of the network. These building blocks are often referred to as network motifs [8] and represent recurring structures and patterns of connections.
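The idea of a motif as a connection pattern that recurs more often than chance would allow can be sketched for the overlap pattern described in the abstract. The following is a minimal illustration, not the procedure used in the cited works: the toy regulatory data, the pool of 20 candidate genes, and the null model (each transcription factor keeps its number of targets but draws them uniformly at random) are all assumptions made for the example.

```python
import random
from itertools import combinations

# Hypothetical toy data: transcription factor -> set of genes it regulates.
REGULATION = {
    "tf_a": {"g1", "g2", "g3", "g4"},
    "tf_b": {"g2", "g3", "g4", "g5"},
    "tf_c": {"g1", "g3", "g4", "g5"},
}
# Assumed pool of candidate target genes used by the null model.
GENE_POOL = [f"g{i}" for i in range(1, 21)]


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_target_overlap(network):
    """Mean Jaccard overlap over all pairs of transcription factors."""
    pairs = list(combinations(network.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def randomized(network, gene_pool, rng):
    """Null model (an assumption): each factor keeps its number of targets,
    but the targets are drawn uniformly at random from gene_pool."""
    return {
        tf: set(rng.sample(gene_pool, len(targets)))
        for tf, targets in network.items()
    }


def observed_vs_expected(network, gene_pool, trials=1000, seed=0):
    """Return (observed overlap, mean overlap across randomized networks)."""
    rng = random.Random(seed)
    observed = mean_target_overlap(network)
    null = [
        mean_target_overlap(randomized(network, gene_pool, rng))
        for _ in range(trials)
    ]
    return observed, sum(null) / trials


if __name__ == "__main__":
    observed, expected = observed_vs_expected(REGULATION, GENE_POOL)
    print(f"observed overlap: {observed:.3f}")
    print(f"expected overlap in randomized networks: {expected:.3f}")
```

In this sketch, an observed overlap well above the randomized baseline is what would suggest a dense overlapping structure. A more careful null model would typically also preserve how many factors regulate each gene, which this simplified version does not attempt.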
In [8] the authors present several kinds of motifs normally found in different types of networks, and attribute the presence of these motifs to the way in which each network was designed. More specifically, for biological networks, the work of [9] identifies three major patterns that are significantly present in the network. Among them, a motif called dense overlapping regulons (DOR) requires special attention. The motif is defined as a layer of overlapping interactions that is much denser than the corresponding structures in randomized networks. The result is a structure characterized by regions of interaction that are internally dense but only loosely connected to one another. These regions are often called communities.

There are many community identification algorithms in the literature. In general, such algorithms rely on partitioning the data into a certain number of communities (groups, subsets, or categories) [10]. There is no clear definition of a community, but most authors characterize a community by its internal homogeneity and its external separation [11].

In this work we evaluate the results of four community identification algorithms, aiming to identify which is best suited to the identification of DOR motifs in a regulatory network.

978-1-4799-3163-7/13/$31.00 ©2013 IEEE

II. RELATED WORK

Gene expression data is obtained through microarray experiments [16] and is commonly used for the study of biological networks. Community identification algorithms have been widely applied to these kinds of datasets, for example in the construction of coexpression networks [12]. In a coexpression network each node represents a gene, and two nodes are connected if their expression levels are similar [13].

The work of [14] reports the results of clustering 118 genes with a hierarchical community identification algorithm; members of the same cluster tend to participate in common processes. In a later work [15], the authors