Identifying Complex Biological Interactions based on Categorical Gene Expression Data

Ben Goertzel, Cassio Pennachin, Lúcio de Souza Coelho and Maurício Mudado

Ben Goertzel (e-mail: bgoertzel@biomind.com), Cassio Pennachin (e-mail: cassio@biomind.com), Lúcio de Souza Coelho (e-mail: lucio@biomind.com) and Maurício Mudado (e-mail: mauricio@biomind.com) work at Biomind LLC, Rockville, Maryland.

Abstract— A novel method, MUTIC (Model Utilization-based Clustering), is described for identifying complex interactions between genes or gene-categories based on gene expression data. The method deals with binary categorical data, which consists of a set of gene expression profiles divided into two biologically meaningful categories. It does not require data from multiple time points. Gene expression profiles are represented by feature vectors whose component features are either gene expression values, or averaged expression values corresponding to GO or PIR categories. A supervised learning algorithm (genetic programming) is used to learn an ensemble of classification models distinguishing the two categories based on the feature vectors corresponding to their members. Each feature is associated with a “model utilization vector,” which has an entry for each high-quality classification model found, indicating whether or not the feature was used in that model. These utilization vectors are then clustered using a variant of hierarchical clustering called Omniclust. The result is a set of model-utilization-based clusters, in which features are gathered together if they are often considered together by classification models, which may be because they are co-expressed, or may be for subtler reasons involving multi-gene interactions. The MUTIC method is illustrated by applying it to a dataset regarding gene expression in human brains of various ages. Compared to traditional expression-based clustering, MUTIC yields clusters that have higher mathematical quality (in the sense of homogeneity and separation) and that also yield novel insights into the underlying biological processes.

I. INTRODUCTION

A variety of methodologies for analyzing gene expression data have arisen in recent years, including but not limited to: identifying which genes are maximally differentiated between two categories; clustering genes based on co-expression across multiple samples or multiple experiments [1]-[8]; using supervised categorization algorithms to learn rules distinguishing two or more categories of gene expression profiles from each other [9]-[13]; and inference of genetic interaction networks from gene expression time series data [14]-[18]. These methodologies serve various purposes, such as induction of diagnostic models, qualitative understanding of the biological phenomena underlying a dataset, and identification of specific actors (e.g. genes, proteins) that may be involved in a certain biological phenomenon.

In this paper we present a novel methodology for gene expression data analysis, whose goal is to identify those interactions between genes, proteins, and biological processes that are most relevant to the phenotypic distinction underlying a given binary categorization of gene expression profiles.

Clustering is the most common tool for interaction identification. By determining which genes or gene-categories have expression-value profiles that cluster together across multiple samples or multiple experiments, one gets a picture of which genes are “associated” with each other. These associations do not usually have a clear interpretation, however, as co-expression can occur for a variety of reasons. Furthermore, many types of interactions are in principle not identifiable via directly clustering gene expression values.
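This limitation can be illustrated with a toy case that is not drawn from the dataset analyzed in this paper. In the sketch below, three hypothetical binarized genes A, B and C are constructed so that C is high exactly when one, but not both, of A and B is high (an exclusive-or relationship, chosen here purely for illustration). The helper `pearson` and the variables `gene_a`, `gene_b` and `gene_c` are illustrative names, not part of MUTIC.

```python
from itertools import product
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical binarized expression levels (0 = low, 1 = high),
# one sample for each combination of A and B.
samples = list(product([0, 1], repeat=2))
gene_a = [a for a, _ in samples]
gene_b = [b for _, b in samples]
gene_c = [a ^ b for a, b in samples]  # assumed toy rule: C = A xor B

# Pairwise correlations, the kind of signal co-expression clustering relies on.
pairwise = {
    ("A", "B"): pearson(gene_a, gene_b),
    ("A", "C"): pearson(gene_a, gene_c),
    ("B", "C"): pearson(gene_b, gene_c),
}

# Joint view: collect the values of C observed for each (A, B) combination.
joint_table = {}
for a, b, c in zip(gene_a, gene_b, gene_c):
    joint_table.setdefault((a, b), set()).add(c)

# C is a function of the pair (A, B) if each combination maps to one value.
c_is_function_of_pair = all(len(values) == 1 for values in joint_table.values())

for pair, r in pairwise.items():
    print(pair, r)
print("C determined by (A, B) jointly:", c_is_function_of_pair)
```

On this constructed example every entry of `pairwise` is zero while `c_is_function_of_pair` is true, so a correlation-based clustering has no signal to group on even though C is fully determined by A and B together. Real expression data are continuous and noisy, so in practice the effect is a matter of degree rather than an exact zero.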
For instance, one will not recognize ternary interactions wherein, say, C is only highly expressed when both A and B are highly expressed together.

The technique we describe here, MUTIC (Model Utilization-based Clustering), is oriented toward capturing interactions that ordinary expression-based clustering misses. The end result of MUTIC looks superficially similar to that of traditional gene expression clustering: one obtains a set of clusters (of genes or gene-categories), where the elements of a cluster are hypothesized to have a significant interrelationship. What is novel is that these clusters are not determined based on co-expression but via a more involved analysis. The semantics of the clusters is different: MUTIC clusters represent genes or gene-categories that are usefully considered in combination when formulating classification rules distinguishing one category of gene expression profiles from another. The elements of such a cluster may or may not be co-expressed across the set of gene expression profiles under analysis.

Here we describe the MUTIC method and then briefly discuss its application to a dataset regarding gene expression in human brain cells, collected in a study of the neurogenetics of aging [19]. In the context of this dataset, we review a number of potentially interesting biological interactions that the new method finds but traditional expression-based clustering misses. We also analyze the homogeneity and separation properties of MUTIC clusters, and conclude that they possess significantly greater cluster quality than clusters found via traditional gene expression clustering.

II. THE MUTIC ALGORITHM

A. Data Requirements and Pre-processing

MUTIC deals with data which is categorical: i.e. one must start with a set of gene expression profiles belonging to one