Expert system for clustering prokaryotic species by their metabolic features

Clara Higuera a,b,⇑, Gonzalo Pajares b, Javier Tamames c, Federico Morán a

a Dpt. Bioquímica y Biología Molecular I, Facultad de Ciencias Químicas, Universidad Complutense de Madrid, avda. Complutense s/n, 28040 Madrid, Spain
b Dpto. Ingeniería del Software e Inteligencia Artificial, Facultad Informática, Universidad Complutense, C/ Prof. José García Santesmases, s/n. 28040 Madrid, Spain
c Centro Nacional de Biotecnología, National Research Council (CSIC), c/Darwin, 3. Cantoblanco, 28049 Madrid, Spain

article info

Keywords: Expert system; Clustering; Self-organizing Maps; Clustering validity indices; Metabolism; Prokaryotic species

abstract

Studying communities of microbial species is highly important, since many natural and artificial processes are mediated by groups of microbes rather than by single entities. One way of studying them is to search for common metabolic characteristics among microbial species. Such characteristics are not only a potential measure for differentiating and classifying closely related organisms; their study also allows the discovery of common functional properties that may describe the way of life of entire organisms or species. The main contribution of this work is an expert system (ES) that clusters a complex data set of 365 prokaryotic species described by 114 metabolic features, information which may be incomplete for some species. Inspired by human expert reasoning and based on hierarchical clustering strategies, the proposed ES first estimates the optimal number of clusters into which to divide the data set. It then starts an iterative clustering process, based on the Self-organizing Maps (SOM) approach, in which it finds relevant clusters at different steps by means of a new validity index inspired by the well-known Davies-Bouldin (DB) index.
To monitor the process and assess the behavior of the ES, the partition obtained at each step is validated with the DB validity index. The resulting clusters show that the ES, combined with metabolic features, can handle a complex data set and help extract underlying information that may relate metabolism to phenotypic, environmental or evolutionary characteristics in prokaryotic species, gaining an advantage over other existing approaches.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Understanding communities of microbial species is highly important because many natural and artificial processes are mediated by groups of microbes rather than by isolated entities. In order to create artificial communities or manipulate existing ones, it is necessary to understand the specific requirements of the individual species and, in the long term, to be able to predict the conditions under which they can survive.

Prokaryotes are one important kind of microorganism, and for many years one way of studying them has been to try to categorize (Hong, Kim, & Lee, 2004) the huge variety of prokaryotic organisms, which is itself a challenging task. One of the reasons is the lack of a globally accepted concept of species for prokaryotes, together with the fact that their taxonomy is continuously influenced by advances in microbial population genetics, ecology and genomics (Gevers et al., 2005). When assigning an unknown bacterium to a species, experts usually do so by identifying phenotypic or genomic similarity. For many years, the traditional method for classifying prokaryotes has been the identification of the 16S rRNA (Jain, Wang, Liao, & Boyd, 2009), a sequence highly conserved through evolution.
It makes it possible to find differences among microorganisms and to build evolutionary trees, also called phylogenetic trees, which show the evolutionary relationships among species that are believed to possess a common ancestor.

Although the analysis of 16S rRNA has been widely and successfully applied, experts have started to look for other kinds of information that may shed some light on the differentiation of prokaryotic species. One of them is the search for common metabolic characteristics, which some authors suggest are not only a potential measure for the classification or differentiation of closely related organisms (Lee et al., 2012), but whose study may also reveal common functional properties that traditional methods such as the analysis of 16S rRNA are not able to find (Jain et al., 2009).

⇑ Corresponding author at: Dpt. Bioquímica y Biología Molecular I, Facultad de Ciencias Químicas, Universidad Complutense de Madrid, avda. Complutense s/n, 28040 Madrid, Spain. Tel.: +34 91 394 4265.
E-mail addresses: pajares@fdi.ucm.es, clarahiguera@ucm.es (C. Higuera).
Expert Systems with Applications 40 (2013) 6185–6194
http://dx.doi.org/10.1016/j.eswa.2013.05.013

In biochemistry, a metabolic pathway consists of a set of reactions that take place inside the cell; it involves the transformation of substrates into different products necessary for maintaining its