IOSR Journal of Mathematics (IOSR-JM)
e-ISSN: 2278-5728, p-ISSN: 2319-765X. Volume 13, Issue 3 Ver. III (May - June 2017), PP 24-32
www.iosrjournals.org
DOI: 10.9790/5728-1303032432

A priori groups based on Bhattacharyya distance and partitioning around medoids algorithm (PAM) with applications to metagenomics

Rodríguez-Casado, Clara I. 1; Monleón-Getino, Toni 1,2,+; Cubedo, Marta 1; Ríos-Alcolea, Martín 1,2
1 (Department of Genetics, Microbiology and Statistics, University of Barcelona, Barcelona, Spain)
2 (Research Group in Biostatistics and Bioinformatics (GRBIO), Barcelona, Spain)
+ (Corresponding author)

Abstract: Plants, animals and humans live in close association with microbial organisms. Increasingly, biologists have come to appreciate that microbes make up an important part of an organism's phenotype. Microbial communities have a unique complexity that makes their diversity difficult to study. However, for many questions on the structure of the microbial community one only needs to know the relative order of diversity among samples rather than the total diversity. Culturing microorganisms can be complex, and this has prompted the development of new scientific methodologies for their study. One of these methodologies is metagenomics. An important problem in metagenomics is measuring the dissimilarity between distributions of features, such as taxa or groups. The focus of this note is the proposal of a new method based on the Bhattacharyya distance that establishes a priori groups using the partitioning around medoids algorithm (PAM). The results show a good reduction in the size of the dataset and offer an interesting way of revealing possible a priori subgroups, or communities, among the microorganisms that make up the analyzed sample.
Keywords: Multivariate methods, applied statistical methods, data analysis, multidimensional scaling, metagenomics, cluster, biology, microbiology, metrical distances

I. Introduction

The gut microbiota is home to more than 99% of the genetic information in humans and, although there is an important connection between the gut microbiome and metabolism, immune health, disease, autism, allergies, and obesity, it remains a largely unexplored area of science [1]. Microbial communities have a unique complexity that makes their diversity difficult to study. However, for many questions on the structure of the microbial community one only needs to know the relative order of diversity among samples rather than total diversity. Culturing microorganisms can be complex, which has prompted the development of new scientific methodologies for their study. One of these methodologies is metagenomics.

Metagenomics (also referred to as environmental genomics, ecogenomics or community genomics) is the study of genetic material recovered directly from environmental samples, by direct extraction and cloning of DNA from an assemblage of microorganisms, that is, the genomic analysis of microorganisms [2]. In any biological system information is ultimately linked to the DNA sequences present, and microbial communities are no exception. In microbial communities we used 'word' frequency profiles of operational taxonomic units (OTUs) as a proxy for the composition of the bacterial community at the genomic level, thus avoiding the need to define bacterial species or taxonomic groups [3]. While traditional microbiology, microbial genome sequencing and genomics rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample [4].
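To make the idea of comparing OTU frequency profiles concrete, the following minimal Python sketch computes a Bhattacharyya distance between two samples. It is an illustration under stated assumptions, not the implementation used in this work: it assumes each sample is a vector of non-negative OTU counts over the same ordered set of OTUs, and it uses one common form of the distance, D = -ln(sum_i sqrt(p_i q_i)), where p and q are the relative frequencies. Other forms exist (for example the arccosine of the same sum), and the function name and example counts are hypothetical.

```python
import math


def bhattacharyya_distance(counts_a, counts_b):
    """Bhattacharyya distance between two OTU count profiles.

    Assumption: both profiles list counts for the same OTUs in the same
    order. The form used here is -ln(BC), where BC is the Bhattacharyya
    coefficient sum_i sqrt(p_i * q_i) of the relative frequencies.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("profiles must cover the same OTUs")
    total_a = sum(counts_a)
    total_b = sum(counts_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each profile needs at least one positive count")

    # Convert raw counts to relative frequencies.
    p = [c / total_a for c in counts_a]
    q = [c / total_b for c in counts_b]

    # Bhattacharyya coefficient: 1 for identical profiles, 0 for disjoint ones.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    bc = min(bc, 1.0)  # guard against floating-point overshoot

    if bc == 0.0:
        return math.inf  # profiles share no OTU
    return -math.log(bc)


# Hypothetical counts for three OTUs in two samples.
sample_1 = [10, 30, 60]
sample_2 = [20, 30, 50]
distance = bhattacharyya_distance(sample_1, sample_2)
```

Because the distance depends only on relative frequencies, two samples sequenced at different depths but with the same composition are at distance zero. A matrix of such pairwise distances is the kind of input that a medoid-based clustering method such as PAM operates on.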
The development of metagenomics stemmed from the ineluctable evidence that as-yet-uncultured microorganisms represent the vast majority of organisms in most environments on Earth. This evidence was derived from analyses of 16S rRNA gene sequences amplified directly from the environment, an approach that avoided the bias imposed by culturing and led to the discovery of vast new lineages of microbial life [2]. In a recent study [3], we addressed the question of how to explore diversity (species richness) and complexity (frequency distribution) in microbial communities directly from a limited amount of metagenomic data, and how to characterize communities efficiently. For this purpose we built the library MetagenOutLDA. Next-generation sequencing and other recent techniques applied to microbial metagenomics have transformed the study of microbial diversity. Microbial metagenomics, or sequencing of DNA extracted from microbial communities, provides a means to determine what organisms are present without the need for