What’s in the mix: phylogenetic classification of metagenome sequence samples

Alice C McHardy 1 and Isidore Rigoutsos

Metagenomics is a novel field that deals with the sequencing and study of microbial organisms or viruses isolated directly from a particular environment. It has already provided a wealth of information and new insights into the inhabitants of various environmental niches. For a given sample, one would like to determine the phylogenetic provenance of the obtained fragments, the relative abundance of its different members, their metabolic capabilities, and the functional properties of the community as a whole. To this end, computational analyses are becoming increasingly indispensable tools. In this review, we focus on the problem of determining the phylogenetic identity of the sample fragments, a procedure known as ‘binning’. This step is essential for the reconstruction of the metabolic capabilities of individual organisms or phylogenetic clades of a community, and the study of their interactions.

Addresses
Bioinformatics and Pattern Discovery Group, IBM Thomas J Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, USA

Corresponding author: McHardy, Alice C. (mchardy@mpi-inf.mpg.de) and Rigoutsos, Isidore (rigoutso@us.ibm.com)

1 Present address: Max Planck Institute for Informatics, Building E14, Stuhlsatzenhausweg 85, 66123 Saarbruecken, Germany.

Current Opinion in Microbiology 2007, 10:499–503

This review comes from a themed issue on Genomics
Edited by Claire M. Fraser-Liggett and Jean Weissenbach

Available online 22nd October 2007

1369-5274/$ – see front matter
© 2007 Elsevier Ltd. All rights reserved.

DOI 10.1016/j.mib.2007.08.004

Introduction
Metagenomics is a new field of activity which increases our understanding of microbial communities by sequencing genomic material directly from a community of a particular environment.
In this, it represents a departure from reductionist studies that call for the analysis of individual organisms and their responses to limited types of stimuli. Instead, system-level thinking requires that organisms be examined as integral members of the larger communities and of the environments to which they belong, and that the functional properties of such communities be elucidated. Already, the field has generated considerable advances in understanding microbial and viral communities from diverse environments, for which a vast fraction of the inhabitants cannot be obtained in pure culture with standard techniques (see e.g. [1–10,11]).

Unlike conventional genomics, metagenomics does not require pure clonal cultures of individual organisms for sequencing. Instead, DNA from the mix of different populations of a particular microbial community is sequenced with one of the currently available techniques. Metagenome studies of a community at the level of individual populations and their interactions require the reconstruction of genomic entities based on the sequenced reads. Generated reads with overlaps are subsequently assembled into scaffolds of varying lengths. The average length of the generated scaffolds depends strongly on the number of distinct populations in the sample, the size and architecture of their genomes, and their relative abundance in the sampled community. In general, the more complex the community, the smaller the average length of each scaffold. To address the problem of assigning individual fragments to sample populations or higher-level clades, various techniques have been developed.

Methods for the phylogenetic characterization of the samples
A number of approaches can be used for the phylogenetic characterization of fragments. Assignments can be performed based on the construction of phylogenies using conserved and universally present markers such as ribosomal RNA [12].
This method is considered the gold standard in phylogenetic taxonomy and the most accurate. Ribosomal RNA has been sequenced extensively and a large library of reference sequences exists (see e.g. [13]). An extended list of markers that is currently in use includes about 25–30 highly conserved genes [14]. The use of less conserved genes could in principle increase fragment coverage, but this would come at a cost in specificity because of noise caused by horizontal gene transfer, gene duplication and loss, and missing reference points in the space of phylogenetic homologs.

Alternatively, homologs found by database searches with one of the many available methods (Blast, Fasta, Smith-Waterman, etc.) can be used for assignment. In [15], the authors report that evaluation of simple Blast homologies allowed the accurate assignment of even very short fragments from genomes contained in the query database. This permits the assignment of fragments from sequenced species, such as the Shewanella and Burkholderia populations in the Sargasso Sea metagenome. Using stringent