Journal of Microbiological Methods 65 (2006) 49-62
www.elsevier.com/locate/jmicmeth
An ecoinformatics tool for microbial community studies:
Supervised classification of Amplicon Length
Heterogeneity (ALH) profiles of 16S rRNA
Chengyong Yang a, DeEtta Mills b, Kalai Mathee b, Yong Wang a, Krish Jayachandran c,
Masoumeh Sikaroodi d, Patrick Gillevet d, Jim Entry e, Giri Narasimhan a,*
a Bioinformatics Research Group (BioRG), School of Computer Science, Florida International University, Miami, Florida, 33199, USA
b Department of Biological Sciences, Florida International University, Miami, Florida, USA
c Department of Environmental Sciences, Florida International University, Miami, Florida, USA
d Microbial and Environmental Biocomplexity, Department of Environmental Sciences and Policy, George Mason University,
Manassas, Virginia, USA
e USDA Agricultural Research Service, Northwest Irrigation and Soils Research Laboratory, Kimberly, Idaho, USA
Received 18 January 2005; received in revised form 22 April 2005; accepted 24 June 2005
Available online 27 July 2005
Abstract
Support vector machines (SVM) and K-nearest neighbors (KNN) are two computational machine learning tools that
perform supervised classification. This paper presents a novel application of such supervised analytical tools to profile microbial
communities and to distinguish patterns among ecosystems. Amplicon length heterogeneity (ALH) profiles from
several hypervariable regions of the 16S rRNA gene of eubacterial communities from Idaho agricultural soil samples and from
Chesapeake Bay marsh sediments were separately analyzed. The profiles from all available hypervariable regions were
concatenated to obtain a combined profile, which was then provided to the SVM and KNN classifiers. Each profile was
labeled with information about the location or time of its sampling. We hypothesized that after a learning phase using
feature vectors from labeled ALH profiles, both these classifiers would have the capacity to predict the labels of previously
unseen samples. The resulting classifiers were able to predict the labels of the Idaho soil samples with high accuracy. The
classifiers were less accurate for the classification of the Chesapeake Bay sediments, suggesting greater similarity within the
Bay's microbial community patterns in the sampled sites. The profiles obtained from the V1+V2 region were more
informative than those obtained from any other single region. However, combining them with profiles from the V1 region
(with or without the profiles from the V3 region) resulted in the most accurate classification of the samples. The addition
* Corresponding author. Tel.: +1 305 348 3748; fax: +1 305 348 3549.
E-mail address: giri@cs.fiu.edu (G. Narasimhan).
0167-7012/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.mimet.2005.06.012
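The workflow summarized in the abstract (concatenate the per-region ALH profiles into one feature vector, then classify with a supervised learner) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Euclidean distance, the choice of k = 3, and all abundance values below are illustrative assumptions, and only the KNN classifier is shown (the SVM is omitted).

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def concatenate_profiles(regions):
    """Join per-region ALH profiles into a single feature vector.

    `regions` maps a region name (e.g. "V1") to a list of relative
    peak abundances. Regions are joined in sorted name order so that
    features line up across samples. (Hypothetical helper.)
    """
    combined = []
    for name in sorted(regions):
        combined.extend(regions[name])
    return combined


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up profiles (NOT data from the paper): two sites, two regions.
samples = [
    ({"V1": [0.6, 0.4], "V1+V2": [0.7, 0.2, 0.1]}, "site_A"),
    ({"V1": [0.5, 0.5], "V1+V2": [0.6, 0.3, 0.1]}, "site_A"),
    ({"V1": [0.1, 0.9], "V1+V2": [0.1, 0.3, 0.6]}, "site_B"),
    ({"V1": [0.2, 0.8], "V1+V2": [0.2, 0.2, 0.6]}, "site_B"),
]
train_vectors = [concatenate_profiles(profile) for profile, _ in samples]
train_labels = [label for _, label in samples]

# A previously unseen sample, also invented for illustration.
unseen = concatenate_profiles({"V1": [0.55, 0.45], "V1+V2": [0.65, 0.25, 0.1]})
predicted = knn_predict(train_vectors, train_labels, unseen, k=3)
print(predicted)
```

In this toy setup the unseen sample lies closest to the two site_A training vectors, so the majority vote assigns it to site_A. With real ALH data, the profiles from each region would need a common binning of amplicon lengths before concatenation, which this sketch takes for granted.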