Pattern Recognition of Soil Samples Based on the Microbial Fatty Acid Contents

XIN-HUA SONG AND PHILIP K. HOPKE*
Department of Chemistry, Clarkson University, Potsdam, New York 13699-5810

MARY ANN BRUNS, KEN GRAHAM, AND KATE SCOW
Department of Land, Air, and Water Resources, University of California-Davis, Davis, California 95616

The problem of distinguishing particles in ambient airborne particulate matter derived from different soils is difficult when based only on elemental composition. However, biologically derived chemical species associated with a specific crop in a farming area could be more useful in discriminating soil samples. Phospholipid fatty acids (PLFAs) extracted from microorganisms in soils have been used to fingerprint soil microbial community composition. A set of 72 PLFAs was found to occur in at least 10% of all soil samples studied, and data on these PLFAs were used to distinguish soils planted with different crops. This paper describes the application of discriminant partial least squares (D-PLS) and regularized discriminant analysis (RDA) to PLFA data from soils. A variable selection approach based on the PLS regression coefficients has been proposed to identify the most important PLFA variables for the classification and to improve classification results. RDA uses a regularized covariance matrix estimate for the conventional statistical discriminant analysis methods and provides some advantages. The results showed that both the D-PLS and RDA methods provided satisfactory performance in classifying soil samples, with RDA being slightly better. The study also indicated that the variable selection strategy was able to improve the classification results and to help identify the most important PLFAs for distinguishing soils. The best classification performance has been achieved by applying the RDA analysis to the selected-variable PLFA data.

Introduction

PM10 is respirable particulate matter with an aerodynamic diameter of 10 μm or less (1).
PM10 can consist of soil-derived dust, soot, crystalline chemicals, or microbial aerosols, but in some areas on a seasonal basis, airborne soil dust constitutes the largest fraction of PM10 (2). Distinguishing particles in ambient PM10 samples as being derived from soils would be extremely useful in air quality management (3). Furthermore, the ability to distinguish one type of soil from another would assist in identifying sources of soil-derived PM10. Distinguishing soils based only on elemental analysis is difficult because of the similarity in soil compositions. However, biologically derived chemical species that clearly identify the associated microflora of a specific crop could provide additional information for differentiating soil samples.

Phospholipid fatty acids (PLFAs) extracted from soils can be used to fingerprint the structure of soil microbial communities (4-7). Unlike other extractable fatty acids in soil organic debris, PLFAs are cell membrane components and undergo rapid degradation following cell lysis (8). The types and amounts of PLFAs consequently reflect the composition of viable biological assemblages contained within environmental samples. As the use of PLFA fingerprinting of soil microbial communities becomes more common, it is essential to understand how reliably the PLFA contents represent the microbial community in a soil. It is also important to examine specific PLFAs within the whole data set that are most responsible for differentiating soil microbial communities. Soil microorganisms are usually affected by different agricultural management practices, particularly by the crops grown on soils (9). In a previous study (10), a small subset of 26 PLFAs was chosen as the initial basis for the classification of soil samples by an artificial neural network method. The results indicated that it was possible to identify one soil/crop combination from another based on the PLFA contents.
However, for a more complex task involving more crop types, the full set of PLFAs could provide a better basis for classifying soil samples by using appropriate multivariate analysis methods. In this study, an expanded data set has been taken to explore the possibility of distinguishing soils related to nine crop types. A total of 72 PLFAs that were found to be present in at least 10% of all the samples have been included. Since many of these PLFAs are correlated with one another, two powerful multivariate data analysis methods that can directly deal with correlations among variables have been applied. They are partial least squares (PLS) (11-13) and regularized discriminant analysis (RDA) (14, 15). PLS has been widely used as a multivariate calibration tool, but it can also be applied as a modeling method to solve a pattern classification problem. It is then called discriminant PLS (D-PLS) (16, 17). RDA uses a regularized covariance matrix estimate for the conventional statistical discriminant analysis methods, such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). RDA has proven to be a powerful discriminant analysis method, and it always gives results equivalent to or better than LDA and QDA (18-20).

In addition to presenting the novel application of the D-PLS and RDA methods to PLFA data to distinguish soils depending on the crops grown on them, a second objective of this study was to examine the specific PLFAs that play more important roles in differentiating soil samples. Both the PLS and RDA methods were intended and developed as full-variable multivariate analysis techniques that could be used when large numbers of independent variables or highly correlated independent variables were involved. However, in some cases, simpler and therefore probably more robust models can be obtained by focusing on selective and discriminative variables.
This strategy is also more understandable because investigators often want to know which variables are more responsible for the classification. It was recently shown that the PLS model’s performance could also be improved by feature selection (21, 22). A variable selection approach based on the PLS regression coefficients has been proposed to improve the classification performance and to better understand which PLFA variables play more important roles in the classification.

*Corresponding author telephone: (315)268-3861; fax: (315)268-6610; e-mail: hopkepk@clarkson.edu.
Environ. Sci. Technol. 1999, 33, 3524-3530
10.1021/es990405n CCC: $18.00 © 1999 American Chemical Society. Published on Web 09/10/1999

The approach