ORIGINAL ARTICLE

Assignment of MS-based metabolomic datasets via compound interaction pair mapping

Geoffrey T. Gipson · Kay S. Tatsuoka · Bahrad A. Sokhansanj · Rachel J. Ball · Susan C. Connor

Received: 10 May 2007 / Accepted: 1 October 2007 / Published online: 11 December 2007
© Springer Science+Business Media, LLC 2007

Metabolomics (2008) 4:94–103
DOI 10.1007/s11306-007-0096-9

Electronic supplementary material: The online version of this article (doi:10.1007/s11306-007-0096-9) contains supplementary material, which is available to authorized users.

G. T. Gipson (✉) · B. A. Sokhansanj
School of Biomedical Engineering, Science, and Health Systems, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104, USA
e-mail: gtg25@drexel.edu

G. T. Gipson · K. S. Tatsuoka
GlaxoSmithKline, Informatics, 1250 South Collegeville Road, Collegeville, PA 19426, USA

G. T. Gipson · R. J. Ball · S. C. Connor
Safety Assessment, GlaxoSmithKline, Park Road, Ware, Herts SG12 0DP, UK

Abstract

Assignment of physical meaning to mass spectrometry (MS) data peaks is an important scientific challenge for metabolomics investigators. Improvements in instrumental mass accuracy reduce the number of spurious database matches; however, this alone is insufficient for accurate, unique high-throughput assignment. We present a method for clustering MS instrumental artifacts and a stochastic local search algorithm for the automated assignment of large, complex MS-based metabolomic datasets. Artifact peaks and their associated source peaks are grouped into "instrumental clusters." Instrumental clusters, peaks grouped together by shared peak shape in the temporal domain, serve as a guide for the number of assignments necessary to completely explain a given dataset. We refine mass-only assignments through the intersection of peak correlation pairs with a database of biochemically relevant interaction pairs. Further refinement is achieved through a stochastic local search optimization algorithm that selects individual assignments for each instrumental cluster. The algorithm works by choosing the peak assignment that maximally explains the connectivity of a given cluster. We demonstrate that this methodology provides a significant advantage over standard methods for the assignment of metabolites in a UPLC-MS diabetes dataset.

Keywords: Peak assignment · Chromatography · Mass spectrometry · Metabolomics · Informatics

1 Introduction

Mass spectrometry (MS) methods are important data platforms for metabolomics investigators (Want et al. 2005). A particular challenge of global metabolite profiling, whether using MS or nuclear magnetic resonance (NMR), is the assignment of spectral peaks of interest (Kell 2004). Informatics methods have been developed to help reduce this major bottleneck, although most of these approaches have not yet been fully validated in the context of analytically confirmed assignments. The proposed solutions have employed mass-only database search methods (Smith et al. 2006), refined mass database search methods utilizing isotopic patterns (Kind and Fiehn 2006), mass spectral libraries (Kopka et al. 2005), and ab initio mass transformation pairs (Breitling et al. 2006a, b) for the putative assignment of metabolites in high-throughput metabolomic datasets.

Correlation networks of the assigned components of metabolomic datasets have been suggested for the construction of metabolic networks (Arkin et al. 1997; Steuer et al. 2003a). Although metabolic neighbors in shared biochemical pathways have been observed to be significantly correlated, evaluations of modeled and experimental data suggest that observed correlation networks do not