MIXTURE MODEL CLUSTERING FOR PEAK FILTERING IN METABOLOMICS

Simon Rogers 1, Rónán Daly 2 and Rainer Breitling 2,3

1 School of Computing Science, University of Glasgow, UK
2 Institute of Molecular, Cell and Systems Biology, University of Glasgow, UK
3 Groningen Bioinformatics Centre, University of Groningen, Netherlands

simon.rogers@glasgow.ac.uk, ronan.daly@glasgow.ac.uk, rainer.breitling@glasgow.ac.uk

ABSTRACT

In recent years, the use of liquid chromatography coupled to mass spectrometry has enabled the high-throughput profiling of the metabolic composition of biological samples. However, the large amount of data obtained is often difficult to analyse. This paper focuses on a particular problem, that of detecting and potentially removing derivative peaks of a substance of interest. A mixture model for clustering peaks based on chromatographic peak shape correlation is presented and compared with the behaviour of a leading mass spectrometry analysis tool. Based on the results, the mixture model is shown to have better overall performance characteristics.

1. INTRODUCTION

Recent studies have shown that changes in an organism’s metabolome composition are more closely correlated with phenotypic variation than changes in either the transcriptome or the proteome [1], motivating the use of high-throughput metabolomic assays for a wide variety of biomedical applications. The most popular method for analysing the metabolome is mass spectrometry (MS) coupled to a separation phase, resulting in data consisting of a set of chromatographic peaks characterised by their mass and retention time. Identifying the peaks in these spectra makes metabolomic analysis more challenging than, for example, microarray analysis, where the structure of the mRNA makes it possible to build a probe to which the molecule of interest will bind with high specificity.
Whilst, in some cases, matching a measured mass to a database of known masses will result in a correct identification, sometimes several matches might be found within a window defined by the mass accuracy of the equipment [2]. The problem is made more complicated by the fact that an experimental sample containing a few dozen metabolites will result in the production of several hundred peaks. The large number of additional peaks can come from various sources, including isotopes, adducts, molecular fragments, and multiply charged ions [3]. As some of these derivative peaks will have masses very similar to known metabolites, filtering them out is crucial to avoid the overwhelming number of false identifications that would be produced if they were all compared to a database of known masses.

Figure 1. A chromatogram, showing peak intensity as a function of retention time in the separation phase. Co-eluting peaks often have a similar shape; the peaks tightly clustered around 863 s (boxed) are derived from the same metabolite.

Fortunately, peaks that are derived from the same metabolite share characteristics that make it possible to group them together. In particular, they will elute from the separation phase at the same time (their retention times will be very similar) and their peaks will have similar shapes [3], as can be seen in Figure 1. Here, we introduce a mixture model for clustering peaks based on the correlation between their peak shapes.

The remainder is organised as follows: In the next section we briefly introduce mzMatch, one of the most popular open-source metabolomics analysis pipelines, as this is the system against which we compare our proposed approach. In Section 3 we introduce our model and in Section 4 demonstrate its performance on some standard metabolomic datasets. Finally, in Section 5 we present a brief discussion and conclusions.

2. MZMATCH

mzMatch is a popular open-source metabolomic data analysis pipeline that enables researchers to perform end-to-end analysis of diverse biological datasets [4]. Whilst the complete toolkit performs many functions from peak extraction