Journal of Chromatography A, 1256 (2012) 150–159
Journal homepage: www.elsevier.com/locate/chroma

New supervised alignment method as a preprocessing tool for chromatographic data in metabolomic studies

Wiktoria Struck, Paweł Wiczling, Małgorzata Waszczuk-Jankowska, Roman Kaliszan, Michał Jan Markuszewski
Department of Biopharmaceutics and Pharmacodynamics, Medical University of Gdańsk, Al. Gen. Hallera 107, 80-416 Gdańsk, Poland

Article history: Received 19 March 2012; received in revised form 19 June 2012; accepted 26 July 2012; available online 2 August 2012
Keywords: Metabolomics; HPLC; Retention time shift; Data alignment; Warping; Preprocessing methods

Abstract

The purpose of this work was to develop a new alignment algorithm, called supervised alignment, and to compare its performance with correlation optimized warping. Supervised alignment is based on a “supervised” selection of a few common peaks present in each chromatogram. The selected peaks are aligned according to the difference between the retention times of the selected analytes in the sample chromatogram and in the reference chromatogram. The retention times of the fragments between the known peaks are subsequently linearly interpolated. The performance of the proposed algorithm was tested on a series of simulated and experimental chromatograms. The simulated chromatograms comprised analytes with systematic or random retention time shifts. The experimental chromatographic (RP-HPLC) data were obtained during the analysis of nucleosides from 208 urine samples and contain both systematic and random displacements. All the data sets were aligned using both correlation optimized warping and supervised alignment.
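The alignment step summarized above (matching a few known peaks to the reference and linearly interpolating the retention times in between) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `supervised_align`, the use of NumPy, and the choice to pin the first and last time points of the chromatogram are assumptions made only for illustration.

```python
import numpy as np

def supervised_align(t, signal, peaks_sample, peaks_ref):
    """Warp `signal` so that the peaks at `peaks_sample` land on `peaks_ref`.

    t            : increasing time axis shared by sample and reference
    signal       : detector response of the sample, sampled on `t`
    peaks_sample : retention times of the selected peaks in the sample
    peaks_ref    : retention times of the same peaks in the reference
    Both peak lists must be sorted and lie strictly inside (t[0], t[-1]).
    """
    t = np.asarray(t, dtype=float)
    signal = np.asarray(signal, dtype=float)
    # Assumption: the start and end of the run are treated as fixed knots,
    # so the piecewise-linear warp covers the whole time axis.
    ref_knots = np.concatenate(([t[0]], peaks_ref, [t[-1]]))
    smp_knots = np.concatenate(([t[0]], peaks_sample, [t[-1]]))
    # For every time on the reference axis, find the matching time in the
    # sample by linear interpolation between the knots.
    t_in_sample = np.interp(t, ref_knots, smp_knots)
    # Read the sample signal at those matching times.
    return np.interp(t_in_sample, t, signal)

# Synthetic example (hypothetical data): three Gaussian peaks shifted
# by different amounts relative to the reference positions.
t = np.linspace(0.0, 10.0, 1001)
peaks_ref = np.array([2.0, 5.0, 8.0])
peaks_sample = np.array([2.3, 4.8, 8.4])
sample = sum(np.exp(-0.5 * ((t - p) / 0.1) ** 2) for p in peaks_sample)
aligned = supervised_align(t, sample, peaks_sample, peaks_ref)
```

In this sketch each selected peak is moved exactly onto its reference retention time, while the segments between peaks are stretched or compressed linearly; peaks that were not selected follow the interpolated shift of their segment.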
The time required to complete the alignment, the overall complexity of both algorithms, and the alignment quality measured by the average correlation coefficient were compared to assess the performance of the tested methods. In the case of systematic shifts, both methods lead to successful alignment. However, for random shifts, correlation optimized warping requires more time than supervised alignment (a few hours versus a few minutes), and the quality of the alignment, expressed as the correlation coefficient of the newly aligned matrix, is worse (0.8593 versus 0.9629). For the experimental data set, supervised alignment successfully aligns 208 samples using 10 previously identified peaks. Knowledge of the retention times of a few analytes in the data sets is necessary to perform supervised alignment for both systematic and random shifts. Supervised alignment is a faster, more effective and simpler preprocessing method than correlation optimized warping and can be applied to chromatographic and electrophoretic data sets.

© 2012 Elsevier B.V. All rights reserved.
0021-9673/$ see front matter. http://dx.doi.org/10.1016/j.chroma.2012.07.084
Corresponding author. Tel.: +48 58 349 3260; fax: +48 58 349 3262. E-mail address: markusz@gumed.edu.pl (M.J. Markuszewski).

1. Introduction

Now, in the era of evolving bioinformatics methods, there is a clear trend toward the analysis of the entire chromatographic data matrix rather than of selected peaks detected in the chromatograms. This broad approach does not require choosing individual analytes for integration and subsequent analysis and therefore does not cause data loss. Such a holistic approach overcomes the problem of the enormity of the data in areas such as metabolomics, which covers, inter alia, the analysis of metabolic profiles and metabolic fingerprinting, as well as the examination of interactions between the levels of not necessarily identified metabolites. By analyzing the entire chromatographic data matrix instead of concentrations or peak areas of selected analytes, more relevant information can be extracted about the analyzed sample using appropriate classification and prediction methods. However, prior to such chemometric analyses, it is necessary to correct the retention time shifts that occur either globally or in small sections of the chromatograms. The peaks are shifted because of unavoidable changes in the experimental conditions, caused by minor changes in the mobile phase composition or the stationary phase properties, or by the impact of the sample matrix (particularly in the case of biological matrices such as urine or serum).

Two types of peak shifts can be distinguished in a real set of chromatograms. In the first one, called systematic, the difference between the retention times of the corresponding analytes in two consecutive chromatograms, plotted against the retention time, is a continuous function. This is a very common situation and might be a consequence of column ageing, changes in chromatographic conditions, minor changes in the mobile phase composition, etc. In contrast, for the random displacement, the difference between the retention times of corresponding analytes is a random variable, so it affects each peak