RESEARCH ARTICLE

Searching for biomarkers of heart failure in the mass spectra of blood plasma

Richard Willingale¹, Donald J. L. Jones², John H. Lamb², Paulene Quinn³, Peter B. Farmer² and Leong L. Ng³

¹ Department of Physics and Astronomy, University of Leicester, Leicester, UK
² Cancer Biomarkers and Prevention Group, Biocentre, University of Leicester, Leicester, UK
³ Department of Cardiovascular Sciences, Clinical Sciences Building, Leicester Royal Infirmary, Leicester, UK

We have developed a technique for analysing blood plasma using MALDI-MS with subsequent data analysis to identify significant and specific differences between heart failure (HF) patients and healthy individuals. A training dataset comprising 100 HF patients and 100 healthy individuals was used to search for biomarkers (m/z range 1000–10 000). EWP cartridges, when used in tandem with microcon centrifugal filters, were found to give the best results. A data management chain including event binning, background subtraction and feature extraction was developed to reduce the data, and statistical analysis was used to map feature intensities onto a common scale. Various mathematical approaches including a simple cumulative score, support vector machines (SVM) and genetic algorithms (GAs) were then used to combine the results from individual features and provide a robust classification algorithm. The SVM gave the most promising results (accuracy 95%, receiver operating characteristic (ROC) score of 0.997 using 18 selected features). Finally, a test dataset comprising a further 32 HF patients and 20 controls was used to verify that the 18 putative biomarkers and classification algorithms gave reliable predictions (accuracy 88.5%, ROC score 0.998).
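The two figures of merit quoted above, accuracy and ROC score, measure different things: accuracy depends on a chosen decision threshold, whereas the ROC score does not. The following is a minimal sketch of how both could be computed from classifier outputs, assuming that "ROC score" means the area under the ROC curve (the text above does not define it). The function names, the zero threshold and the example decision values are hypothetical and are not taken from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    labels: 1 for patient (positive), 0 for control (negative).
    A tie between a positive and a negative score counts as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Fraction of (patient, control) pairs ranked in the correct order.
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.0):
    """Fraction classified correctly when score > threshold means patient."""
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


# Hypothetical decision values: 4 patients (label 1) and 4 controls (label 0).
scores = [2.1, 0.8, 0.3, -0.2, 0.5, -0.4, -1.0, -1.7]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(roc_auc(scores, labels))   # 0.875
print(accuracy(scores, labels))  # 0.75
```

In this toy example 14 of the 16 patient-control pairs are ranked correctly, while only 6 of the 8 samples fall on the correct side of the threshold, so the ROC score can exceed the accuracy, as it does for the test dataset above.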
Received: May 18, 2006
Accepted: July 26, 2006

Keywords: Biomarkers / Data management / Heart failure / Mass spectrometry / Normalisation

Proteomics 2006, 6, 5903–5914

Correspondence: Dr. Donald J. L. Jones, Cancer Biomarkers and Prevention Group, University of Leicester, Leicester LE1 7RH, UK
E-mail: djlj1@le.ac.uk
Fax: +44-116-223-1840

Abbreviations: ANN, artificial neural network; FWHM, full width at half maximum; GA, genetic algorithm; HF, heart failure; RFI, recursive feature inclusion; ROC, receiver operating characteristic; SVM, support vector machine

DOI 10.1002/pmic.200600375
© 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
www.proteomics-journal.com

1 Introduction

Potential biomarkers of heart failure (HF) may be present in mass spectra of blood plasma or sera produced by MALDI-MS or similar techniques. Given the inherent variation in human populations, the search for reliable diagnostic biomarkers is likely to depend on interpreting multiple markers together rather than on single measurements. Following mass spectrometric analysis, interrogation of the large multidimensional datasets using various mathematical methodologies can provide powerful diagnostic potential. The identification of similar biomarker patterns has been attempted for early stroke diagnosis [1], detection of ovarian cancer [2], and superficial bladder cancer [3]. Methods for the detection and identification of marker proteins in serum samples have been described by Baggerly et al. [4] and Zhu et al. [5], and the problems of reproducibility and noise have been discussed by Anderle et al. [6]. A number of data analysis techniques have been employed in this type of approach, namely artificial neural networks (ANN) [7], genetic algorithms (GA) [2], decision forests (DF) [8] and support vector machines (SVM) [1, 9]. The detection and identification of such markers is difficult because of natural variation of peptide constituents in the blood and variation in the spectra introduced by the experimental extraction and