Mass Spectrometry Analysis Via Metaheuristic Optimization Algorithms

Syarifah Adilah M.Y. (1), Ibrahim Venkat (2), Rosni Abdullah (3), Umi Kalsom Yusof (4)
1,2,3,4 School of Computer Sciences, Universiti of Sciences Malaysia, 11800 Penang, Malaysia.
1 Dept. of Computer Sciences and Mathematics, Universiti Teknologi MARA Pulau Pinang, 13500 Penang, Malaysia.
1 syarifah.adilah@ppinang.uitm.edu.my, 2 ibrahim@cs.usm.my, 3 rosni@cs.usm.my and 4 umiyousof@cs.usm.my

Abstract—Biologically inspired metaheuristic techniques for extracting salient features from mass spectrometry data have recently been gaining momentum in related fields of research, viz. bioinformatics and proteomics. Such sophisticated approaches provide efficient ways to mine voluminous mass spectrometry data and extract potential features by getting rid of redundant information. This feature extraction process ultimately aids in discovering disease-related protein patterns in complex mixtures that are easily obtained from biological fluids such as serum and urine. This article provides an overview of such typical bio-inspired approaches.

Index Terms—metaheuristics; bioinformatics; feature selection; proteomics

I. INTRODUCTION

Analysis of biomarkers based on their diagnostic and prognostic potential has been growing as an active area of bioinformatics-oriented cancer research [1]. Well-known mass spectrometry techniques such as Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry (MALDI-TOF-MS) and Surface-Enhanced Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry (SELDI-TOF-MS) generate high-throughput proteomic patterns (structures of proteins) from complex mixtures such as serum, urine, nipple aspirate fluids and so on. Clinical researchers use these patterns to identify new biomarkers from the associated protein expression levels.
The output of this Mass Spectrometry (MS) analysis is a spectrum, which can be represented as an xy-graph of mass-to-charge ratio (m/z) versus ionization intensity. The significant information in the spectrum comprises the intensity peaks and their corresponding m/z values. However, as MS data is high-dimensional, it demands robust pattern recognition techniques that can cope with large amounts of redundant data.

Feature selection, a process of selecting a subset of the original features according to certain criteria, is an important and frequently used dimensionality reduction technique for data mining [2], [3]. It reduces the number of features, removes irrelevant, redundant, or noisy data, and brings immediate benefits for applications: it speeds up data mining algorithms and improves mining performance, such as predictive accuracy and comprehensibility of results. In a biological context, the technique is also called discriminative gene selection, which detects influential genes based on DNA microarray experiments. In MS analysis, feature selection plays two vital roles. First, it drives a search for significant features that discriminate disease samples from control samples. Second, it helps to construct an appropriate classification model that enables the identification of potential biomarkers for further analysis.

Feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. In contrast, subset selection searches the set of possible features for the optimal subset; that is, it evaluates a subset of features as a group for suitability. Further, subset selection algorithms can be classified into three categories, viz. wrappers, filters and embedded methods [3].
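The distinction between feature ranking and subset selection described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the paper: the toy intensity values, the Welch t-statistic used as the ranking metric, the threshold of 2.0, the leave-one-out nearest-centroid criterion used to judge a subset, and the helper names `welch_t`, `rank_features`, `loo_accuracy` and `best_subset`.

```python
from itertools import combinations
from statistics import mean, variance

# Hypothetical toy spectra: one row per sample, one column per m/z bin.
# The intensity values are invented purely for illustration.
disease = [
    [9.0, 2.1, 5.0, 1.0],
    [8.5, 1.9, 5.2, 1.4],
    [9.4, 2.0, 4.9, 0.8],
]
control = [
    [4.0, 2.0, 5.1, 1.1],
    [4.6, 2.2, 4.8, 0.9],
    [3.9, 1.8, 5.0, 1.3],
]


def column(rows, j):
    """Return the intensities of feature (m/z bin) j across all rows."""
    return [row[j] for row in rows]


def welch_t(a, b):
    """Absolute Welch t-statistic between two samples (assumed metric)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / standard_error


def rank_features(pos, neg, min_score):
    """Feature ranking: score each feature on its own, best first,
    and eliminate every feature whose score is below min_score."""
    n_features = len(pos[0])
    scores = {j: welch_t(column(pos, j), column(neg, j))
              for j in range(n_features)}
    ordered = sorted(scores, key=scores.get, reverse=True)
    kept = [j for j in ordered if scores[j] >= min_score]
    return kept, scores


def loo_accuracy(pos, neg, subset):
    """Judge a subset as a group: leave-one-out accuracy of a
    nearest-centroid rule restricted to the features in subset."""
    samples = [(x, 1) for x in pos] + [(x, 0) for x in neg]
    correct = 0
    for i, (x, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        distance = {}
        for cls in (0, 1):
            members = [s for s, lab in rest if lab == cls]
            centroid = [mean(column(members, j)) for j in subset]
            distance[cls] = sum((x[j] - c) ** 2
                                for j, c in zip(subset, centroid))
        correct += min(distance, key=distance.get) == label
    return correct / len(samples)


def best_subset(pos, neg, size):
    """Subset selection by exhaustive search over all subsets of a size."""
    n_features = len(pos[0])
    return max(combinations(range(n_features), size),
               key=lambda subset: loo_accuracy(pos, neg, subset))


kept, scores = rank_features(disease, control, min_score=2.0)
best = best_subset(disease, control, size=2)
print("features kept by ranking:", kept)
print("best pair found by subset search:", best)
```

Ranking scores each m/z bin in isolation, whereas the subset search scores candidate groups of bins together. The exhaustive enumeration used here is only practical for a handful of features, because the number of candidate subsets grows combinatorially with the number of m/z bins.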
Wrappers use a search algorithm to explore the space of possible features and evaluate each subset by running a model on it. Wrappers can be computationally expensive and carry a risk of overfitting to the model. However, this drawback can be reduced by injecting heuristic techniques into the search process to arrive at an optimal subset. Filters are similar to wrappers in the search approach, but instead of evaluating each subset against a model, they evaluate a simpler filter criterion. Filter-based feature ranking techniques rank features independently, without involving any learning algorithm. Feature ranking consists of scoring each feature according to a particular method and then selecting features based on their scores. Filter methods are the most commonly applied techniques in bioinformatics studies, since they have proven to be computationally simple, fast and independent of other analysis algorithms. They also allow features to be quantified and prioritized according to their scores, which is particularly important for biological

978-0-7695-4514-1/11 $26.00 © 2011 IEEE DOI 10.1109/BIC-TA.2011.7
2011 Sixth International Conference on Bio-Inspired Computing: Theories and Applications