Current Proteomics, 2007, 4, 121-130
1570-1646/07 $50.00+.00 ©2007 Bentham Science Publishers Ltd.

Current Status of Computational Approaches for Protein Identification Using Tandem Mass Spectra

Xiang Zhang*, Cheolhwan Oh, Catherine P. Riley and Charles Buck

Bindley Bioscience Center, Purdue University, West Lafayette, IN 47906, USA

Abstract: Proteomics is a still-evolving combination of technologies to describe and characterize all expressed proteins in a biological system. Because mass spectrometers have an upper limit on the mass they can detect, the bottom-up approach, in which tryptic peptides are quantified and identified from complex protein mixtures, is most widely employed. Protein identification from tandem mass spectra is still a challenge in proteomics. Two approaches have been developed to identify proteins from tandem mass spectra: database searching and de novo sequencing. These approaches typically have positive identification rates of only ~10-20%, and exhibit high false positive identification rates. This review surveys existing algorithms developed for database searching and de novo sequencing, with a focus on recent developments for tandem mass spectrum quality assessment, peptide identification using annotated spectra libraries, statistical approaches to assess identification quality, and methods for constrained searches. We also review research comparing the performance of existing protein identification packages.

Key Words: Protein identification, mass spectrometry, database searching, de novo sequencing, data preprocessing, intensity based identification.

1. INTRODUCTION

The purpose of proteomics research is to understand living systems at the protein level by systematically identifying and quantifying all proteins expressed in the biological system at a particular time and under specific conditions.
In recent years, tandem mass spectrometry has become a standard tool for protein identification (Aebersold and Mann, 2003; Lambert et al., 2005). Due to the complexity of the proteome, a major effort in proteomics research is devoted to fractionation of proteins and peptides prior to mass spectrometry (MS). Typically, proteins are digested into short peptides with enzymes such as trypsin that cleave at specific sites along the protein backbone. Liquid chromatography (LC) is commonly used to simplify peptide mixtures to the extent required for MS analysis of the component peptides, and multiple LC columns may be employed in tandem (Hattan et al., 2005; Qiu et al., 2007) before the fractionated peptide mixture is analyzed by MS. Individual peptides are usually isolated and further fragmented by collision-induced dissociation (CID) during MS analysis to yield the mass-to-charge ratios (m/z) of the resulting fragment ions (MS/MS of parent peptides). In a single experiment, many charged fragments are formed by CID of multiple copies of the same peptide. Since peptides usually break at a peptide bond under CID, the resulting tandem mass spectrum (MS/MS) contains information about the amino acid sequence of the peptide. Thousands of MS/MS spectra are generated in a single proteomics experiment.

*Address correspondence to this author at the Bindley Bioscience Center, Purdue University, West Lafayette, IN 47906, USA; Tel: (765)496-1153; Fax: (765)496-1518; E-mail: zhang100@purdue.edu

The goal of protein identification is to reconstruct the peptide sequence from these MS/MS spectra and assign the identified peptides to the proteins from which they were generated by enzymatic digestion. This analysis process consists of three distinct components: a peptide search engine, peptide validation, and protein validation. The peptide search engine finds a list of peptide candidates for each tandem spectrum.
Each of these peptide candidates is ranked using a specific scoring method. Peptide validation determines the confidence of each peptide identification provided by the search engine. Finally, protein validation infers protein identities from the peptides that meet the specified peptide validation criteria.

Information contained in MS/MS spectra is the basis for protein identification. However, this information is generally incomplete for reconstructing peptide sequences. Additionally, some spectral peaks may result from breakage of non-peptide bonds, and these introduce chemical ‘noise’ into the analysis. Co-eluting peptides with similar m/z ratios further complicate the tandem spectra. These and other confounding factors make reliable interpretation of peptide MS/MS spectra a great challenge. Two methods are currently used for protein identification: database searching and de novo sequencing. The database searching approach attempts to identify proteins by comparing each experimental tandem spectrum with the predicted spectrum of each candidate peptide generated in silico from the protein sequences recorded in a protein database. The de novo sequencing approach infers the amino acid sequence of a peptide directly from its tandem mass spectrum.

Here we review protein identification informatics. We survey existing algorithms developed for database searching and de novo sequencing, with a focus on recent developments for tandem spectrum quality assessment, peptide identification using annotated spectra libraries, statistical approaches to assess identification quality, and