Current Proteomics, 2007, 4, 121-130
1570-1646/07 $50.00+.00 ©2007 Bentham Science Publishers Ltd.
Current Status of Computational Approaches for Protein Identification
Using Tandem Mass Spectra
Xiang Zhang*, Cheolhwan Oh, Catherine P. Riley and Charles Buck
Bindley Bioscience Center, Purdue University, West Lafayette, IN 47906, USA
Abstract: Proteomics is a still-evolving combination of technologies to describe and characterize all expressed proteins in
a biological system. Because mass spectrometers have upper limits on the mass they can detect, the bottom-up approach,
in which tryptic peptides from complex protein mixtures are identified and quantified, is the most widely employed.
Protein identification from tandem mass spectra remains a challenge in proteomics. Two approaches have been developed to
identify proteins from tandem mass spectra: database searching and de novo sequencing. These approaches typically have
positive identification rates of only ~10-20% and exhibit high false positive identification rates. This review surveys
existing algorithms developed for database searching and de novo sequencing, with a focus on recent developments for tandem
mass spectrum quality assessment, peptide identification using annotated spectral libraries, statistical approaches to assess
identification quality, and methods for constrained searches. We also review research comparing the performance of existing
protein identification packages.
Key Words: Protein identification, mass spectrometry, database searching, de novo sequencing, data preprocessing,
intensity-based identification.
1. INTRODUCTION
The purpose of proteomics research is to understand living
systems at the protein level by systematically identifying
and quantifying all proteins expressed in the biological system
at a particular time and under specific conditions. In
recent years, tandem mass spectrometry has become a standard
tool for protein identification (Aebersold and Mann,
2003; Lambert et al., 2005). Due to the complexity of the
proteome, a major effort in proteomics research is devoted to
fractionation of proteins and peptides prior to mass spectrometry
(MS). Typically, proteins are digested into short
peptides with enzymes like trypsin that act at specific sites
along the backbone of the protein. Liquid chromatography
(LC) is commonly used to simplify peptide mixtures to the
extent required for mass spectrometry analysis of the component
peptides, and multiple LC columns may be employed
in tandem (Hattan et al., 2005; Qiu et al., 2007) prior to
analysis of the fractionated peptide mixture by MS. The individual
peptides are usually isolated and further fragmented
by collision-induced dissociation (CID) during MS analysis
to produce information about the mass-to-charge ratios (m/z)
of resulting fragment ions (MS/MS of parent peptides). In a
single experiment, many charged fragments are formed by
CID of multiple copies of the same peptide. Since peptides
usually break at a peptide bond under CID, the resulting tandem
mass spectrum (MS/MS) contains information about the
amino acid sequence of the peptide. Thousands of MS/MS spectra
are generated in a single proteomics experiment.
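
To illustrate why fragmentation at peptide bonds encodes sequence information, the following minimal sketch computes singly charged b- and y-ion m/z values for a peptide and then reads residues back from the differences between consecutive y ions. It is an idealized illustration, not a method from the studies cited here: the function names, the tolerance, and the example peptide TLNDR are assumptions, and real spectra contain missing peaks, noise and multiply charged ions.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.010565, 1.007276


def b_ions(peptide):
    """Singly charged b-ion m/z values (N-terminal fragments), b1 first."""
    total, out = PROTON, []
    for aa in peptide[:-1]:
        total += MONO[aa]
        out.append(total)
    return out


def y_ions(peptide):
    """Singly charged y-ion m/z values (C-terminal fragments), y1 first."""
    total, out = WATER + PROTON, []
    for aa in reversed(peptide[1:]):
        total += MONO[aa]
        out.append(total)
    return out


def read_ladder(mz_values, tol=0.01):
    """Map each consecutive m/z difference to the residue(s) it matches.

    Isobaric residues (L and I) cannot be told apart and are reported
    together; '?' marks a difference that matches no single residue.
    """
    residues = []
    for lo, hi in zip(mz_values, mz_values[1:]):
        diff = hi - lo
        match = [aa for aa, m in MONO.items() if abs(m - diff) <= tol]
        residues.append("/".join(match) if match else "?")
    return residues


if __name__ == "__main__":
    ladder = y_ions("TLNDR")  # hypothetical example peptide
    print(read_ladder(ladder))
```

For the hypothetical peptide TLNDR, the y-ion ladder reads back as D, N and then L/I, that is, the sequence from the C-terminal side with leucine and isoleucine left unresolved.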
*Address correspondence to this author at the Bindley Bioscience Center,
Purdue University, West Lafayette, IN 47906, USA; Tel: (765)496-1153;
Fax: (765)496-1518; E-mail: zhang100@purdue.edu
The goal of protein identification is to reconstruct the
peptide sequence from these MS/MS spectra and assign the
identified peptides to the proteins from which they were generated
by enzymatic digestion. This analysis process consists
of three distinct components: a peptide search engine, peptide
validation, and protein validation. The peptide search engine
finds a list of peptide candidates for each tandem spectrum.
Each of these peptide candidates is ranked using a specific
scoring method. Peptide validation determines the confidence
of a peptide identification provided by the search
engine and, finally, protein validation infers the protein identity
based on the selected peptides that meet the specified
criteria for peptide validation.
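
The three components described above can be sketched as a small pipeline. The sketch below is deliberately simplistic and makes several assumptions that are not taken from any specific package: the search engine output is represented as (spectrum, peptide, score) tuples, peptide validation is reduced to a fixed score threshold, and protein validation is reduced to requiring a minimum number of distinct validated peptides per protein. All names and values are hypothetical.

```python
from collections import defaultdict


def best_matches(psms):
    """Keep the top-ranked peptide candidate for each spectrum.

    psms: iterable of (spectrum_id, peptide, score); higher score is better.
    """
    best = {}
    for spectrum_id, peptide, score in psms:
        if spectrum_id not in best or score > best[spectrum_id][1]:
            best[spectrum_id] = (peptide, score)
    return best


def validate_peptides(best, threshold):
    """Peptide validation, here assumed to be a plain score cut-off."""
    return {sid: pep for sid, (pep, score) in best.items() if score >= threshold}


def infer_proteins(peptides, peptide_to_proteins, min_peptides=2):
    """Protein validation, here assumed to be a minimum-peptide rule."""
    support = defaultdict(set)
    for peptide in set(peptides):
        for protein in peptide_to_proteins.get(peptide, ()):
            support[protein].add(peptide)
    return {prot: sorted(peps) for prot, peps in support.items()
            if len(peps) >= min_peptides}


if __name__ == "__main__":
    # Hypothetical search engine output and peptide-to-protein mapping.
    psms = [("s1", "TLNDR", 42.0), ("s1", "GASPVK", 11.0),
            ("s2", "MHFYWK", 35.0), ("s3", "AAAK", 5.0)]
    mapping = {"TLNDR": ["P1"], "MHFYWK": ["P1", "P2"], "AAAK": ["P3"]}
    validated = validate_peptides(best_matches(psms), threshold=30.0)
    print(infer_proteins(validated.values(), mapping))
```

In practice both validation steps are usually statistical rather than threshold-based; the fixed cut-offs here only make the flow of data between the three components explicit.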
Information contained in MS/MS spectra is the basis for
protein identification. However, this information is generally
incomplete for reconstructing peptide sequences.
Additionally, some spectral peaks may result from breakage
of non-peptide bonds, and these introduce chemical ‘noise’ into
the analysis. Co-eluting peptides with similar m/z ratios further
complicate the tandem spectra. These and other confounding
factors make reliable interpretation of peptide MS/MS spectra
a great challenge. Two methods are currently deployed for
protein identification: database searching and de novo sequencing.
The database searching approach attempts to identify
proteins by comparing the experimental tandem spectra
with the predicted spectrum of each candidate peptide generated
in silico from protein sequences recorded in a protein database.
The de novo sequencing approach identifies the amino
acid sequence of peptides directly from the tandem mass
spectrum.
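
The database searching idea can be made concrete with a toy example. The sketch below digests protein sequences in silico with a simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages), keeps candidate peptides whose mass is close to the precursor mass, and ranks them by the number of predicted b/y peaks found in the observed spectrum. The shared peak count is used only because it is easy to follow; it is not the scoring function of any particular search engine, and the tolerances, function names and example sequences are assumptions.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.010565, 1.007276


def tryptic_peptides(protein):
    """In-silico digest: cleave after K/R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of the intact peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def predicted_peaks(peptide):
    """Singly charged b- and y-ion m/z values for every peptide bond."""
    masses = [MONO[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(peptide)):
        peaks.append(sum(masses[:i]) + PROTON)          # b ion
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y ion
    return peaks


def shared_peak_count(observed, predicted, tol):
    """Number of predicted peaks with an observed peak within tol."""
    return sum(any(abs(o - p) <= tol for o in observed) for p in predicted)


def search(observed, precursor_mass, database, mass_tol=1.0, frag_tol=0.5):
    """Return (score, peptide, protein) hits, best score first."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_peptides(sequence):
            if abs(peptide_mass(pep) - precursor_mass) <= mass_tol:
                score = shared_peak_count(observed, predicted_peaks(pep), frag_tol)
                hits.append((score, pep, name))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    database = {"P1": "GASPVKTLNDR", "P2": "MHFYWK"}  # hypothetical entries
    spectrum = predicted_peaks("TLNDR")  # idealized, noise-free spectrum
    print(search(spectrum, peptide_mass("TLNDR"), database))
```

Because the example spectrum is generated from the predicted peaks themselves, the correct peptide matches perfectly; with real data, incomplete fragmentation and noise peaks are what make scoring and subsequent validation necessary.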
Here we review protein identification informatics.
We survey existing algorithms developed for
database searching and de novo sequencing, with a focus on
recent developments for tandem spectrum quality assessment,
peptide identification using annotated spectral libraries,
statistical approaches to assess identification quality, and