Ortho-proteogenomics: Multiple proteomes
investigation through orthology and a new
MS-based protocol
Sébastien Gallien,1,8 Emmanuel Perrodou,2,3,4,5 Christine Carapito,1 Caroline Deshayes,6,7 Jean-Marc Reyrat,6,7 Alain Van Dorsselaer,1 Olivier Poch,2,3,4,5 Christine Schaeffer,1 and Odile Lecompte2,3,4,5
1 Laboratoire de Spectrométrie de Masse Bio-Organique, IPHC-DSA, ULP, CNRS, UMR7178, 67 087 Strasbourg, France;
2 Department of Structural Biology and Genomics, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), F-67400 Illkirch, France;
3 INSERM, U596, F-67400 Illkirch, France;
4 CNRS, UMR7104, F-67400 Illkirch, France;
5 Faculté des Sciences de la Vie, Université Louis Pasteur, F-67000 Strasbourg, France;
6 Faculté de Médecine René Descartes, Université Paris Descartes, Paris Cedex 15, F-75730, France;
7 INSERM, U570, Unité de Pathogénie des Infections Systémiques, Paris Cedex 15, F-75730, France
Progress in sequencing technologies supplies biology with an ever-increasing number of genome sequences. In
most cases, the gene repertoire is predicted in silico and conceptually translated into proteins. As recently
highlighted, the predicted genes exhibit frequent errors, particularly in start codons, with a serious impact on
subsequent biological studies. A new “ortho-proteogenomic” approach is presented here for the annotation
refinement of multiple genomes at once. It combines comparative genomics with an original proteomic protocol that
allows the characterization of both N-terminal and internal peptides in a single experiment. This strategy was applied
to the Mycobacterium genus with Mycobacterium smegmatis as the reference, and identified 946 distinct proteins, including
443 characterized N termini. These experimental data allowed the correction of 19% of the characterized start
codons, the identification of 29 proteins missed during the annotation process, and the curation, thanks to
comparative genomics, of 4328 sequences of 16 other Mycobacterium proteomes.
[Supplemental material is available online at www.genome.org.]
The increasing availability of data from multiple genome sequencing projects provides biologists with an invaluable framework to integrate experimental results and design new experiments at different scales. However, several recent studies have highlighted the prevalence of gene prediction errors, even in the “simple” prokaryotic genomes. Genome sequencing itself represents a non-negligible source of errors (Weinstock 2000), but despite major advances, most inconsistencies result from in silico predictions (Galperin et al. 1998). Among these errors, the incorrect prediction of initiation codons in prokaryotic genomes is particularly widespread (Aivaliotis et al. 2007). For instance, error rates in start codon prediction vary from 10% to 44% in Halobacterium salinarum and Natronomonas pharaonis (Aivaliotis et al. 2007), depending on the gene prediction program used. This reality is often underestimated or even ignored by biologists, even though the correct definition of genes is critical for subsequent in silico and experimental studies. For example, by altering the definition of the coding sequence of a gene, an erroneous start codon can hamper the detection of regulatory motifs on the genome or even mask another gene in a compact genome (Salgado et al. 2000; Edwards et al. 2005). Moreover, the protein sequence itself can be either truncated or extended, leading to errors in bioinformatics protein characterization (function, localization, etc.) and, obviously, to major difficulties in protein expression experiments (Trivedi et al. 2004; Horie et al. 2007). The second highly detrimental error encountered in prokaryotic genome annotation is under-prediction of small genes or genes exhibiting an unusual composition. The accumulation of erroneous information in genomic and protein databases will continue to grow since features are frequently transferred from annotated to unknown sequences (Doerks et al. 1998), which only amplifies the errors.
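The effect of a mispredicted start codon on the protein sequence can be illustrated with a minimal Python sketch. The toy gene `TOY_GENE` and the helper `translate_from` are hypothetical and introduced only for illustration; they are not part of the protocol described in this article. The sketch assumes the standard codon assignments and the bacterial convention that an initiation codon (ATG, GTG, or TTG) is decoded as methionine.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the standard codon assignments, which bacterial table 11 shares).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_from(dna: str, start: int) -> str:
    """Translate dna from index `start` up to the first in-frame stop codon.

    The initiation codon is always emitted as 'M', since alternative
    bacterial start codons such as GTG are decoded as methionine.
    """
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon ends the protein
            break
        protein.append("M" if i == start else aa)
    return "".join(protein)


# Hypothetical toy gene with two in-frame candidate starts:
# ATG at index 0 and GTG at index 9.
TOY_GENE = "ATGAAACGTGTGCTGACCGAATAA"

upstream = translate_from(TOY_GENE, 0)    # protein if the ATG is annotated
downstream = translate_from(TOY_GENE, 9)  # protein if the GTG is the true start

print(upstream, downstream, len(upstream) - len(downstream))
```

If the GTG is the true initiation site but the ATG is annotated, the predicted protein carries three spurious N-terminal residues; in the reverse case, the annotation truncates the protein by the same amount. Only an experimentally observed N-terminal peptide can distinguish the two.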
To break this vicious circle and to cope with the multiplication of prokaryotic genome data, including many projects aimed at exploring genetic diversity within a genus or a species by multiple-strain sequencing (Liolios et al. 2008), one cannot rely solely on manual curation. In this context, the proteogenomic approach, i.e., annotation refinement through proteomics, is promising and has already been used to investigate several bacterial genomes (Jaffe et al. 2004a,b; Wang et al. 2005; Gupta et al. 2007, 2008), revealing the expression of genes annotated as pseudogenes as well as some completely missed genes or some errors in start codon annotation. However, these high-throughput studies do not focus on the N-terminal identification of proteins, limiting the correction of gene boundaries. In contrast, other methods have aimed at the specific identification of N-terminal peptides from the digest of a protein extract (Gevaert et al. 2003; McDonald et al. 2005; McDonald and Beynon 2006), but these methods entail the loss of all internal peptides, which is a major drawback both for protein and proteome coverage.
Here, we report an original strategy coupling a new N-
8 Corresponding author. E-mail sgallien@chimie.u-strasbg.fr; fax 33-3-90-24-27-81.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.081901.108.
Methods
128 Genome Research
19:128–135 ©2009 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/09; www.genome.org