Proteomic Signatures: Amino Acid and Oligopeptide Compositions Differentiate Among Phyla

Itsik Pe’er, Clifford E. Felder,2 Orna Man,1,2 Israel Silman, Joel L. Sussman,2†* and Jacques S. Beckmann1‡*

1 Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
2 Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel
3 Department of Neurobiology, Weizmann Institute of Science, Rehovot, Israel

ABSTRACT Availability of complete genome sequences allows in-depth comparison of single-residue and oligopeptide compositions of the corresponding proteomes. We have used principal component analysis (PCA) to study the landscape of compositional motifs across more than 70 genera from all three superkingdoms. Unexpectedly, the first two principal components clearly differentiate archaea, eubacteria, and eukaryota from each other. In particular, we contrast compositional patterns typical of the three superkingdoms and characterize differences between species and phyla, as well as among patterns shared by all compositional proteomic signatures. These species-specific patterns may even extend to subsets of the entire proteome, such as proteins pertaining to individual yeast chromosomes. We identify factors that affect compositional signatures, such as living habitat, and detect strong eukaryotic preference for homopeptides and palindromic tripeptides. We further detect oligopeptides that are either universally over- or underabundant across the whole proteomic landscape, as well as oligopeptides whose over- or underabundance is phylum- or species-specific. Finally, we report that species composition signatures preserve evolutionary memory, providing a new method to compare phylogenetic relationships among species that avoids problems of sequence alignment and ortholog detection. Proteins 2004;54:20–40. © 2003 Wiley-Liss, Inc.
Key words: phylogenetics; principal component analysis; proteome composition

INTRODUCTION

Over 100 genomes have been fully sequenced to date, providing an opportunity for comprehensive comparison and analysis of their organization, similarity, uniqueness, and variability at the sequence level. Comparative analysis of the proteomes derived from these genomes has already proven powerful in gene identification and in prediction of structure, function, and active sites of proteins, as well as in phylogenetic analysis. These analyses are usually based on a per-sequence comparison (e.g., see Gribaldo and Philippe1). However, such studies suffer from a major difficulty imposed by the requirement for orthologs of the analyzed proteins from all compared species. Even when orthologs are present, their detection is often prone to error. Furthermore, the resemblance of a specific (or putative) protein may not be representative of species relatedness because of ancestral gene duplication, pseudogenization, or lateral gene transfer (LGT). Finally, the success of such comparative analyses greatly depends on the quality of the sequence alignment, which is hard to control automatically for large data sets.

With available complete-genome data sets, one can pursue complementary per-proteome approaches and address general, global properties. In contrast to gene-based approaches, whole-proteome analyses can be performed in the absence of any ortholog knowledge of the encoded products. Indeed, such approaches can be powerful, as illustrated by recent studies.2–5

Prominent approaches of this type consider high-level organization, such as chromosomal gene order and compo-

Grant sponsor: Israel Ministry of Science and Technology Grant for Interdisciplinary Studies
Grant sponsor: The Israel Ministry of Science and Technology grant for the Israel Structural Proteomics Center.
Grant sponsor: European Commission Fifth Framework “Quality of Life and Management of Living Resources” program under contract number: QLK3-2000-00650
Grant sponsor: European Commission Fifth Framework “Quality of Life and Management of Living Resources” ‘SPINE’ Project; Grant number: QLG2-CT-2002-00988
Grant sponsor: Helen & Milton A. Kimmelman Center for Biomolecular Structure and Assembly (Rehovot, Israel)
Grant sponsor: Benoziyo Center for Neurosciences (Rehovot, Israel).
Grant sponsor: Kalman and Ida Wolens Foundation (Rehovot, Israel).
Grant sponsor: Jean and Jula Goldwurm Memorial Foundation (Rehovot, Israel).
Grant sponsor: Divadol Foundation (Rehovot, Israel).

Itsik Pe’er is a recipient of the ESHKOL fellowship by the Israeli Ministry of Science and Technology.
§ Israel Silman is the Bernstein–Mason Professor of Neurochemistry.
† Joel L. Sussman is the Morton and Gladys Pickman Professor of Structural Biology.
‡ Jacques S. Beckmann is the Hermann Mayer Professor of Molecular Genetics.
* Correspondence to: Joel L. Sussman, Department of Structural Biology, Weizmann Institute of Science, Rehovot, 76100 Israel. E-mail: joel.sussman@weizmann.ac.il, or Jacques S. Beckmann, Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 76100 Israel. E-mail: jacqui.beckmann@weizmann.ac.il

Received 8 April 2003; Accepted 17 June 2003

PROTEINS: Structure, Function, and Bioinformatics 54:20–40 (2004) © 2003 WILEY-LISS, INC.