Electrophoresis 1998, 19, 3199-3206

Multiple parameter cross-species protein identification using MultiIdent - a world-wide web accessible tool

Marc R. Wilkins1,2
Elisabeth Gasteiger2
Colin H. Wheeler3
Ingrid Lindskog4
Jean-Charles Sanchez1
Amos Bairoch2
Ron D. Appel4
Michael J. Dunn3
Denis F. Hochstrasser1,2

1Central Clinical Chemistry Laboratory, Geneva University Hospital, Geneva, Switzerland
2Medical Biochemistry Department, University of Geneva, Geneva, Switzerland
3Department of Cardiothoracic Surgery, National Heart and Lung Institute, Imperial College School of Medicine, Heart Science Centre, Harefield Hospital, Harefield, UK
4Swiss Institute of Bioinformatics, Geneva, Switzerland

Recent increases in the number of genome sequencing projects mean that the amount of protein sequence in databases is increasing at an astonishing pace. In proteome studies, this is facilitating the identification of proteins from molecularly well-defined organisms. However, in studies of proteins from the majority of organisms, proteins must be identified by comparing analytical data to sequences in databases from other species. This process is known as cross-species protein identification. Here we present a new program, MultiIdent, which uses multiple protein parameters such as amino acid composition, peptide masses, sequence tags, estimated protein pI and mass, to achieve cross-species protein identification. The program is structured so that protein amino acid composition, which is highly conserved across species boundaries, first generates a set of candidate proteins. These proteins are then queried with other protein parameters such as sequence tags and peptide masses. A final list of database entries which considers all analytical parameters is presented, ranked by an integrated score. We illustrate the power of the approach with the identification of a set of standard proteins, and the identification of proteins from dog heart separated by two-dimensional gel electrophoresis. The MultiIdent program is available on the world-wide web at: http://www.expasy.ch/sprot/multiident.html.

Keywords: Protein identification / Two-dimensional polyacrylamide gel electrophoresis / Peptide mass fingerprinting / Sequence tag / Amino acid composition / Proteomics

Correspondence: Dr. Marc R. Wilkins, Macquarie University Centre for Analytical Biotechnology and Australian Proteome Analysis Facility, Macquarie University NSW 2109 Australia (Tel: +61-2-9850-6267; Fax: +61-2-9850-8174; E-mail: mwilkins@proteome.org.au)

1 Introduction

Proteome projects involve the identification and characterisation of large numbers of proteins in an organism [1-4]. Frequently proteins are separated by two-dimensional gel electrophoresis, followed by the application of protein identification techniques such as microsequencing, “tag” sequencing, amino acid composition, peptide mass fingerprinting, or mass spectrometry sequencing (reviewed in [5]). There is an impressive array of computer programs available to assist in making these identifications, many of which are available on the world-wide web.

As the sequencing of genes and genomes is advancing at an astonishing pace, one might expect that it is becoming increasingly easy to identify a protein, or large numbers of proteins, with high confidence. This is certainly true for organisms whose genomes are sequenced and available in public databases, such as Haemophilus influenzae and Saccharomyces cerevisiae [6, 7]. However, it is not widely appreciated that the bulk of information in protein sequence databases comes from a small number of species. For example, 48% of entries in release 34 of the SWISS-PROT database [8] come from just 20 organisms. Thus the researcher working with a less popular (or a poorly molecularly characterised) organism may face difficulties if undertaking large-scale protein identifications, as identification will need to be done by comparing analytical results across species boundaries.

Cross-species identification of proteins from 2-D gels remains a challenge. Clearly the best way of achieving confident cross-species identification would be to generate extensive primary sequence data. If this is done by Edman degradation, it is expensive to do on a large scale and proteins can be blocked at their amino termini, thus yielding no sequence. Tandem mass spectrometric sequencing techniques could be applied using ion trap, triple quadrupole or quadrupole time-of-flight apparatus (e.g. [9]); however, the sequencing of peptides de novo, as opposed to the assigning of sequence by matching fragmentation data against peptides in a database, remains difficult. As an alternative, it has been shown that cross-species protein identification can be undertaken with the parameters of protein amino acid composition and peptide masses [10, 11]. Detailed theoretical studies have been undertaken to verify the efficacy of this approach [12], and efforts have been made to define how these parameters are conserved and thus best used across species boundaries [13, 14]. This method is yet to be widely adopted, at least partly because there has been no widely available database matching tool that can accept this variety of analytical parameters.

Here we describe an advanced computer program, “MultiIdent”, for the identification of proteins using multiple protein parameters of estimated mass and pI, amino acid composition, peptide masses, and protein sequence tags of six amino acids or fewer.
In its current form, this program is optimised for the identification of proteins across species boundaries, against the SWISS-PROT protein database [15]. We illustrate the utility of this program with the cross-species identification of standard proteins, and dog heart proteins separated by 2-D gel electrophoresis.

© WILEY-VCH Verlag GmbH, 69451 Weinheim, 1998 0173-0835/98/1818-3199 $17.50+.50/0