An Application of the Metric Access Methods to the Mass Spectrometry Data

Jiří Novák and David Hoksza

Abstract: Mass spectrometry is currently a very popular method for protein and peptide identification. The amount of data generated in this way grows exponentially every year. Although algorithms for interpreting mass spectra exist, demand for faster and more accurate approaches remains. We propose an approach for preprocessing the protein sequence database based on metric access methods. This approach makes it possible to select only a small set of suitable peptide sequence candidates, which can then be compared with experimental spectra using more sophisticated algorithms. We define a logarithmic distance for selecting peptide sequence candidates and also outline the possibilities of using the interval query to search for posttranslational modifications. The experimental results show that our approach is comparable in precision with the most widely used public tools of today and outline possible directions for further research.

I. INTRODUCTION

Mass spectrometry [10] is a modern and fast method for protein sequence identification. Knowledge of protein sequences is important for studying protein functions, which are based on the protein structure determined by the sequences [20]. Mass spectrometry is also referred to as indirect sequencing, because the mass spectrometer does not determine a linear sequence of amino acids (the protein's primary structure); instead, it captures a collection of uninterpreted data called a mass spectrum. The protein sequence must subsequently be elucidated from the mass spectrum (see footnote 1) by a sophisticated algorithm.

Before the mass spectrometry analysis, an unknown protein sample (a number of molecules with identical structure) is digested by a specific enzyme into many shorter peptides. After the protein has been digested at specific cleavage sites, the sample is analyzed by the mass spectrometer.
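The digestion step described above can be imitated in silico, which is also how candidate peptides are usually derived from a sequence database. The following is a minimal sketch, not the authors' implementation. It assumes trypsin as the enzyme (the text only says "a specific enzyme") and the commonly used simplified rule that trypsin cleaves after K or R unless the next residue is P. The function name `digest`, its parameters, and the example sequence are hypothetical.

```python
def digest(protein: str, cleave_after: str = "KR", blocked_by: str = "P") -> list[str]:
    """Split a protein sequence into peptides at enzyme cleavage sites.

    Assumed rule (simplified trypsin): cleave after any residue in
    `cleave_after`, unless the following residue is in `blocked_by`.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in cleave_after and not is_last and protein[i + 1] not in blocked_by:
            peptides.append(protein[start:i + 1])  # peptide ends at the cleavage site
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # remaining C-terminal peptide
    return peptides


# Hypothetical example sequence, chosen only to exercise the rule.
peptides = digest("MKWVTFISLLRPSAYSRGVFRR")
print(peptides)
```

Real search tools typically also allow a configurable number of missed cleavages; the sketch omits this for brevity.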
Individual peptide molecules acquire charges (becoming ions) and are separated by their mass-to-charge ratio (m/z). In simple mass spectrometry (MS), the mass spectrum is a list of detected m/z ratios together with the intensities of their occurrence (a list of peaks). In tandem mass spectrometry (MS/MS), each peptide ion can additionally be split into various types of fragment ions; hence, for one analyzed protein we get a collection of tandem mass spectra (one spectrum for each peptide ion) [15]. The peptide ion common to all fragment ions in a tandem spectrum is called the parent peptide ion. The most common types of fragment ions, y, b and a, are shown in Fig. 1.

Fig. 1. The most common types of peptide fragment ions in a tandem mass spectrum (AA = amino acid).

Footnote 1: In the case of tandem mass spectrometry, from the collection of mass spectra.

Jiří Novák, Czech Technical University in Prague, Faculty of Electrical Engineering, Czech Republic, email: jirinovak@atlas.cz, and David Hoksza, Charles University in Prague, Faculty of Mathematics and Physics, Czech Republic, email: david.hoksza@mff.cuni.cz. This work was supported by GAUK grant under project code 57907 and GACR grant under project code 201/09/0683.

Two basic approaches are used for interpreting mass spectra. The first is based on direct interpretation of the spectra using graph algorithms (De Novo peptide sequencing) [7] and is typical for tandem mass spectra. The second is based on scanning databases of already known protein or peptide sequences; it is called Peptide Mass Fingerprinting (PMF) [9] for MS spectra and Peptide Fragment Fingerprinting (PFF) [13] for MS/MS spectra. A combined approach, Sequence Tag [16], can be used for tandem mass spectra: a short amino acid sequence (tag) is first determined manually or by a graph algorithm, and the database is then searched.

The interpretation of experimental spectra is complicated, because many peaks (up to 80%) usually cannot be recognized.
These unrecognizable peaks (called noise) correspond to ions with unpredictable structure. The noise arises from ions losing complex chemical groups, from ions breaking into many parts, or from admixtures in the analyzed sample, among other causes. Moreover, not all of the theoretically generated ions (e.g. y- or b-ions, which are the most important ones for peptide identification) need to have counterparts in the tandem mass spectra. These obstacles (noise and the absence of some important peaks) make De Novo identification a very difficult, if not practically impossible, task. On the other hand, a substantial drawback of the database approach is that the analyzed sequence, or a related one, must already exist in the database.

Databases often result from translation of known DNA sequences, hence previously unknown protein sequences can be identified. The identification of protein sequences is complicated if the proteins were modified after translation (posttranslational modifications, PTM) or during preparation of the sample for mass analysis, because the peaks can be shifted.