Towards discovering key factors in prediction of post-translational modifications

Marcin Tatjewski 1,2, Julian Zubek 1,2, Marcin Kierczak 3, Dariusz Plewczyński 2

1. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland.
2. Centre of New Technologies, University of Warsaw, Warsaw, Poland.
3. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.

Post-translational modifications

Post-translational modifications (PTMs) are processes that attach biochemical functional groups (e.g. phosphate, acetate, lipids) to amino acids. They change the chemical nature or structure of the protein and thus extend its range of functions. PTMs may occur at different stages of a protein's life cycle.

Approaches to PTM prediction

[Figure: prediction pipeline. Protein P62262 -> PTM site extraction (... EERNLLSVAY KNVIGARRASW ...) -> amino acids translated into physicochemical attribute values -> feature set -> machine learning algorithms -> predictor.]

Key design questions [3]

1. How large a sequence neighbourhood around the PTM site should be used?
2. Which attributes, and how many, should be used to encode each amino acid as features?

Our dataset

PTM occurrence data were extracted from UniProtKB and grouped by PTM type into 278 sets. Phosphorylations were additionally divided into groups by the type of kinase that catalysed the reaction.

PTM type                 # positive examples
Phosphoserine            39616
  of which by PKA        388
  of which by PKC        350
  of which by CK2        246
  ...                    ...
N6-acetyllysine          8328
Phosphothreonine         8210
  of which by PKC        111
  ...                    ...
N6-succinyllysine        2558
Selenocysteine           2071
...                      ...
All                      106226

Negative examples were constructed as follows:
a) only proteins with at least one PTM site marked in UniProt were considered;
b) within these proteins, relevant amino acids were randomly selected as negative sites;
c) a site was accepted only if it was not itself annotated as a PTM site.

Role of feature selection

To answer both key design questions above, we employ the Monte Carlo Feature Selection method (MCFS) [1].

1. We use a wide sequence neighbourhood window of size 21.
2. For each amino acid we produce 531 features, using all physicochemical descriptors.
3. On the resulting 11151 features (21 × 531) per PTM site record we run MCFS to decide which attributes at which window positions are crucial for determining PTM occurrence.
4. We use the 168 best features selected by MCFS for predictor training. The number 168 was chosen to align with the HQI8 method. Some amino acids from the window might not be used at all in the predictor building phase.

Method and results

The results plot compares predictors built with the feature selection procedure described above (labelled as MCFS) against approaches that use a fixed set of physicochemical attributes for each amino acid:

Attributes selected with clustering (labelled as HQI8): representatives of 8 clusters computed over the whole set of physicochemical descriptors. This design was developed for the Auto Motif Server [2].

Expert-chosen attributes (labelled as Kierczak7): chosen from the full set of physicochemical descriptors using expert knowledge.

Conclusions

1. Model saturation can be observed. All three analysed methods give comparable results, with an AUC only around 0.07 higher than that obtained from the raw sequence. Feature selection can therefore hardly be used to improve prediction results, yet it can be used to identify which particular attributes and positions are important for a predictor.
2. Increasing the window size from 9 to 21 raises AUC by at most 0.02.

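The feature construction described under "Role of feature selection" can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (extract_window, encode_window, placeholder_descriptors), the padding symbol X, the all-zero encoding of padded positions, and the descriptor values themselves are assumptions made only for illustration. The example fragment is the one shown in the pipeline figure, assuming it forms one contiguous 21-residue stretch once the space is removed.

```python
# Hypothetical helper names; the poster does not publish its implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed placeholder for positions beyond the sequence ends


def extract_window(sequence, site, window=21):
    """Return `window` residues centred on 0-based index `site`, padded with PAD."""
    if window % 2 == 0:
        raise ValueError("window size must be odd so the site sits in the centre")
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    # After padding, original index `site` sits at `site + half`,
    # so the window starts at index `site` of the padded string.
    return padded[site:site + window]


def encode_window(window_seq, descriptors, n_attributes):
    """Concatenate per-residue attribute vectors into one flat feature list."""
    features = []
    for residue in window_seq:
        # Padding or unknown residues get an all-zero vector (an assumption).
        features.extend(descriptors.get(residue, [0.0] * n_attributes))
    return features


def placeholder_descriptors(n_attributes):
    """Dummy attribute table; the values are NOT real physicochemical data."""
    return {
        aa: [float(i * n_attributes + k) for k in range(n_attributes)]
        for i, aa in enumerate(AMINO_ACIDS)
    }


fragment = "EERNLLSVAYKNVIGARRASW"  # fragment from the pipeline figure, space removed
window_seq = extract_window(fragment, site=10, window=21)
full = encode_window(window_seq, placeholder_descriptors(531), 531)
compact = encode_window(window_seq, placeholder_descriptors(8), 8)
print(len(full), len(compact))  # prints: 11151 168
```

With 531 attributes per residue the 21-position window yields 11151 features, matching the count stated in the text. With 8 attributes per residue the same window yields 168 features, which equals the number of features kept after MCFS.
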
Feature importance

[Figure: feature importance map for expert-chosen physicochemical descriptors and a sequence neighbourhood window size of 9.]

References

[1] Michał Dramiński, Alvaro Rada-Iglesias, Stefan Enroth, Claes Wadelius, Jacek Koronacki, and Jan Komorowski. Monte Carlo feature selection for supervised classification. Bioinformatics, 24(1):110–117, January 2008.
[2] Dariusz Plewczynski, Subhadip Basu, and Indrajit Saha. AMS 4.0: consensus prediction of post-translational modifications in protein sequences. Amino Acids, 43(2):573–582, 2012.
[3] Brett Trost and Anthony Kusalik. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics, 27(21):2927–2935, November 2011.

Project is co-funded by the European Union from resources of the European Social Fund. Project PO KL “Information technologies: Research and their interdisciplinary applications”.