Protein complex similarity based on Weisfeiler-Lehman labeling

Bianca K. Stöcker 1,2, Till Schäfer 4, Petra Mutzel 4, Johannes Köster 1,2,3, Nils Kriege 4, and Sven Rahmann 1,4

1 Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, 45147 Essen, Germany
2 Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, 45147 Essen, Germany
3 Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston MA 02215, USA
4 Department of Computer Science, TU Dortmund University, 44221 Dortmund, Germany

Corresponding author:
Sven Rahmann 1
Email address: Sven.Rahmann@uni-due.de

ABSTRACT
Being able to quantify the similarity between two protein complexes is essential for numerous applications. Prominent examples are database searches for known complexes with a given query complex, comparison of the output of different protein complex prediction algorithms, or summarizing and clustering protein complexes, e.g., for visualization. While the corresponding problems have received much attention on single proteins and protein families, the question about how to model and compute similarity between protein complexes has not yet been systematically studied. Because protein complexes can be naturally modeled as graphs, in principle general graph similarity measures may be used, but these are often computationally hard to obtain and do not take typical properties of protein complexes into account. Here we propose a parametric family of similarity measures based on Weisfeiler-Lehman labeling. We evaluate it on simulated complexes of the extended human integrin adhesome network.
Because the connectivity (graph topology) of real complexes is often unknown and hard to obtain experimentally, we use both known protein-protein interaction networks and known interdependencies (constraints) between interactions to simulate more realistic complexes than from interaction networks alone. We empirically show that the defined family of similarity measures is in good agreement with edit similarity, a similarity measure derived from graph edit distance, but can be much more efficiently computed. It can therefore be used in large-scale studies and simulations and serve as a basis for further refinements of modeling protein complex similarity.

INTRODUCTION
Proteins fulfill manifold tasks in living cells, but they rarely act alone. Indeed, most cellular functions are enabled only when proteins physically interact with other proteins, forming protein complexes. DNA transcription is a typical example, where RNA polymerase II, general transcription factors, cell type specific transcription regulators and mediator proteins interact.

Understanding protein complex formation and function is one of the big challenges of cell biology, approached by both experimental techniques and computational modeling. While the constituent protein sequences can be obtained from the genome (even that can be challenging), the computational prediction of real protein complexes from protein interaction networks appears to be much more difficult as evidenced by the recent literature on the topic; see Bhowmick and Seah (2016) for a survey, or Srihari et al. (2017) for a textbook introduction. Fortunately, new experimental technologies are about to enhance our understanding of complexes significantly in the near future, e.g., high-resolution protein-protein docking (Park et al., 2015; Vakser, 2014; Kozakov et al., 2017; Wass et al., 2011).
Large scale generation of libraries of cell

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.26612v1 | CC BY 4.0 Open Access | rec: 3 Mar 2018, publ: 3 Mar 2018