978-1-7281-1867-3/19/$31.00 ©2019 IEEE

Capsule Network for Predicting Zinc Binding Sites in Metalloproteins

Clement Essien
Department of Electrical Engineering and Computer Science
Christopher S. Bond Life Sciences Center
University of Missouri, Columbia, MO 65211, USA
u.c.essien@mail.missouri.edu

Duolin Wang
Department of Electrical Engineering and Computer Science
Christopher S. Bond Life Sciences Center
University of Missouri, Columbia, MO 65211, USA
wangdu@missouri.edu

Dong Xu
Department of Electrical Engineering and Computer Science
Christopher S. Bond Life Sciences Center
University of Missouri, Columbia, MO 65211, USA
xudong@missouri.edu

Abstract

Zinc is an important cofactor for various biological functions in plants and animals, and these functions are usually associated with proteins. Zinc also plays an important role in the structures of the proteins to which it binds. Hence, predicting the zinc binding sites in these proteins is important for better understanding their structures and functions. Most of the existing tools developed in this domain are structure-based predictors that apply Support Vector Machines to datasets that are more than a decade old. As little work has been done to explore the use of deep learning frameworks for this problem, we propose ZinCaps, a framework based on the capsule network for predicting zinc binding sites using sequence-only information on more recently compiled datasets. ZinCaps outperforms previous tools. Its source code is freely available for download at https://github.com/clemEssien/ActiveSitePrediction.

Keywords: zinc metal binding site; metalloproteins; deep learning; capsule network

I. INTRODUCTION

Many proteins interact with metals to perform certain biological functions. Metal-binding proteins, known as metalloproteins, require specific metal cation(s) to function properly.
Such metal ions play major roles in a wide range of cellular processes and are also useful in the development of metal-based drugs such as anticancer drugs [1]. These metals fall into two categories: the alkali/alkaline earth metals and the transition metals. The former play a structural role, while the latter play both structure-stabilizing and catalytic roles [2].

Zinc is a transition metal and, after iron, the second most abundant trace metal in the human body. A 70 kg adult human has about 2.3 g of zinc [3]. It is required for more than 300 enzyme activities spanning all six classes of enzymes. The zinc (Zn²⁺) cofactor is essential for several biological functions in plants and animals. When observed in tissues, zinc is mainly associated with proteins [3]. As many as 40% of the human proteins that bind zinc are transcription factors, while the rest are mostly enzymes and proteins involved in ion transport [4].

Zinc has several chemical properties that lead to a variety of functions. It does not undergo redox reactions because its d-shell is filled, unlike those of other first-row transition metals. As a result, it offers stability in biological environments that are characterized by fluctuating redox potentials [5].

Covalent zinc binding is one of the most important post-translational modifications in proteins. The binding site comprises the sulfur of cysteine, the nitrogen of histidine and/or the oxygen of aspartate and glutamate [3]. Histidine is the most frequently observed ligand, followed by cysteine [6]. There is a correlation between the number of amino acids to which the zinc atom binds and the activity of the metalloprotein [7]. There are three primary types of zinc sites: structural, catalytic and cocatalytic. Structural zinc binds to four amino acids with no bound water molecule. Cysteine (Cys) is preferred in this site.
Structural zinc essentially maintains the stability of the protein tertiary structure without taking part in the biochemical reaction [8] [9]. Catalytic zinc binds to three amino acid residues, forming a complex with a water molecule and any three nitrogen, oxygen and sulfur donors. Such sites are actively involved in biochemical reactions. Histidine (His) is preferred for these sites. In cocatalytic zinc sites, zinc interacts with other metal ions (usually two or three), typically linked through a side-chain atom or a water molecule, to carry out its function [10]. Aspartate (Asp) and glutamate (Glu) are the preferred amino acids in these sites. There is also a fourth type of zinc binding site that arises from the influence of zinc on the quaternary protein structure, where zinc ions bind to one or two amino acid residues (Asp, Glu or His but no Cys) on the protein surface during crystallization. Such sites have neither biological nor catalytic function [9].

Due to the rapid expansion of protein databases, it is becoming important to identify zinc binding sites to understand their functions in metalloproteins. This would be useful for the prediction of protein structure and function. Because determining zinc binding sites using experimental techniques is laborious and costly, some attempts have been made to develop computational tools by training machine learning models for this purpose.

Several studies have predicted zinc binding sites in metalloproteins from protein sequences. Ref. [11] generated zinc binding Cys, His, Glu and Asp predictors by training a support vector machine (SVM) classifier with position-specific scoring matrices (PSSMs) obtained from PSI-BLAST. Ref. [12] presented a two-stage approach that uses an SVM and a recurrent neural network (RNN) in the first and second stages, respectively. It predicts whether each Cys and His is in a free state, a metal-bound state or a disulfide bridge.
While the first stage predicts the binding states, the second refines these predictions by considering dependencies between the protein residues. The approach achieved 73% precision and 61% recall in predicting zinc binding sites in proteins. Ref. [13] proposed ZincPred which combined SVM with homology-