INTRODUCING CARATE: FINALLY SPEAKING CHEMISTRY.

A PREPRINT

Julian M. Kleber*
Freie Universität Berlin
Stettiner Straße, 13357 Berlin
julian.kleber@fu-berlin.de

February 16, 2022

ABSTRACT

Computer-aided drug design is entering a new era. Recent developments in statistical modelling, including deep learning, machine learning, and high-throughput simulations, enable workflows and deductions that were not achievable 20 years ago. Many small molecules act primarily through interactions with biomolecules. The interaction between a small molecule and a biological system therefore manifests itself at multiple time and length scales. While the human chemist grasps the concept of multiple scales quite intuitively, most computer technologies do not relate multiple scales easily. Numerous methods in the computational sciences have therefore been developed to address this. However, until now it has not been clear that the problem of multiple scales is not merely a matter of computational power but, even more, a matter of accurate representation. Because a large number of simulations have already been performed at various scales, deep learning (DL) becomes a viable approach to speed up computations by approximating simulations in a data-driven way. However, to accurately approximate the physical (simulated) properties of a given compound, an accurate, uniform representation is mandatory. Therefore, the biochemical and pharmaceutical encoder (CARATE) is introduced. Furthermore, the regression and classification abilities of CARATE are evaluated against benchmarking datasets (ZINC, ALCHEMY, MCF-7, MOLT-4, YEAST, Enzymes, Proteins) and compared to other baseline approaches.

Keywords Multi-Head Self-Attention, Graph Neural Networks, Regression, Quantum Chemistry, Computer-Aided Drug Design, Computational Science, Deep Learning

1 Introduction

The wish to accurately predict the outcome of a drug development process is as old as the regulated pharmaceutical industry itself.
However, simulating a drug interaction in the body is a multiscale problem. Up to now, no accurate method has been found to model the whole life cycle of an active pharmaceutical ingredient (API) inside a mammal. Work on multiscale problems focuses on different aspects of the system at multiple scales and tries to represent the important aspects of a particular scale accurately by the means of that scale. Parts of a scale that are deemed unimportant are approximated more efficiently with methods of a larger scale [1]. When different length and time scales are involved, computationally expensive methods for the small length scale may become infeasible. For example, to model the interaction of a small molecule with a receptor, one needs to combine quantum chemistry (QC) with classical molecular mechanics (MM), which is known as the QM/MM approach. In QM/MM simulations, the QC calculations remain the bottleneck. However, recent advances in deep learning make the prediction of a particular simulation scale accessible [2, 3]. The prediction of quantum chemical properties with deep learning methods is also accessible [4]. To speed up QM/MM simulations, the QC computations should therefore be approximated by predicting the desired properties via machine learning methods [5].

* Master's student at Freie Universität Berlin.

bioRxiv preprint doi: https://doi.org/10.1101/2022.02.12.470636; this version posted February 16, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.