MoleHD: Drug Discovery using Brain-Inspired Hyperdimensional Computing

Dongning Ma, Rahul Thapa, Xun Jiao
Villanova University
{dma2, rthapa, xun.jiao}@villanova.edu

Abstract

Modern drug discovery is often time-consuming, complex, and cost-ineffective due to the large volume of molecular data and complicated molecular properties. Recently, machine learning algorithms have shown promising results in virtual screening for automated drug discovery by predicting molecular properties. While emerging learning methods such as graph neural networks and recurrent neural networks exhibit high accuracy, they are also notoriously computation-intensive and memory-intensive, with operations such as feature embeddings or deep convolutions. In this paper, we propose a viable alternative to existing learning methods by presenting MoleHD, a method based on brain-inspired hyperdimensional computing (HDC) for molecular property prediction. We develop HDC encoders to project the SMILES representation of a molecule into high-dimensional vectors that are used for HDC training and inference. We perform an extensive evaluation using 29 classification tasks from 3 widely used molecule datasets (Clintox, BBBP, SIDER) under three split methods (random, scaffold, and stratified). Through a comprehensive comparison with 8 existing learning models, including SOTA graph/recurrent neural networks, we show that MoleHD achieves the highest ROC-AUC score on random and scaffold splits on average across the 3 datasets and the second-highest on the stratified split. Importantly, MoleHD achieves this performance with significantly reduced computing cost and training effort. To the best of our knowledge, this is the first HDC-based method for drug discovery. The promising results presented in this paper can potentially lead to a novel path in drug discovery research.
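To make the idea of projecting a SMILES string into a high-dimensional vector concrete, the following is a minimal sketch and not the MoleHD encoder itself. It assumes character-level tokenization, random bipolar hypervectors, a cyclic shift to mark token position, class prototypes formed by summation, and cosine similarity for inference. All function names (`random_hv`, `permute`, `encode`, `train`, `predict`) are hypothetical, and any labels passed to `train` are placeholders supplied by the caller.

```python
import random

DIM = 10_000  # example dimensionality; an assumption for this sketch


def random_hv(rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def permute(hv, shift):
    """Cyclically rotate a hypervector; used here to mark token position."""
    shift %= len(hv)
    return hv[-shift:] + hv[:-shift] if shift else list(hv)


def encode(smiles, item_memory, rng):
    """Sum position-rotated character hypervectors, then binarize by sign."""
    total = [0] * DIM
    for pos, ch in enumerate(smiles):
        if ch not in item_memory:  # assign a fresh random vector to unseen characters
            item_memory[ch] = random_hv(rng)
        rotated = permute(item_memory[ch], pos)
        total = [t + r for t, r in zip(total, rotated)]
    return [1 if t >= 0 else -1 for t in total]  # ties map to +1


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def train(samples, item_memory, rng):
    """Build one prototype per class by summing the encodings of its samples."""
    prototypes = {}
    for smiles, label in samples:
        hv = encode(smiles, item_memory, rng)
        proto = prototypes.setdefault(label, [0] * DIM)
        for i, v in enumerate(hv):
            proto[i] += v
    return prototypes


def predict(smiles, prototypes, item_memory, rng):
    """Return the label whose prototype is most similar to the query encoding."""
    query = encode(smiles, item_memory, rng)
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))
```

In this sketch, training is a single pass of additions and inference is one similarity comparison per class, which illustrates why HDC-style models can be cheap to train compared with gradient-based networks. A character-level tokenizer is the simplest choice and ignores multi-character atom symbols such as `Cl` or `Br`; a more careful encoder would tokenize those as single units.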
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Drug discovery is the process of using multi-disciplinary knowledge such as biology, chemistry, and pharmacology to discover proficient medications among candidates according to safety and efficacy requirements. Modern drug discovery often features a virtual screening process to select candidates from general chemical databases such as ChEMBL (Gaulton et al. 2012) and OpenChem (Kim et al. 2016) to build a significantly smaller in-house database for further synthesis.

Conventional virtual screening based on computational methods such as similarity searching (Cereto-Massagué et al. 2015) and pharmacophore mapping (Sun 2008) is facing greater obstacles due to the growing magnitude of data. Therefore, data-driven machine learning techniques are increasingly applied to drug discovery, particularly in predicting molecular properties with drug discovery objectives from very large volumes of data.

Traditional machine learning algorithms such as random forest (Jayaraj et al. 2016), support vector machine (Liew et al. 2009), k-nearest neighbors (Arian et al. 2020), and gradient boosting (Wu et al. 2018) have been investigated in drug discovery applications. Such algorithms use molecular representations as input to predict molecular properties. However, because of their limited sophistication, these models generally overlook the deep and complex structural information within a molecule. Thus, they typically do not exhibit strong capability in learning the features and achieve only sub-par performance.

On the other hand, inspired by their recent success in other applications such as computer vision, neural network models have been increasingly applied in drug discovery. As the 2-D structures of molecules are essentially graph-like patterns with atoms (nodes) and bonds (edges), graph neural networks (GNNs) can be naturally applied.
GNNs learn representations by aggregating information from nodes and their neighbors for molecular property prediction under different drug discovery objectives. However, molecular graphs often require pre-processing or featurization. Extended-connectivity fingerprints (ECFP) are one of the most common featurization methods, converting molecular graphs into fixed-length representations, or fingerprints (Rogers and Hahn 2010). Such featurization algorithms usually require considerable effort using chemical tool-chains such as RDKit (Landrum 2013).

arXiv:2106.02894v2 [cs.NE] 20 Sep 2021

This paper takes a radical departure from common machine learning methods, including neural networks, by developing a brain-inspired hyperdimensional computing (HDC) model that requires less pre-processing effort and is easier to implement. Inspired by the attributes of brain circuits, including high dimensionality and fully distributed holographic representation, this emerging computing paradigm postulates the generation, manipulation, and comparison of symbols represented by high-dimensional vectors, e.g., 10,000 dimensions. Compared with DNNs, the advantages of HDC include smaller model size, less computation cost,