MoleHD: Drug Discovery using Brain-Inspired Hyperdimensional Computing
Dongning Ma, Rahul Thapa, Xun Jiao
Villanova University
{dma2, rthapa, xun.jiao}@villanova.edu
Abstract
Modern drug discovery is often time-consuming, complex
and cost-ineffective due to the large volume of molecular
data and complicated molecular properties. Recently, ma-
chine learning algorithms have shown promising results in
virtual screening of automated drug discovery by predicting
molecular properties. While emerging learning methods such
as graph neural networks and recurrent neural networks
exhibit high accuracy, they are also notoriously computation-intensive
and memory-intensive, with operations such as feature
embeddings or deep convolutions. In this paper, we propose
a viable alternative to existing learning methods by presenting
MoleHD, a method based on brain-inspired hyperdimensional
computing (HDC) for molecular property prediction.
We develop HDC encoders to project the SMILES representation
of a molecule into high-dimensional vectors that
are used for HDC training and inference. We perform an extensive
evaluation using 29 classification tasks from 3 widely-used
molecule datasets (Clintox, BBBP, SIDER) under three
split methods (random, scaffold, and stratified). Through a
comprehensive comparison with 8 existing learning models,
including SOTA graph/recurrent neural networks, we show
that MoleHD achieves the highest ROC-AUC score
on random and scaffold splits on average across 3 datasets
and the second-highest on the stratified split. Importantly,
MoleHD achieves such performance with significantly re-
duced computing cost and training efforts. To the best of our
knowledge, this is the first HDC-based method for drug dis-
covery. The promising results presented in this paper can po-
tentially lead to a novel path in drug discovery research.
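The encode, train, and infer pipeline summarized above can be illustrated with a minimal sketch. This is not the MoleHD encoder: the character-level tokenization, the use of cyclic rotation to mark token position, bundling by summation, and cosine-similarity inference are all assumptions chosen for brevity, and the labels in `toy` are arbitrary placeholders rather than entries from Clintox, BBBP, or SIDER.

```python
import math
import random

D = 10_000  # hypervector dimensionality (assumed; a commonly quoted HDC size)

def random_hv(rng, dim=D):
    """Random bipolar hypervector with entries drawn from {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def permute(hv, shift):
    """Cyclic rotation, used here to tag a token with its position."""
    shift %= len(hv)
    return hv[-shift:] + hv[:-shift]

def encode(smiles, item_memory, rng):
    """Sum position-rotated character hypervectors (assumed tokenization)."""
    acc = [0] * D
    for pos, ch in enumerate(smiles):
        if ch not in item_memory:  # lazily assign one random vector per character
            item_memory[ch] = random_hv(rng)
        rotated = permute(item_memory[ch], pos)
        acc = [a + r for a, r in zip(acc, rotated)]
    return acc

def cosine(a, b):
    """Cosine similarity between two integer-valued hypervectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def train(samples, item_memory, rng):
    """Accumulate encoded samples into one class hypervector per label."""
    classes = {}
    for smiles, label in samples:
        hv = encode(smiles, item_memory, rng)
        prev = classes.get(label, [0] * D)
        classes[label] = [p + h for p, h in zip(prev, hv)]
    return classes

def predict(smiles, classes, item_memory, rng):
    """Return the label whose class hypervector is most similar to the query."""
    query = encode(smiles, item_memory, rng)
    return max(classes, key=lambda label: cosine(query, classes[label]))

rng = random.Random(0)
memory = {}
# Placeholder labels for illustration only; not real property annotations.
toy = [("CCO", 0), ("CCN", 0), ("c1ccccc1", 1), ("c1ccccc1O", 1)]
model = train(toy, memory, rng)
print(predict("c1ccccc1N", model, memory, rng))
```

Training in this sketch is a single pass of vector additions and inference is one similarity comparison per class, which suggests why an HDC model can be cheap to train and evaluate compared with gradient-based networks.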
Introduction
Drug discovery is the process of using multi-disciplinary
knowledge from fields such as biology, chemistry and pharmacology
to identify effective medications among candidates
according to safety and efficacy requirements. Modern drug
discovery often features a virtual screening process to se-
lect candidates from general chemical databases such as
ChEMBL (Gaulton et al. 2012) and OpenChem (Kim et al.
2016) to build a significantly smaller in-house database for
further synthesis.
Conventional virtual screening based on computational
methods such as similarity searching (Cereto-Massagué
et al. 2015) and pharmacophore mapping (Sun 2008) is facing
greater obstacles due to the growing magnitude of data.
Therefore, data-driven machine learning techniques are
increasingly applied to drug discovery, particularly in
predicting molecular properties for drug discovery objectives
from very large volumes of data.
Copyright © 2022, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Traditional machine learning algorithms such as random
forest (Jayaraj et al. 2016), support vector machine (Liew
et al. 2009), k nearest neighbors (Arian et al. 2020), and gra-
dient boosting (Wu et al. 2018) have been investigated in
drug discovery applications. Such algorithms use molecu-
lar representations as input to predict molecular properties.
However, because of their limited sophistication, deep and
complex structural information within a molecule is generally
overlooked by those models. Thus, they typically do not
exhibit strong capability in learning the features and only
achieve sub-par performance.
On the other hand, inspired by the recent success from
other applications such as computer vision, neural network
models have been increasingly applied in drug discovery.
As the 2-D structures of molecules are essentially graph-like
patterns with atoms (nodes) and bonds (edges), graph neural
networks (GNNs) can be naturally applied. GNNs learn
representations by aggregating node and neighbouring
information for molecular property prediction under different
drug discovery objectives. However, molecular graphs
often require pre-processing or featurization. Extended-connectivity
fingerprints (ECFP) are one of the most common
featurization methods; they convert molecular graphs
into fixed-length representations, or fingerprints (Rogers and
Hahn 2010). Such featurization algorithms usually require
substantial effort using chemical tool-chains such as
RDKit (Landrum 2013).
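To make the graph view concrete, the following sketch treats ethanol (SMILES "CCO") as a three-node graph and performs one round of neighbour aggregation. It is only an illustration under simplifying assumptions: the one-hot atom features, the three-symbol vocabulary, mean aggregation, and the sum readout are choices made here, and a real GNN would additionally apply learned weights and non-linearities at each step.

```python
# Heavy-atom graph of ethanol: C0 - C1 - O2 (hydrogens omitted).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

def one_hot(symbol, vocab=("C", "N", "O")):
    """Encode an atom symbol as a one-hot feature vector (assumed vocabulary)."""
    return [1.0 if symbol == s else 0.0 for s in vocab]

def neighbours(n_atoms, bonds):
    """Build an adjacency list from an undirected bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj

def aggregate(features, adj):
    """One round of mean aggregation over each atom and its bonded neighbours."""
    out = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in adj[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

def readout(features):
    """Sum atom features into a single molecule-level vector."""
    return [sum(col) for col in zip(*features)]

adj = neighbours(len(atoms), bonds)
hidden = aggregate([one_hot(a) for a in atoms], adj)
print(readout(hidden))
```

After one round, the central carbon's feature vector mixes in information from both of its neighbours, while the terminal atoms see only one neighbour each; stacking further rounds widens each atom's receptive field.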
This paper takes a radical departure from common machine
learning methods, including neural networks, by developing
a brain-inspired hyperdimensional computing (HDC)
model that requires less pre-processing effort and is easier
to implement. Inspired by the attributes of brain circuits
including high-dimensionality and fully distributed holo-
graphic representation, this emerging computing paradigm
postulates the generation, manipulation, and comparison
of symbols represented by high-dimensional vectors, e.g.,
10,000 dimensions. Compared with DNNs, the advantages
of HDC include smaller model size, less computation cost,
arXiv:2106.02894v2 [cs.NE] 20 Sep 2021