TransDTI: Transformer-Based Language Models for Estimating DTIs
and Building a Drug Recommendation Workflow
Yogesh Kalakoti, Shashank Yadav, and Durai Sundar*
Cite This: ACS Omega 2022, 7, 2706-2717
ABSTRACT: The identification of novel drug-target interactions
is a labor-intensive and low-throughput process. In silico
alternatives have proved to be of immense importance in assisting
the drug discovery process. Here, we present TransDTI, a
multiclass classification and regression workflow employing
transformer-based language models to classify interactions
between drug-target pairs as active, inactive, or intermediate.
The models were trained on large-scale drug-target interaction
(DTI) data sets and showed improved performance over baseline
methods in terms of the area under the receiver operating
characteristic curve (auROC), the area under the precision-recall
curve (auPR), the Matthews correlation coefficient (MCC), and R².
The results showed that transformer-based language models
effectively predict novel drug-target interactions from sequence data. The proposed models significantly outperformed
existing methods such as DeepConvDTI, DeepDTA, and DeepDTI on a test data set. Further, novel interactions
predicted by TransDTI were supported by molecular docking and simulation analysis, in which the predicted compounds showed
interaction potential with MAP2k and transforming growth factor-β (TGFβ) similar to or better than that of their known inhibitors. The proposed
approaches can have a significant impact on the development of personalized therapy and clinical decision making.
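Three of the evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names are hypothetical, MCC is shown in its binary form for brevity (the three-class setting described above would need the multiclass generalization), auROC is computed by direct pairwise comparison, which is practical only for small samples, and auPR is omitted.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from confusion-matrix counts (labels are 0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def r_squared(y_true, y_pred):
    """Coefficient of determination for regression outputs."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def au_roc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Equivalent to the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

MCC and auROC apply to the classification output, while R² applies to the regression output, which is why the abstract reports all of them side by side.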
1. INTRODUCTION
Identification of novel drug-target interactions (DTIs) is
generally a slow, labor-intensive, and risky process. A
conventional drug discovery and development pipeline can
cost about a billion USD and, more importantly, take around 14
years.¹,² Assay-based protocols in a drug discovery workflow
follow several steps, including lead identification, optimization,
screening, and characterization, each of which adds to the
financial and temporal burden. Alternatively, computational
methods have gathered pace for their utility in predicting novel
drug-target interactions and aiding the process of drug
discovery.³,⁴ Although experimental methods remain more reliable
and robust than in silico alternatives, experimental
characterization of every possible drug-target pair is not practical
owing to the low throughput of such assays.
Traditional DTI prediction workflows can be categorized
into three classes: (i) ligand-based approaches, (ii) docking-based
approaches, and (iii) chemogenomic approaches.⁵⁻⁷
Molecular similarity serves as the deciding
criterion for ligand-based approaches.⁸ However, because the
available data are insufficient for many targets, this approach can
be unreliable. Similarly, docking-based approaches rely on
molecular structures and sophisticated algorithms/software to
simulate interactions between the drug-target pair under
consideration. The biggest bottleneck of such an approach is
the nonavailability of quality three-dimensional (3D) protein
structures.⁹ Solving a protein’s crystal structure experimentally
is time-consuming and labor-intensive.
For instance, solving the 3D structure of targets like G
protein-coupled receptors (GPCRs) is still challenging.¹⁰
Therefore, docking-based approaches can cover only a fraction
of the entire DTI spectrum. Alternatively, chemogenomic
approaches try to evade the drawbacks of the aforementioned
methods by concurrently employing the information of drug
and target to establish their association.
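As a rough illustration of the ligand-based idea, the sketch below scores molecular similarity with the Tanimoto coefficient. This is a common choice for comparing fingerprints, though the text does not say which measure any particular method uses. Fingerprints are represented as plain sets of hypothetical feature indices, and the ligand names and target labels are invented for the example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    # Two empty fingerprints share nothing measurable; define similarity as 0.
    return len(a & b) / len(union) if union else 0.0


def most_similar_ligand(query_fp, known_ligands):
    """Return (name, target, similarity) of the known ligand closest to the query.

    known_ligands maps a ligand name to a (fingerprint, target) pair.
    """
    name, (fp, target) = max(
        known_ligands.items(),
        key=lambda item: tanimoto(query_fp, item[1][0]),
    )
    return name, target, tanimoto(query_fp, fp)


# Hypothetical toy data: bit indices and labels are made up for illustration.
KNOWN = {
    "ligand_A": ({1, 4, 7, 9}, "target_X"),
    "ligand_B": ({2, 3, 5}, "target_Y"),
}
name, target, score = most_similar_ligand({1, 4, 7, 8}, KNOWN)
```

The weakness noted above shows up directly in this setup: a target with few or no known ligands leaves nothing informative to compare the query against.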
Advances in sequencing technologies have enabled the
collection of vast amounts of biological data.¹¹ Data at such a
scale present a golden opportunity for developing
powerful sequence-based approaches for modeling protein
structure and function, eventually aiding DTI prediction.
Similar to grammatical rules responsible for the working of
natural languages, biological sequences hold semantic and
syntactical information that govern their functioning, mecha-
Received: September 19, 2021
Accepted: December 28, 2021
Published: January 12, 2022
Article http://pubs.acs.org/journal/acsodf
© 2022 The Authors. Published by
American Chemical Society
https://doi.org/10.1021/acsomega.1c05203