TransDTI: Transformer-Based Language Models for Estimating DTIs and Building a Drug Recommendation Workflow
Yogesh Kalakoti, Shashank Yadav, and Durai Sundar*
Cite This: ACS Omega 2022, 7, 2706-2717
Received: September 19, 2021. Accepted: December 28, 2021. Published: January 12, 2022.
http://pubs.acs.org/journal/acsodf
https://doi.org/10.1021/acsomega.1c05203
© 2022 The Authors. Published by American Chemical Society.

ABSTRACT: The identification of novel drug-target interactions is a labor-intensive and low-throughput process. In silico alternatives have proved to be of immense importance in assisting the drug discovery process. Here, we present TransDTI, a multiclass classification and regression workflow employing transformer-based language models to segregate interactions between drug-target pairs as active, inactive, and intermediate. The models were trained on large-scale drug-target interaction (DTI) data sets and showed improved performance over baseline methods in terms of the area under the receiver operating characteristic curve (auROC), the area under the precision-recall curve (auPR), Matthews correlation coefficient (MCC), and R². The results showed that models based on transformer-based language models effectively predict novel drug-target interactions from sequence data. The proposed models significantly outperformed existing methods like DeepConvDTI, DeepDTA, and DeepDTI on a test data set. Further, the validity of novel interactions predicted by TransDTI was found to be backed by molecular docking and simulation analysis, where the model predictions showed interaction potential for MAP2k and transforming growth factor-β (TGFβ) similar to or better than that of their known inhibitors. The proposed approaches can have a significant impact on the development of personalized therapy and clinical decision making.

1. INTRODUCTION

Identification of novel drug-target interactions (DTIs) is generally a slow, labor-intensive, and precarious process. A conventional drug discovery and development pipeline can consume a billion USD and, more importantly, around 14 years [1,2]. Assay-based protocols in a drug discovery workflow follow several steps, including lead identification, optimization, screening, and characterization, each of which adds to the financial and temporal burden. Alternatively, computational methods have gathered pace for their utility in predicting novel drug-target interactions and aiding the process of drug discovery [3,4]. Although traditional methods beat in silico alternatives in terms of reliability and robustness, experimental characterization of every possible drug-target pair is not practical owing to its low-throughput nature.

Traditional DTI prediction workflows can be categorized into three classes: (i) ligand-based approaches, (ii) docking-based approaches, and (iii) chemogenomic approaches [5-7]. Molecular similarity serves as the deciding criterion for ligand-based approaches [8]. However, because data on many targets are insufficient, this approach can be unreliable. Similarly, docking-based approaches rely on molecular structures and sophisticated algorithms/software to simulate interactions between the drug-target pair under consideration. The biggest bottleneck of such an approach is the nonavailability of quality three-dimensional (3D) protein structures [9]. Experimental techniques for solving a protein's crystal structure are time-consuming and labor-intensive. For instance, solving the 3D structure for targets like G protein-coupled receptors (GPCRs) is still challenging [10]. Therefore, docking-based approaches can only cover a fraction of the entire DTI spectrum. Alternatively, chemogenomic approaches try to evade the drawbacks of the aforementioned methods by concurrently employing information on both the drug and the target to establish their association.

Advances in sequencing technologies have enabled the collection of vast amounts of biological data [11]. Data at such a scale have presented a golden opportunity for developing powerful sequence-based approaches for modeling protein structure and function, eventually aiding DTI prediction. Much as grammatical rules govern the working of natural languages, biological sequences hold semantic and syntactic information that governs their functioning, mecha-