1545-5963 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2019.2944610, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, JULY 2019

Hybrid machine learning models for predicting types of Human T-cell Lymphotropic Virus

Gaurav Sharma, Prashant Singh Rana and Seema Bawa

Abstract—Life-threatening diseases such as adult T-cell leukemia, neurodegenerative diseases, demyelinating diseases such as HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP), hypocalcaemia, and bone lesions are caused by a group of human retroviruses known as human T-cell lymphotropic viruses (HTLVs). Of the four types of HTLV, HTLV-1 is the most prominent, afflicting over 20 million people around the world, yet little effort has been made to understand its epidemiology or to control its prevalence. The situation is made worse because most infected individuals remain asymptomatic throughout their lifetime, and the limited diagnostic methods are often unavailable for timely detection of infected individuals. Moreover, there is at present no licensed vaccine for HTLV-1 infection. There is therefore a need to develop a faster and more efficient diagnostic method for the detection of HTLV-1. Motivated by the success of machine learning techniques in bioinformatics, this is the first study in which 64 hybrid machine learning techniques are proposed for predicting the type of HTLV (HTLV-1, HTLV-2 and HTLV-3).
The hybrid techniques are built from combinations of four classification methods, four feature weighting techniques and four feature selection techniques (4 × 4 × 4 = 64 hybrids). When evaluated on several model evaluation parameters, the proposed hybrid models are found to predict the type of HTLV efficiently. The best hybrid model achieves an accuracy, AUROC value and F1 score of 99.85%, 0.99 and 0.99, respectively. Such a system can assist the current diagnostic workflow for the detection of HTLV-1: after molecular diagnosis of HTLV by screening tests such as enzyme-linked immunoassay or particle agglutination assays, confirmatory tests such as western blotting, immunofluorescence assay or radioimmunoprecipitation assay are always needed to distinguish HTLV-1 from HTLV-2. These confirmatory tests are complex analytical techniques involving many steps. The proposed hybrid techniques can be used to support and verify the results of the confirmatory tests on the protein mixture. Furthermore, better insight into the virus can be obtained by exploring the physicochemical properties of the protein sequences of HTLVs.

Index Terms—HTLV Prediction, Machine Learning, Random Forest, K-means, Feature Selection, Heuristic Technique.

1 INTRODUCTION

Human T-cell lymphotropic virus type 1 (HTLV-1) is a retrovirus that was first identified in 1980 in a T-cell line derived from a patient diagnosed with cutaneous T-cell lymphoma [1]. Soon after its discovery, HTLV-1 was found to be identical to adult T-cell leukemia virus (ATLV) at the sequence level [2]. A second human retrovirus, identified after the discovery of HTLV-1, was designated HTLV-2. It resembles HTLV-1 in genome structure and shares about 70% nucleotide sequence homology with it [3].
Later, in 2005, other viruses related to HTLV-1 were reported in central Africa and designated HTLV-3 and HTLV-4 [4]. However, only HTLV-1 is currently found to be linked with human diseases. Globally, approximately 20 million people are estimated to be infected with this oncogenic retrovirus, but its epidemiology is still not properly understood [5]. The three main routes of HTLV-1 transmission are mother-to-child transmission (primarily via breastfeeding), parenteral transmission (through contaminated blood transfusion or the sharing of infected needles) and sexual transmission (male to female) [6]. Although the transmission modes of HTLV-1 closely resemble those of human immunodeficiency virus (HIV-1), very little effort has been made to prevent or reduce its transmission. Furthermore, an infected person has a high risk of adult T-cell leukemia (ATL), a rapidly progressive malignancy, and of HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP), a debilitating and sometimes fatal neurologic condition [7] [8].

• Gaurav Sharma, Prashant Singh Rana and Seema Bawa are with the Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab-147004, India. E-mail: ergauravrandev@gmail.com

Within the retrovirus family, HTLV-1 belongs to the deltaretroviruses, a group that also includes bovine leukemia virus, simian T-cell leukemia virus (STLV) and HTLV-2. Like HTLV-1, bovine leukemia virus and STLV also cause lymphoid malignancies in their hosts. Studies suggest that HTLV and STLV might have originated from common ancestors, as they share similar molecular, virological and epidemiological features; for this reason they are designated primate T-cell leukemia viruses (PTLVs) [9][10]. In terms of geographic distribution, southwestern Japan, sub-Saharan Africa, a few regions of Central and South America, and the Caribbean islands are the