SN Computer Science (2022) 3:61
https://doi.org/10.1007/s42979-021-00948-3

ORIGINAL RESEARCH

Hydropathy and Conformational Similarity-Based Distributed Representation of Protein Sequences for Properties Prediction

Hrushikesh Bhosale 1 · Ashwin Lahorkar 2 · Divye Singh 3 · Aamod Sane 1 · Jayaraman Valadi 1

Received: 3 September 2021 / Accepted: 18 October 2021
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2021

Abstract
In the natural language processing (NLP) community, conventional features such as TF-IDF are commonly employed for text mining and other applications. These conventional features lack semantic/syntactic information. Researchers in the text mining field discovered that distributed representations of words can indeed capture this information and increase the predictive power of algorithms. Word2Vec, which learns word embeddings from text, is a very popular distributed representation for NLP tasks. Recently, researchers introduced such distributed representations, viz. ProtVec, for various protein function annotation tasks with considerable success. In this work, we have developed reduced protein alphabet representations employing two different reduction schemes for four different regression tasks. Employing the entire set of Swiss-Prot annotated sequences, we extracted embedding vectors using skip-gram models with different embedding vector sizes, k-mer sizes and context window sizes. We then used these vectors as input to the Support Vector Machine regression algorithm to build regression models. In this way we built seven different models: the original ProtVec model, a hydropathy-based reduced alphabet model, a conformational similarity-based reduced alphabet model, and all possible combinations of these three models.
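To make the reduction step concrete, the sketch below maps a protein sequence onto a reduced hydropathy alphabet and then tokenizes it into overlapping k-mers, the "words" supplied to the skip-gram model. This is a minimal illustration only: the specific three-way grouping (hydrophobic/neutral/polar) and the group labels are assumptions, not the exact scheme used in this work.

```python
# Illustrative hydropathy-based reduced alphabet with three groups.
# NOTE: this three-way split is an assumption for illustration;
# the paper's actual grouping may differ.
HYDROPATHY_GROUPS = {
    "H": "ACFILMVW",  # hydrophobic
    "N": "GHPSTY",    # neutral
    "P": "DEKNQR",    # polar / hydrophilic
}

# Invert to a per-residue lookup table.
REDUCE = {aa: tag for tag, members in HYDROPATHY_GROUPS.items() for aa in members}

def reduce_sequence(seq: str) -> str:
    """Map a protein sequence onto the reduced hydropathy alphabet."""
    return "".join(REDUCE[aa] for aa in seq)

def kmers(seq: str, k: int = 3) -> list:
    """Split a sequence into overlapping k-mers (the 'words' fed to Word2Vec)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

reduced = reduce_sequence("MKVLA")   # -> "HPHHH"
words = kmers(reduced, k=3)          # -> ["HPH", "PHH", "HHH"]
```

Because the reduced alphabet has far fewer symbols than the 20 amino acids, the k-mer vocabulary shrinks sharply, which is one source of the more compact representations mentioned in the abstract.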
The performance improvement in the absorption and enantioselectivity tasks indicates that grouping the alphabet on an appropriate basis can indeed play a major role in enhancing algorithm capabilities. Our results on all four tasks indicate that individual reduced alphabet representations and certain synergistic combinations can considerably increase prediction performance. The new method exhibits multiple advantages, including improved semantic/syntactic information and more compact reduced representations. It can also provide important domain information, which can be used in further experiments to develop sequences with desired properties.

Keywords ProtVec · RA2Vec · SVM · Protein property predictions

Introduction

In the natural language processing community, conventional features such as TF-IDF are commonly employed for text mining and other applications. These conventional features lack semantic/syntactic information. Researchers in text mining discovered that distributed representations of words can indeed capture this information and increase the predictive power of algorithms. With this aim, Mikolov et al. [1] presented a new paradigm, known as Word2Vec, to learn word embeddings from text. These embeddings are numerical representations of any given word in a text. Word2Vec successfully employs this distributed representation and embeds every word in a document in an n-dimensional vector space. With the new model, there were very good performance improvements in several text mining applications such as sentiment analysis. Word2Vec consists of two different

This article is part of the topical collection "Enabling Innovative Computational Intelligence Technologies for IOT" guest edited by Omer Rana, Rajiv Misra, Alexander Pfeifer, Luigi Troiano and Nishtha Kesswani.
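The skip-gram variant of Word2Vec trains a model to predict the surrounding context tokens of each center token within a fixed window. A minimal sketch of how (center, context) training pairs are generated from a k-mer "sentence" is shown below; the function name and the toy k-mer list are illustrative, not taken from the paper.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by the skip-gram model.

    For each position i, every token within `window` positions of i
    (excluding i itself) becomes a context token for tokens[i].
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# A toy "sentence" of overlapping 3-mers from the sequence MKVLA.
kmer_sentence = ["MKV", "KVL", "VLA"]
pairs = skipgram_pairs(kmer_sentence, window=1)
# -> [('MKV', 'KVL'), ('KVL', 'MKV'), ('KVL', 'VLA'), ('VLA', 'KVL')]
```

Enlarging `window` lets each k-mer condition on a broader sequence neighborhood, which is why the context window size is one of the hyperparameters varied in this work alongside embedding dimension and k-mer size.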
* Jayaraman Valadi
valadi@gmail.com

1 Department of Computer Science, FLAME University, Pune, Maharashtra, India
2 CMS SPPU, Pune, Maharashtra, India
3 Engineering for Research, Thoughtworks Technologies, Pune, Maharashtra, India