SN Computer Science (2022) 3:61
https://doi.org/10.1007/s42979-021-00948-3
ORIGINAL RESEARCH
Hydropathy and Conformational Similarity-Based Distributed Representation of Protein Sequences for Properties Prediction

Hrushikesh Bhosale¹ · Ashwin Lahorkar² · Divye Singh³ · Aamod Sane¹ · Jayaraman Valadi¹
Received: 3 September 2021 / Accepted: 18 October 2021
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2021
Abstract
In the natural language processing (NLP) community, conventional features such as TF-IDF are commonly employed for text mining and other applications. These conventional features lack semantic/syntactic information. Researchers in the text mining field discovered that distributed representations of words can capture this information and increase the predictive power of algorithms. Word2Vec, which learns word embeddings from text, is a very popular distributed representation in NLP tasks. Recently, researchers introduced such distributed representations, viz. ProtVec, for various protein function annotation tasks with considerable success. In this work, we have developed reduced protein alphabet representations employing two different reduction schemes for four different regression tasks. Using the entire set of annotated Swiss-Prot sequences, we extracted embedding vectors with skip-gram models over different embedding vector sizes, k-mer sizes and context window sizes. We then used these vectors as input to the Support Vector Machine regression algorithm to build regression models. In this way we built seven different models: the original ProtVec model, a hydropathy-based reduced alphabet model, a conformational similarity-based reduced alphabet model, and all possible combinations of these three models. The performance improvement on the absorption and enantioselectivity tasks indicates that grouping the alphabet on an appropriate basis can indeed play a major role in enhancing algorithm capabilities. Our results on all four tasks indicate that individual reduced-alphabet representations and certain synergistic combinations can considerably increase prediction performance. This new method exhibits multiple advantages, including improved semantic/syntactic information and more compact reduced representations. It can also provide important domain information that can be used in further experiments to develop sequences with desired properties.
Keywords ProtVec · RA2Vec · SVM · Protein property predictions
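The preprocessing summarized in the abstract — translating a protein sequence into a reduced alphabet and splitting it into overlapping k-mers that serve as the "words" of a protein "sentence" — can be sketched as follows. The three-group hydropathy mapping below is an illustrative assumption, not necessarily the exact grouping used in this work:

```python
# Sketch of reduced-alphabet k-mer preprocessing. The hydropathy grouping
# here (H = hydrophobic, P = polar/charged, N = neutral/small) is a
# hypothetical example; the authors' actual grouping may differ.
HYDROPATHY = {
    **dict.fromkeys("AILMFVWC", "H"),
    **dict.fromkeys("RNDQEKH", "P"),
    **dict.fromkeys("GPSTY", "N"),
}

def reduce_alphabet(seq, mapping):
    """Translate a sequence into its reduced-alphabet form."""
    return "".join(mapping.get(aa, "X") for aa in seq)

def kmers(seq, k=3):
    """Overlapping k-mers: the 'words' of the protein 'sentence'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "MKTAYIAKQR"                      # toy sequence for illustration
reduced = reduce_alphabet(seq, HYDROPATHY)
sentence = kmers(reduced, k=3)
# Lists like `sentence`, built over a whole corpus such as Swiss-Prot,
# are what a skip-gram model (e.g. gensim's
# Word2Vec(corpus, sg=1, vector_size=100, window=5)) is trained on.
```

Averaging the learned k-mer vectors over a sequence then yields the fixed-length representation that is fed to the Support Vector Machine regressor.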
Introduction
In the natural language processing community, conventional features like TF-IDF are commonly employed for text mining and other applications. These conventional features lack semantic/syntactic information. Researchers in text mining discovered that distributed representations of words can capture this information and increase the predictive power of algorithms. With this aim, Mikolov et al. [1] presented a new paradigm, known as Word2Vec, to learn word embeddings from texts. These embeddings are numerical representations of any given word in a text. Word2Vec employs this distributed representation to embed every word in a document in an n-dimensional vector space. The new model brought substantial performance improvements in several text mining applications such as sentiment analysis. Word2Vec consists of two different
This article is part of the topical collection “Enabling Innovative
Computational Intelligence Technologies for IOT” guest edited by
Omer Rana, Rajiv Misra, Alexander Pfeifer, Luigi Troiano and
Nishtha Kesswani.
* Jayaraman Valadi
valadi@gmail.com
1 Department of Computer Science, FLAME University, Pune, Maharashtra, India
2 CMS SPPU, Pune, Maharashtra, India
3 Engineering for Research, Thoughtworks Technologies, Pune, Maharashtra, India