biology
Article
An Alignment-Independent Approach for the Study of Viral
Sequence Diversity at Any Given Rank of Taxonomy Lineage
Li Chuin Chong 1, Wei Lun Lim 2, Kenneth Hon Kim Ban 3 and Asif M. Khan 1,4,*
Citation: Chong, L.C.; Lim, W.L.; Ban, K.H.K.; Khan, A.M. An Alignment-Independent Approach for the Study of Viral Sequence Diversity at Any Given Rank of Taxonomy Lineage. Biology 2021, 10, 853. https://doi.org/10.3390/biology10090853
Academic Editors: Ville Pimenoff, Riaan F. Rifkin and Simon Underdown
Received: 11 June 2021
Accepted: 19 August 2021
Published: 31 August 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1 Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur 50490, Malaysia; lichuinchong@gmail.com
2 Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63100, Malaysia; lynerlwl@gmail.com
3 Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117596, Singapore; bchbhkk@nus.edu.sg
4 Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, 34820 Istanbul, Turkey
* Correspondence: asif@perdanauniversity.edu.my or makhan@bezmialem.edu.tr
Simple Summary: Viral sequence variation can expand the host repertoire, enhance the infection
ability, and/or prevent the build-up of a long-term specific immunity by the host. The study of
viral diversity is, thus, critical to understand sequence change and its implications for intervention
strategies. Typically, these studies are performed using alignment-dependent approaches. However,
such approaches become limited as sequence diversity increases. Herein, we present an
alignment-free algorithm, implemented as a publicly available tool, UNIQmin, to determine the
effective viral sequence diversity at any rank of the viral taxonomy lineage. UNIQmin enables the
generation of a minimal set for a given sequence dataset of interest and is applicable to big data,
with a reasonable time performance. The minimal set is the smallest possible number of unique
sequences required to represent a given peptidome diversity (pool of distinct peptides of a specific
length) exhibited by a non-redundant dataset. This compression is possible through the removal of
unique sequences that do not contribute effectively to the peptidome diversity pool. The utility of
UNIQmin was demonstrated for the species Dengue virus, genus Flavivirus, family Flaviviridae, and
superkingdom Viruses. The concept of a minimal set is generic and thus possibly applicable to both
genomic and proteomic data of non-viral, pathogenic microorganisms.
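The notion of a peptidome diversity pool can be illustrated with a minimal Python sketch. It is not taken from UNIQmin; the function names `kmers` and `peptidome`, the toy sequences, and the short peptide length of 3 (chosen only to keep the example readable) are all assumptions made for illustration.

```python
def kmers(seq, k=9):
    """Return the set of overlapping k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def peptidome(seqs, k=9):
    """Return the pool of distinct k-mers found across all sequences."""
    pool = set()
    for s in seqs:
        pool |= kmers(s, k)
    return pool


# Hypothetical toy dataset of unique sequences; k = 3 for readability.
seqs = ["MKTAYI", "KTAY", "AYIGG"]
pool = peptidome(seqs, k=3)

# Dropping the second sequence leaves the pool unchanged, so that
# sequence does not contribute effectively to the peptidome diversity.
without_second = peptidome([seqs[0], seqs[2]], k=3)
print(len(pool), pool == without_second)
```

In this toy case the second sequence is unique as a string, yet every peptide it encodes is already supplied by the first sequence, which is the kind of sequence a minimal set would leave out.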
Abstract: The study of viral diversity is imperative in understanding sequence change and its
implications for intervention strategies. The widely used alignment-dependent approaches to study
viral diversity are limited in their utility as sequence dissimilarity increases, particularly when
expanded to the genus or higher ranks of viral species lineage. Herein, we present an alignment-
independent algorithm, implemented as a tool, UNIQmin, to determine the effective viral sequence
diversity at any rank of the viral taxonomy lineage. This is done by performing an exhaustive search
to generate the minimal set of sequences for a given viral non-redundant sequence dataset. The
minimal set comprises the smallest possible number of unique sequences required to capture
the diversity inherent in the complete set of overlapping k-mers encoded by all the unique sequences
in the given dataset. Such dataset compression is possible through the removal of unique sequences,
whose entire repertoire of overlapping k-mers can be represented by other sequences, thus rendering
them redundant to the collective pool of sequence diversity. A significant reduction, namely ~44%,
~45%, and ~53%, was observed for all reported unique sequences of species Dengue virus, genus
Flavivirus, and family Flaviviridae, respectively, while still capturing the entire repertoire of nonamer
(9-mer) viral peptidome diversity present in the initial input dataset. The algorithm is scalable for big
data as it was applied to ~2.2 million non-redundant sequences of all reported viruses. UNIQmin is
open source and publicly available on GitHub. The concept of a minimal set is generic and, thus,
potentially applicable to other pathogenic microorganisms of non-viral origin, such as bacteria.
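The idea of compressing a dataset while retaining its full k-mer repertoire can be sketched in Python as follows. This is a greedy set-cover heuristic swapped in for illustration, not the exhaustive search that UNIQmin performs, and it is not guaranteed to return the smallest possible set in general. The function name `greedy_minimal_set`, the toy sequences, and the use of k = 3 instead of nonamers are assumptions.

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_minimal_set(seqs, k=9):
    """Greedily pick sequences until every k-mer in the dataset is covered.

    Illustrative heuristic only (assumption): at each step, take the
    sequence that covers the most k-mers not yet covered.
    """
    unique = list(dict.fromkeys(seqs))          # drop exact duplicates, keep order
    kmer_sets = {s: kmers(s, k) for s in unique}
    uncovered = set().union(*kmer_sets.values())
    chosen = []
    while uncovered:
        best = max(unique, key=lambda s: len(kmer_sets[s] & uncovered))
        chosen.append(best)
        uncovered -= kmer_sets[best]
    return chosen


# Hypothetical toy input; k = 3 keeps the example small.
seqs = ["MKTAYIGG", "MKTAY", "AYIGG", "QQQWW", "MKTAYIGG"]
chosen = greedy_minimal_set(seqs, k=3)

# The selected subset must encode exactly the same k-mers as the full input.
full_pool = set().union(*(kmers(s, 3) for s in seqs))
kept_pool = set().union(*(kmers(s, 3) for s in chosen))
print(chosen, kept_pool == full_pool)
```

Here four unique sequences reduce to two because every 3-mer of the two shorter fragments is already encoded by the longer sequence. Finding a true minimum is an instance of the set-cover problem, so a greedy pass like this one should be read as an approximation under the stated assumptions.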
Biology 2021, 10, 853. https://doi.org/10.3390/biology10090853 https://www.mdpi.com/journal/biology