The Journal of Supercomputing
https://doi.org/10.1007/s11227-023-05602-8
A novel Apache Spark-based 14-dimensional scalable feature extraction approach for the clustering of genomics data
Rajesh Dwivedi¹ · Aruna Tiwari¹ · Neha Bharill² · Milind Ratnaparkhe³ · Parul Mogre¹ · Pranjal Gadge⁴ · Kethavath Jagadeesh¹
Accepted: 17 August 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023
Abstract
Feature extraction is essential in bioinformatics because it transforms genomics
sequences into feature vectors, which are needed for clustering to discover the family
of a newly sequenced genome. Most existing feature extraction methods extract
similar features for dissimilar sequences, do not extract context-based features, and
are unable to handle millions of genome sequences because they are not scalable. To
tackle these challenges, we propose an efficient Apache Spark-based scalable feature
extraction approach that extracts significantly important features from millions
of genome sequences in less computational time. The proposed approach extracts
features in five stages, i.e., based on the length of the sequence, the frequency of
nucleotide bases, the pattern organization of nucleotide bases, the distribution of
nucleotide bases, and the entropy of the sequence, to generate a fixed-length numeric
vector consisting of only 14 dimensions that describes each genome sequence uniquely. The
proposed approach efficiently extracts context-based features in terms of pattern
organization and distribution, and uses a novel power method to avoid extracting the
same features for dissimilar sequences. The features extracted with
the proposed scalable feature extraction approach are used as input to the k-means and
fuzzy c-means clustering techniques. The experimental results show that the proposed
method is highly successful and efficient in terms of computing time in comparison
to other state-of-the-art approaches.
Keywords Feature extraction · Apache Spark · Genome sequences · k-means · Fuzzy c-means
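To make the idea of a fixed-length feature vector concrete, the following is a minimal sketch covering only three of the five stages named in the abstract: sequence length, nucleotide base frequency, and entropy. It is not the authors' 14-dimensional method. The function name `basic_features`, the use of relative frequencies over the bases A, C, G, and T, and the choice of Shannon entropy in bits are all assumptions made for illustration; the pattern organization, distribution, and power-method stages are omitted because the abstract does not define them.

```python
import math
from collections import Counter

# Assumed alphabet: the four standard DNA bases (hypothetical choice for this sketch).
BASES = "ACGT"


def basic_features(seq: str) -> list[float]:
    """Return [length, freq(A), freq(C), freq(G), freq(T), entropy] for one sequence.

    Illustrative only: covers 6 values, not the 14 dimensions of the paper.
    """
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)
    # Relative frequency of each base; an empty sequence yields all zeros.
    freqs = [counts.get(b, 0) / n if n else 0.0 for b in BASES]
    # Shannon entropy (in bits) of the base distribution; zero-probability bases are skipped.
    entropy = float(sum(-p * math.log2(p) for p in freqs if p > 0))
    return [float(n), *freqs, entropy]


if __name__ == "__main__":
    for s in ("ACGT", "AAAA", "ACGTTTGA"):
        print(s, basic_features(s))
```

Because the function maps one sequence to one vector with no shared state, it is the kind of per-record transformation that could be applied in parallel across a distributed collection of sequences, which is the scalability property the abstract attributes to the Spark-based approach.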
Extended author information available on the last page of the article