Int. J. Data Science, Vol. 3, No. 1, 2018 19
Copyright © 2018 Inderscience Enterprises Ltd.
Sequence similarity using composition method
Geetika Munjal*, Pooja Sharma
and Deepti Gaur
Department of Computer Science and Engineering,
The Northcap University,
Gurgaon 122017, India
Email: munjal.geetika@gmail.com
Email: poojasharma0463@gmail.com
Email: deeptigaur@ncuindia.edu
*Corresponding author
Abstract: Deoxyribonucleic acid (DNA) can carry a large amount of important
information in the form of character strings. Sequence analysis is the
process of applying a wide range of methods to DNA sequences to
understand the structure, features or evolution of these nucleotide strings.
The analysis uses mathematical methods to convert these character strings to
numerical values, and these numerical values are used to measure similarity
between the sequences. DNA sequences contain only four nucleotides (A, C, G
and T), so extracting information from them depends on sequence
comparison. In this paper, various methods to analyse DNA sequences,
including the use of entropy, divergence and LZ complexity, and the role
of hybridisation are explored. A hybrid model based on the composition vector
and distance methods is proposed to find dissimilarity between sequences, and
this hybrid model is tested on sequences of species downloaded from the National
Center for Biotechnology Information (NCBI).
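The idea of converting character strings to numerical values and then comparing them can be illustrated with a minimal Python sketch. It is not the hybrid model proposed in the paper, whose details are not given at this point. It assumes a composition vector built from relative k-mer frequencies and a Euclidean distance as the dissimilarity measure. The function names composition_vector and euclidean_distance, the parameter k and the two example sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def composition_vector(seq, k=1):
    """Relative frequency of every k-mer over the alphabet A, C, G, T.

    Illustrative assumption: k-mers are counted with overlap, and any
    k-mer containing a character outside A, C, G, T is ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical example sequences, not taken from the paper's data set.
s1 = "ACGTACGTAC"
s2 = "ACGTTTGGAC"
d = euclidean_distance(composition_vector(s1, k=2),
                       composition_vector(s2, k=2))
print(f"dissimilarity = {d:.4f}")
```

A smaller distance indicates more similar nucleotide composition. Other distance measures, or entropy-based and LZ-complexity-based measures, could be substituted for the Euclidean distance used here.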
Keywords: nucleotides; entropy; frequency vector.
Reference to this paper should be made as follows: Munjal, G., Sharma, P. and
Gaur, D. (2018) ‘Sequence similarity using composition method’, Int. J. Data
Science, Vol. 3, No. 1, pp.19–28.
Biographical notes: Geetika Munjal is pursuing her PhD in Data Mining at the
Institute of Technology and Management, The Northcap University (formerly
ITM University), Gurgaon, Haryana, India.
Pooja Sharma is an MTech student at The Northcap University (formerly ITM
University), Gurgaon, Haryana, India.
Deepti Gaur is an Associate Professor at The Northcap University (formerly
ITM University), Gurgaon, Haryana, India.
1 Introduction
Sequence analysis is the process of applying a wide range of methods to DNA sequences
for understanding structure, features or evolution of sequences. Sequence analysis is used