Int. J. Data Science, Vol. 3, No. 1, 2018
Copyright © 2018 Inderscience Enterprises Ltd.

Sequence similarity using composition method

Geetika Munjal*, Pooja Sharma and Deepti Gaur
Department of Computer Science and Engineering, The Northcap University, Gurgaon 122017, India
Email: munjal.geetika@gmail.com
Email: poojasharma0463@gmail.com
Email: deeptigaur@ncuindia.edu
*Corresponding author

Abstract: Deoxyribonucleic acid (DNA) has an enormous capacity to carry important information in the form of character strings. Sequence analysis is the process of applying a wide range of methods to DNA sequences to understand the structure, features or evolution of these nucleotide strings. The analysis uses mathematical methods to convert the character strings to numerical values, which are then used to measure similarity between sequences. DNA sequences contain only four nucleotides, A, C, G and T, but extracting information from them requires sequence comparison. In this paper, various methods for analysing DNA sequences are explored, including the use of entropy, divergence, LZ complexity and the role of hybridisation. A hybrid model based on the composition vector and distance methods is proposed to find dissimilarity between sequences, and this hybrid model is tested on sequences of species downloaded from the National Center for Biotechnology Information (NCBI).

Keywords: nucleotides; entropy; frequency vector.

Reference to this paper should be made as follows: Munjal, G., Sharma, P. and Gaur, D. (2018) ‘Sequence similarity using composition method’, Int. J. Data Science, Vol. 3, No. 1, pp.19–28.

Biographical notes: Geetika Munjal is pursuing her PhD in data mining at the Institute of Technology and Management, The Northcap University (formerly ITM University), Gurgaon, Haryana, India.
Pooja Sharma is an MTech student at The Northcap University (formerly ITM University), Gurgaon, Haryana, India.
Deepti Gaur is an Associate Professor at The Northcap University (formerly ITM University), Gurgaon, Haryana, India.

1 Introduction

Sequence analysis is the process of applying a wide range of methods to DNA sequences for understanding structure, features or evolution of sequences. Sequence analysis is used