Kala Karun. Int. Journal of Engineering Research and Application, www.ijera.com, ISSN: 2248-9622, Vol. 7, Issue 12 (Part 1), December 2017, pp. 18-23. DOI: 10.9790/9622-0712011823
REVIEW ARTICLE, OPEN ACCESS

Biological Sequence Alignment - A Review

Kala Karun*, Dancy Kurian**, Sheeja Y.S.***
* (Department of Computer Science & Engineering, College of Engineering Kottarakara, Kerala, India. Email: kalavipin@gmail.com)
** (Department of Computer Science & Engineering, College of Engineering Attingal, Kerala, India. Email: dancyk@gmail.com)
*** (Department of Computer Science & Engineering, College of Engineering Attingal, Kerala, India. Email: yssheeja@gmail.com)

ABSTRACT
Bioinformatics is an emerging interdisciplinary research area that deals with the computational management and analysis of biological information. Genomics, one of the most important domains in bioinformatics, compares genomic features such as DNA sequences, genes, regulatory sequences, and other genomic structural components of different organisms. Computers are used to gather, store, analyze, and integrate biological and genetic information, which can then be applied to gene-based drug discovery and development. Biological big data is generated by many different bioinformatics, biological, and biomedical experiments, and it can be presented as structured or unstructured data. Each cell in the body contains a whole genome, yet the data packed into a few DNA molecules could fill a hard drive. Biological big data is now reaching the scale of terabytes, petabytes, and exabytes, and its different modes of representation add complexity. This introduces many challenges, such as handling complex information, integrating heterogeneous resources, and analyzing big data. Advanced methods are needed to handle the volume of data and the speed of analysis, since scientists may require weeks or months if they use their own workstations.
Sequence alignment is a standard technique in bioinformatics for visualizing the relationships between residues in a collection of evolutionarily or structurally related proteins. More generally, a sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences; these regions are used to infer biological information. Because sequence databases are very large, the existing techniques and algorithms face several computational challenges. This paper reviews the major sequence alignment algorithms.

Keywords – Bioinformatics, Genomics, DNA molecules, Hadoop, RNA molecules

Date of Submission: 24-11-2017
Date of acceptance: 05-12-2017

I. INTRODUCTION
Sequence alignment is a technique in bioinformatics used for visualizing the relationships among the residues in a collection of structurally or evolutionarily related proteins. Given the amino acid sequences of a set of proteins to be compared, an alignment displays the residues of each protein on a single line, with gaps ("-") inserted so that "equivalent" residues appear in the same column. The precise meaning of equivalence is generally context dependent: for the phylogeneticist, equivalent residues have common evolutionary ancestry; for the structural biologist, equivalent residues correspond to analogous positions belonging to homologous folds in a set of proteins; for the molecular biologist, equivalent residues play similar functional roles in their corresponding proteins.
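The row-and-column layout described above can be illustrated with a minimal Python sketch. The two sequences below are made-up toy fragments, not data from any database, and the helper names `equivalent_columns` and `ungapped` are hypothetical, introduced only for this illustration. The sketch assumes the alignment has already been computed and only shows how gaps place equivalent residues in the same column.

```python
# Hypothetical toy alignment: two made-up protein fragments, already aligned.
# '-' marks a gap, i.e. a position with no equivalent residue in that row.
ALIGNMENT = {
    "seqA": "MKT-AYIAK",
    "seqB": "MKTLAY-AK",
}


def equivalent_columns(rows):
    """Return (column_index, residues) for every column that contains no gap."""
    names = list(rows)
    length = len(rows[names[0]])
    # Every row of an alignment must have the same number of columns.
    if any(len(rows[name]) != length for name in names):
        raise ValueError("aligned rows must have equal length")
    result = []
    for col in range(length):
        residues = tuple(rows[name][col] for name in names)
        if "-" not in residues:
            result.append((col, residues))
    return result


def ungapped(row):
    """Recover the original sequence by removing the gap characters."""
    return row.replace("-", "")


if __name__ == "__main__":
    # Print one sequence per line, as in a conventional alignment display.
    for name, row in ALIGNMENT.items():
        print(f"{name}  {row}")
    # List the columns in which both rows carry a residue.
    for col, residues in equivalent_columns(ALIGNMENT):
        print(col, residues)
```

Removing the gaps with `ungapped` gives back the input sequences, so the alignment adds only positional information. The same column structure applies whichever notion of equivalence (evolutionary, structural, or functional) the alignment is meant to capture.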
In each case, an alignment provides a bird's eye view of the underlying evolutionary, structural, or functional constraints characterizing a protein family in a concise, visually intuitive format [1].

The present era has seen an explosion of biological data and a great acceleration in the accumulation of biological knowledge. The reasons for this data explosion are the recombinant DNA technology used for DNA sequencing and the recent growth of genome sequencing projects. As a result, it is easier to obtain the DNA sequence of the gene corresponding to an RNA or protein than it is to determine its function or structure experimentally. Because of this, sequence databases (e.g. GenBank, maintained by NCBI, USA) are, to date, larger than structure databases (e.g. PDB, maintained by RCSB, USA). This provides a strong motivation for developing computational methods that can infer biological information from sequence alone. With the advent of modern computers and information technology, biological data have not only been stored in the computer in the form of databases but