Analyzing DNA Sequences Through Graph Entropy and Chaos Game: Tale of Two Ideas

Dipendra C. Sengupta*, Jharna D. Sengupta*, & Scott Funkhouser**
*Department of Mathematics & Computer Science, Elizabeth City State University, NC
**SPAWAR Systems Center Atlantic, Charleston, SC

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) are genomic features of many bacterial and archaeal species.
• A CRISPR region is composed of direct repeats 24-48 bases long, separated by non-repetitive spacer sequences of approximately the same length.
• CRISPR regions appear to be among the most rapidly evolving elements in the genome: closely related species and strains, sometimes more than 99% identical at the DNA level, differ in their CRISPR composition.

In our project, we analyzed DNA sequences using statistical methods such as graph entropy and mapping rules such as Chaos Game Representation (CGR). The general aim was to analyze DNA sequences and find interesting sections of a genome using a new formulation of Shannon-like graph entropy, and to understand the characteristics of the genome through visualization. We developed a Graph Entropy tool in MATLAB to identify tandem repeats and direct repeats in a genome. We ran experiments on 26 species and found many tandem repeats and direct repeats (CRISPR for bacteria or archaea); some of them are new and some are already known.
Several separate CRISPR finders and tandem repeat finders already exist, but our entropy measure can find both kinds of feature when they are present in a genome. We developed a Mathematica program to compute CGR graphics and their fractal dimension. We observed similarity of the CGR and the dinucleotide probability matrix within a chromosome, and dissimilarity among genomes.

A DNA sequence is composed of four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Because the DNA molecule carries a wealth of biological, physical, and chemical information, it has become very important to analyze DNA sequences statistically. The number of nucleotides stored in GenBank now exceeds hundreds of millions of bases and is growing rapidly. Biologists, physicists, mathematicians, and computer specialists have therefore adopted different techniques in recent years to study DNA sequences, including statistical methods and mapping rules for the bases.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Major challenge: how do we decipher the enormously long nucleotide sequences that are being uncovered in the genomes of living organisms?

Our main objectives are to analyze DNA sequences to determine interesting sections of a genome, such as repeating features (CRISPR or tandem repeats), using mathematical and statistical tools, and to understand the characteristics of the genome through visualization. Identifying repeats in genomic sequences is important for a variety of applications in biology. Distinguishing these regions mathematically saves valuable time and supports better experimental design.
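The Chaos Game Representation mentioned above can be illustrated with a minimal Python sketch. It is not the authors' Mathematica program. The corner assignment (A, C, G, T on the corners of the unit square) is one common convention and is an assumption here, as are the helper names `cgr_points` and `cgr_grid`. Each base moves the current point halfway toward its corner, so every k-letter word lands in its own cell of a 2^k × 2^k grid.

```python
from typing import List, Tuple

# Assumed corner convention (not taken from the poster):
# A bottom-left, C top-left, G top-right, T bottom-right.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> List[Tuple[float, float]]:
    """Return the CGR trajectory of a DNA string, starting from the square's centre."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip ambiguous symbols such as N
        # Move halfway from the current point toward the base's corner.
        x = (x + corner[0]) / 2.0
        y = (y + corner[1]) / 2.0
        points.append((x, y))
    return points


def cgr_grid(seq: str, k: int = 2) -> List[List[int]]:
    """Count CGR points in a 2^k x 2^k grid (row 0 is the bottom row, y near 0)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

Plotting the points from `cgr_points` as a scatter plot gives the CGR image. The grid from `cgr_grid` is a coarse numeric summary that could be compared between sequences.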
,+!     +! !%#8 AAGACGGCAGCTAAAAAAGCTCCAGTAAGAAAAGTCGCAGCTAAGAAGACTGTGG CTCGTAAAACTGTAGCTAAAAAAGCAGTAGCAGCTCGCAAAACAGTAGCTAAAAAA TCTGTAGCAGCTAGAAAGACGGCAGCTAAAAAAGCTCCAGTAAGAAAAGTTGCAG CTAAGAAGACTGTGGCTCGCAAGACTGTAGCTAAAAAAGCAGTAGCAGCT……. !%# (  &$ &9’* """"" 4 :  ;$  <#                 ! = &  && & &  &  - - - . - - - - - - - - - - - 4 - - - - . - - - - - - - - . . - - - - - - - - . . . - - - - - - - - . . - - - - - - - - - . 4 - &  && & &  &  0 7 1 14 1 0 0 0 0 0 0 0 0 0 14 1 14 1 0 0 0 0 0 0 0 0 14 1 14 1 14 1 0 0 0 0 0 0 0 0 14 1 14 1 0 0 0 0 0 0 0 0 14 1 0 0 0 0 7 1 0 0 0 0 0 0 0 0 0 0 0 14 1 0 0 0 - - - - - - - - 14 3 1 0 0 0 0 0 0 0 0 14 2 1 0 0 0 0 0 0 0 0 14 2 1 0 0 0 0 0 0 0 0 14 1 1 0 0 0 0 0 0 0 0 14 2 1 0 0 0 0 0 0 0 0 14 1 1 0 0 0 0 0 0 0 0 7 1 1 0 0 0 0 0 0 0 0 14 1 1 0 49 6 49 3 0 0 0 0 0 0 0 0 0 49 3 196 13 0 0 0 0 0 0 0 0 49 3 196 13 196 11 0 0 0 0 0 0 0 0 49 3 49 3 0 0 0 0 0 0 0 0 196 13 0 0 0 0 98 11 0 0 0 0 0 0 0 0 0 0 0 49 3 0 0 0 0 0 0 0 343 3 1372 13 686 3 2744 13 0 686 3 686 3 2744 13 0 0 0 0 1372 11 0 0 0 686 3 0 0 0 0 343 3 686 3 0 0 0 0 0 0 0 0 0 686 3 2744 13 686 3 2744 13 2744 11 0 0 0 0 0 0 0 0 343 3 343 3 0 0 0 0 0 0 686 3 686 3 0 0 0 0 0 ( ) = Ω = Ω = = + + + = Ω = Ω 00061 . 12319 . 06192 . 00068 . 00911 . 00953 . 00442 . 00479 . 00032 . 00444 . 00442 . 00476 . 06156 . 06667 . 00031 . 00034 . 00802 . 00156 . 00094 . 00000 . 00448 . 00012 . 06129 . 06639 . 05616 . 00879 . 00442 . 00004 . 00065 . 00068 . 00031 . 00034 . 00059 . 06165 . 06160 . 00034 . 00471 . 00477 . 00440 . 00476 . 00401 . 00062 . 00031 . 06633 . 00004 . 00004 . 00002 . 00002 . 11233 . 01759 . 00884 . 00009 . 00130 . 00136 . 00063 . 00068 . 00004 . 00440 . 00440 . 00002 . 06156 . 00034 . 00031 . 00034 . . 1 , ; 1 ....... 3 2 lengths possible all of walk random generating for matrix Prob. 
sequence, our For j i ij l Q l P Q P Q P PQ ij ) log( ) ( , ij j i ij H Ω Ω - = Ω +#$ #8 ("    .* *//!,"0// $   * eukaryotes: "’ "’ !’  ’  #>&* ,!’ ! <’& &0   1",  &  # ,# !   ’ #!" " 2 345 ,   2 4-,  6   4 0" Low entropy=high repeatability/CRISPR # "% +$ = 7 =  " 4 8 7 = Acidovorax(Bacteria) – Entropy Graph Lowest Drop: (x=871100, y=4.088) ; Position (871000, 871600) ATAAAAAAACCCGGTGCATGCACCGGGTGGGACCAGCCCCGCGGGCGGGGCGGCTGGCTGCTGTC GTCGCTCAGGGCTTGGTGCCCGTCGGGAAGGGCCATGCGGCCTGCGGGTTCAGCGTGGTCTGTGCT GCGGGTGCAGGCGCAGGGGCAGAGGCCTTGGAGGCCGCCTTTTTCGGGGCAGCCTTCTTCGGT GCAGCGGCCTTGGTCGTGCCGGTGGCCTTCTTC GCCGGTGCAGCTGCCTTCTTGGTGGAGGCTGCGGCCTTCTTT GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTC GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTT GCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTT GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTT GCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTT GCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTT GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCAGCCTTCTTC GCCGGAGCGGCCTTCTTCGTCGTGGCGGCGGCCTTCTT strfind(g,'GCCGGTGCAGCTGCCTTCTTGG') 871227 871269 871311 871353 871395 871437 871479 871521 These are tandem repeats. 
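The workflow in this panel (scan for an entropy drop, then locate the repeated motif) can be sketched in Python. This is a hedged illustration, not the authors' MATLAB tool. The matrix Ω is replaced by a stand-in, the normalized dinucleotide counts of each window, and the natural logarithm is assumed. The window and step sizes and the names `transition_entropy`, `entropy_scan`, `find_all`, and `spacings` are hypothetical. Only the entropy formula and the 1-based, overlap-allowing behaviour of `strfind` follow the text.

```python
import math
from typing import Dict, List, Tuple

BASES = "ACGT"


def transition_entropy(window: str) -> float:
    """Shannon entropy -sum(w * ln w) over normalized dinucleotide counts.

    The dinucleotide frequencies are a stand-in for the poster's matrix Omega.
    """
    counts: Dict[str, int] = {}
    total = 0
    for a, b in zip(window, window[1:]):
        if a in BASES and b in BASES:
            counts[a + b] = counts.get(a + b, 0) + 1
            total += 1
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_scan(seq: str, window: int = 600, step: int = 100) -> List[Tuple[int, float]]:
    """Return (0-based window start, entropy) pairs; sizes are placeholder values."""
    last_start = max(len(seq) - window, 0)
    return [(s, transition_entropy(seq[s:s + window]))
            for s in range(0, last_start + 1, step)]


def find_all(seq: str, motif: str) -> List[int]:
    """1-based start of every match, overlaps included (mirrors MATLAB strfind)."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i + 1)
        i = seq.find(motif, i + 1)
    return positions


def spacings(positions: List[int]) -> List[int]:
    """Distances between consecutive match positions."""
    return [b - a for a, b in zip(positions, positions[1:])]
```

A window with the lowest entropy in `entropy_scan` is a candidate repeat region. If `spacings(find_all(seq, motif))` returns a constant value equal to the repeat unit length, the copies are adjacent, which is the signature of a tandem repeat. Spacings larger than the unit length would instead point to direct repeats separated by spacers.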
Human chromosome 21: x = 44010000, y = 4.13; position interval (44009900:44010500)

CCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGGGCCTGCCGCCCCGTTTACATCCACGCATGCG
TTTCCCCTTACCTGCACTG

The string 'TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG' repeats 18 times.
The string 'TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG' repeats 23 times.

Eukaryotes: Homo sapiens (human) chromosomes 19 and 21, Anopheles gambiae (insect), Caenorhabditis elegans (worm), Plasmodium falciparum (causes malaria), Saccharomyces cerevisiae (yeast)

Prokaryotes: Acidovorax, Ammonifex, Caldicellulosiruptor kristjanssonii, E. coli, Salmonella Typhi, Listeria monocytogenes, Bacillus clausii KSM, Chlamydia muridarum Nigg, Cyanobacterium aponinum, Gluconacetobacter diazotrophicus, Haemophilus influenzae R2866, Mycobacterium tuberculosis, Mycoplasma genitalium, Neisseria meningitidis, Streptococcus pneumoniae, Thermosipho africanus, Truepera radiovictrix (Bacteria), A. fulgidus (Archaea)

Viruses: HIV, Hepatitis B