• Clustered Regularly Interspaced Short
Palindromic Repeats (CRISPRs) are genomic features of
many bacterial and archaeal species.
• A CRISPR region is composed of direct repeats, each 24-48
bases long, separated by non-repetitive
spacer sequences of approximately the same
length.
• CRISPR regions appear to be among the most
rapidly evolving elements in the genome: closely related
species and strains, sometimes more than 99% identical
at the DNA level, differ in their CRISPR composition.
Analyzing DNA Sequences Through Graph Entropy and Chaos Game
Tale of Two Ideas
Dipendra C. Sengupta*, Jharna D. Sengupta*, & Scott Funkhouser**
*Department of Mathematics & Computer Science, Elizabeth City State University, NC
**SPAWAR Systems Center Atlantic, Charleston, SC
In our project, we analyzed DNA sequences using statistical methods such
as graph entropy and mapping rules such as the Chaos Game Representation (CGR).
The general aim was to find interesting
sections of a genome using a new formulation of Shannon-like graph
entropy and to understand the characteristics of the genome through
visualization.
We developed a Graph Entropy tool in MATLAB to identify tandem
repeats and direct repeats in a genome. We ran experiments on 26
species and found many tandem repeats and direct repeats (CRISPRs in
bacteria and archaea); some are new and some are already
known. Several separate CRISPR finders and tandem repeat finders
exist, but our entropy measure can find both features when they are present in a genome.
We developed a Mathematica program to compute CGR graphics and their
fractal dimension. We observed similarity of the CGR and the dinucleotide
probability matrix within a chromosome and dissimilarity among genomes.
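The Chaos Game Representation mentioned above can be illustrated with a minimal Python sketch (the poster's own program is in Mathematica and is not reproduced here). The corner assignment for A, C, G, and T and the name `cgr_points` are assumptions chosen for illustration; the poster does not state which convention it uses.

```python
# Corner assignment is an assumption (one common convention);
# the poster does not specify which corners it uses.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return one CGR point per base.

    Start at the centre of the unit square and, for each nucleotide,
    move halfway toward that nucleotide's corner.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


# Hypothetical usage on a short example string.
pts = cgr_points("AAGACGGCAGCT")
```

Plotting `pts` as a scatter plot gives the CGR image; every point stays inside the unit square, and sequences sharing a suffix land in the same sub-square.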
A DNA sequence is composed of four nucleotides: adenine (A),
cytosine (C), guanine (G), and thymine (T). Since the DNA molecule carries
plentiful biological, physical, and chemical information, it has become very
important to analyze DNA sequences statistically. The nucleotides stored
in GenBank now exceed hundreds of millions of bases, and the number is
growing rapidly. Biologists, physicists, mathematicians,
and computer specialists have therefore adopted different techniques to study DNA
sequences in recent years, including statistical methods and mapping
rules for the bases.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Major challenge: how to decipher the enormously long
nucleotide sequences that are being uncovered in the
genomes of living organisms?
Our main objectives are to analyze DNA sequences to
determine interesting sections of a genome, such as repeating
features (CRISPRs or tandem repeats), using
mathematical and statistical tools, and to understand the
characteristics of the genome through visualization.
Identification of repeats in genomic sequences is
important for a variety of applications in biology.
Distinguishing these regions mathematically
saves valuable time and supports better experimental design.
AAGACGGCAGCTAAAAAAGCTCCAGTAAGAAAAGTCGCAGCTAAGAAGACTGTGG
CTCGTAAAACTGTAGCTAAAAAAGCAGTAGCAGCTCGCAAAACAGTAGCTAAAAAA
TCTGTAGCAGCTAGAAAGACGGCAGCTAAAAAAGCTCCAGTAAGAAAAGTTGCAG
CTAAGAAGACTGTGGCTCGCAAGACTGTAGCTAAAAAAGCAGTAGCAGCT…….
[Worked example: an example sequence mapped to a graph, with its associated 8 × 8 matrices]
Probability matrix for generating random walks of all possible lengths. For our sequence,

\[
\Omega = \begin{pmatrix}
0.00061 & 0.12319 & 0.06192 & 0.00068 & 0.00911 & 0.00953 & 0.00442 & 0.00479 \\
0.00032 & 0.00444 & 0.00442 & 0.00476 & 0.06156 & 0.06667 & 0.00031 & 0.00034 \\
0.00802 & 0.00156 & 0.00094 & 0.00000 & 0.00448 & 0.00012 & 0.06129 & 0.06639 \\
0.05616 & 0.00879 & 0.00442 & 0.00004 & 0.00065 & 0.00068 & 0.00031 & 0.00034 \\
0.00059 & 0.06165 & 0.06160 & 0.00034 & 0.00471 & 0.00477 & 0.00440 & 0.00476 \\
0.00401 & 0.00062 & 0.00031 & 0.06633 & 0.00004 & 0.00004 & 0.00002 & 0.00002 \\
0.11233 & 0.01759 & 0.00884 & 0.00009 & 0.00130 & 0.00136 & 0.00063 & 0.00068 \\
0.00004 & 0.00440 & 0.00440 & 0.00002 & 0.06156 & 0.00034 & 0.00031 & 0.00034
\end{pmatrix}
\]

The graph entropy is

\[
H(\Omega) = -\sum_{i,j} \Omega_{ij} \log(\Omega_{ij})
\]
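As a small illustration of the entropy formula H(Ω) = −Σ Ω_ij log(Ω_ij), the Python sketch below evaluates it on the 8 × 8 matrix of the poster's worked example. The values are copied in the order they appear on the poster. The natural logarithm and the convention that zero entries contribute nothing are assumptions, as is the function name `graph_entropy`; the poster's MATLAB tool is not reproduced here.

```python
import math

# Matrix entries copied from the poster's worked example.
OMEGA = [
    [0.00061, 0.12319, 0.06192, 0.00068, 0.00911, 0.00953, 0.00442, 0.00479],
    [0.00032, 0.00444, 0.00442, 0.00476, 0.06156, 0.06667, 0.00031, 0.00034],
    [0.00802, 0.00156, 0.00094, 0.00000, 0.00448, 0.00012, 0.06129, 0.06639],
    [0.05616, 0.00879, 0.00442, 0.00004, 0.00065, 0.00068, 0.00031, 0.00034],
    [0.00059, 0.06165, 0.06160, 0.00034, 0.00471, 0.00477, 0.00440, 0.00476],
    [0.00401, 0.00062, 0.00031, 0.06633, 0.00004, 0.00004, 0.00002, 0.00002],
    [0.11233, 0.01759, 0.00884, 0.00009, 0.00130, 0.00136, 0.00063, 0.00068],
    [0.00004, 0.00440, 0.00440, 0.00002, 0.06156, 0.00034, 0.00031, 0.00034],
]


def graph_entropy(omega):
    """Return -sum(p * log(p)) over all nonzero entries of omega.

    Natural log is an assumption; zero entries are skipped (0 * log 0 := 0).
    """
    return -sum(p * math.log(p) for row in omega for p in row if p > 0)


total = sum(sum(row) for row in OMEGA)  # close to 1, up to rounding
h = graph_entropy(OMEGA)
```

Because the entropy depends only on the set of entries, the result does not change if rows or columns are reordered.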
Low entropy = high repeatability/CRISPR
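Why a repeat-rich region shows up as an entropy drop can be sketched in Python with a simpler stand-in: plain Shannon entropy of k-mer frequencies in a sliding window. This is not the poster's graph entropy, and the window length, step, k, and the synthetic test genome are all hypothetical choices made only for illustration.

```python
import math
import random
from collections import Counter


def window_entropy(seq, k=3):
    """Shannon entropy (bits) of the overlapping k-mer frequencies in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(genome, window=600, step=100, k=3):
    """Return (window_start, entropy) pairs along the genome (0-based starts)."""
    return [(s, window_entropy(genome[s:s + window], k))
            for s in range(0, len(genome) - window + 1, step)]


def random_bases(rng, n):
    """Synthetic background sequence; purely illustrative."""
    return "".join(rng.choice("ACGT") for _ in range(n))


# Synthetic genome: random background with a tandem repeat planted at 2000.
rng = random.Random(0)
genome = (random_bases(rng, 2000)
          + "GCCGGTGCAGCTGCCTTCTTGG" * 30
          + random_bases(rng, 2000))
profile = entropy_profile(genome)
lowest_start, lowest_h = min(profile, key=lambda t: t[1])
```

The window with the lowest entropy falls on the planted repeat, which mirrors the "lowest drop" reading of the entropy graphs.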
Acidovorax (Bacteria) – Entropy Graph
Lowest drop: (x=871100, y=4.088); Position (871000, 871600)
ATAAAAAAACCCGGTGCATGCACCGGGTGGGACCAGCCCCGCGGGCGGGGCGGCTGGCTGCTGTC
GTCGCTCAGGGCTTGGTGCCCGTCGGGAAGGGCCATGCGGCCTGCGGGTTCAGCGTGGTCTGTGCT
GCGGGTGCAGGCGCAGGGGCAGAGGCCTTGGAGGCCGCCTTTTTCGGGGCAGCCTTCTTCGGT
GCAGCGGCCTTGGTCGTGCCGGTGGCCTTCTTC
GCCGGTGCAGCTGCCTTCTTGGTGGAGGCTGCGGCCTTCTTT
GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTC
GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTT
GCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTT
GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTT
GCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTT
GCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTT
GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCAGCCTTCTTC
GCCGGAGCGGCCTTCTTCGTCGTGGCGGCGGCCTTCTT
strfind(g,'GCCGGTGCAGCTGCCTTCTTGG')
871227 871269 871311 871353 871395 871437 871479 871521
These are tandem repeats.
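The position search can be mirrored in Python with a short sketch. `find_all` is a hypothetical helper that imitates the 1-based, overlapping matches returned by MATLAB's `strfind`. It is run on three consecutive 42-base units copied from the Acidovorax excerpt above, not on the full genome `g`.

```python
def find_all(text, motif):
    """Return 1-based start positions of every (possibly overlapping) match."""
    positions = []
    start = text.find(motif)
    while start != -1:
        positions.append(start + 1)  # convert to 1-based, as in MATLAB
        start = text.find(motif, start + 1)
    return positions


# Three consecutive 42-base units copied from the excerpt above.
fragment = (
    "GCCGGTGCAGCTGCCTTCTTGGTGGAGGCTGCGGCCTTCTTT"
    "GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTC"
    "GCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTT"
)
hits = find_all(fragment, "GCCGGTGCAGCTGCCTTCTTGG")
spacing = [b - a for a, b in zip(hits, hits[1:])]
```

The matches are 42 bases apart, the same spacing as the genome positions listed above (871269 − 871227 = 42). This constant spacing with no intervening spacer is what marks the region as a tandem repeat.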
Human Chromosome 21
(x=44010000, y=4.13); Position interval (44009900:44010500)
CCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTATATCCACGCAGGCG
TTTCCCCTTACCTGCACCGGGCCTGCCGCCCCGTTTACATCCACGCATGCG
TTTCCCCTTACCTGCACTG
The string 'TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG'
repeats 18 times!
The string 'TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG'
repeats 23 times!
Eukaryotes:
Homo sapiens (human) chromosomes 19 & 21,
Anopheles gambiae (insect), Caenorhabditis elegans
(worm), Plasmodium falciparum (causes malaria),
Saccharomyces cerevisiae (yeast)
Prokaryotes:
Acidovorax, Ammonifex, Caldicellulosiruptor
kristjanssonii, E. coli, Salmonella Typhi, Listeria
monocytogenes, Bacillus clausii KSM, Chlamydia
muridarum Nigg, Cyanobacterium aponinum,
Gluconacetobacter diazotrophicus, Haemophilus
influenzae R2866, Mycobacterium tuberculosis,
Mycoplasma genitalium, Neisseria meningitidis,
Streptococcus pneumoniae, Thermosipho africanus,
Truepera radiovictrix (Bacteria), A. fulgidus (Archaea)
Viruses:
HIV, Hepatitis B