Mao et al. / J Zhejiang Univ-Sci C (Comput & Electron) 2011 12(4):263-272

Journal of Zhejiang University-SCIENCE C (Computers & Electronics)
ISSN 1869-1951 (Print); ISSN 1869-196X (Online)
www.zju.edu.cn/jzus; www.springerlink.com
E-mail: jzus@zju.edu.cn

Structural visualization of sequential DNA data

Xiao-hong MAO§1, Jing-hua FU§2, Wei CHEN†‡2, Qian YOU3, Shiao-fen FANG3, Qun-sheng PENG2

(1 The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou 310013, China)
(2 State Key Lab of CAD & CG, Zhejiang University, Hangzhou 310058, China)
(3 Department of Computer and Information Science, Indiana University-Purdue University Indianapolis (IUPUI), Indianapolis, IN 46202, USA)
†E-mail: chenwei@cad.zju.edu.cn

Received Apr. 11, 2010; Revision accepted July 5, 2010; Crosschecked Jan. 31, 2011

Abstract: To date, comparing and visualizing genome sequences remain challenging due to the large genome size. Existing approaches take advantage of the stable property of oligonucleotides and exhibit the main characteristics of the whole genome, yet they commonly fail to show progression patterns of the genome adjustably. This paper presents a novel visual encoding technique, which not only supports the binning process (phylogenetic analysis), but also allows the sequential analysis of the genome. The key idea is to regard the combination of each k-nucleotide and its reverse complement as a visual word, and to represent a long genome sequence with a list of local statistical feature vectors derived from the local frequency of the visual words. Experimental results on a variety of examples demonstrate that the presented approach has the ability to quickly and intuitively visualize DNA sequences, and to help the user identify regions of differences among multiple datasets.
Key words: Genome sequence, Sequential visualization, Bio-information visualization
doi: 10.1631/jzus.C1000091    Document code: A    CLC number: TP391.1; R394.3

‡ Corresponding author
§ The two authors contributed equally to this work
* Project supported by the National Natural Science Foundation of China (Nos. 60873123 and 60903085), the National Basic Research Program (973) of China (No. 2010CB732504), the Natural Science Foundation of Zhejiang Province (No. Y1080618), and the Open Project Program of the State Key Lab of CAD & CG, Zhejiang University, China (No. A0905)
© Zhejiang University and Springer-Verlag Berlin Heidelberg 2011

1 Introduction

To study the differences and similarities among different organisms, biologists have focused on the intrinsic properties of their corresponding genome sequences. Particularly, one genome sequence contains many chromosomes which consist of four chemical bases (Adenine (A), Thymine (T), Cytosine (C), Guanine (G)) and their attached nucleotides. The genome sequence has special properties in terms of the frequencies of k-nucleotides (1 < k ≤ 6) (Zhou et al., 2008). In genetics, A pairs with T and C pairs with G. Thus, the reverse complement of a DNA sequence is its reverse, complement, or reverse-complement counterpart. For instance, the reverse and reverse complement of the five-nucleotide ACAGT are TGACA and ACTGT, respectively. The relative combined frequency of ACAGT and ACTGT over a sequence $\langle y_1, y_2, \ldots, y_N \rangle$, $y_i \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$, is given by

$$
\frac{1}{N-5+1} \sum_{i=1}^{N-5+1} \left( \delta_{y_i,\mathrm{A}}\, \delta_{y_{i+1},\mathrm{C}}\, \delta_{y_{i+2},\mathrm{A}}\, \delta_{y_{i+3},\mathrm{G}}\, \delta_{y_{i+4},\mathrm{T}} + \delta_{y_i,\mathrm{A}}\, \delta_{y_{i+1},\mathrm{C}}\, \delta_{y_{i+2},\mathrm{T}}\, \delta_{y_{i+3},\mathrm{G}}\, \delta_{y_{i+4},\mathrm{T}} \right), \tag{1}
$$

where

$$
\delta_{a,b} =
\begin{cases}
1, & a = b, \\
0, & a \neq b.
\end{cases}
$$

Traditionally, scientists have sought to study different organisms by analyzing individual homologous genes, markers, single-nucleotide polymorphisms (SNPs), and other features from the genome sequences.
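As a concrete reading of Eq. (1), the combined frequency of a k-nucleotide and its reverse complement can be computed with a sliding window. The following is a minimal sketch, not the authors' implementation: the names `reverse_complement` and `combined_frequency` are hypothetical, and a window that matches either word is counted once, which coincides with Eq. (1) whenever the word differs from its reverse complement (as ACAGT does).

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(kmer: str) -> str:
    """Reverse the k-nucleotide, then complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def combined_frequency(seq: str, kmer: str) -> float:
    """Relative combined frequency of kmer and its reverse complement.

    Slides a window of length k over seq (N - k + 1 positions) and
    returns the fraction of windows equal to either word.
    Assumption: a window matching either word counts once.
    """
    k = len(kmer)
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than the k-nucleotide")
    targets = {kmer, reverse_complement(kmer)}
    hits = sum(seq[i:i + k] in targets for i in range(windows))
    return hits / windows


if __name__ == "__main__":
    print(reverse_complement("ACAGT"))
    print(combined_frequency("ACAGTACTGT", "ACAGT"))
```

For the toy sequence ACAGTACTGT there are six windows of length five, of which the first (ACAGT) and the last (ACTGT) match, so the sketch returns 2/6.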
These features, however, require comprehensive domain knowledge, not to mention that the biological meanings of many segments in a genome sequence are still unclear. For instance, the evolu-