A rapid method of whole genome visualisation illustrating features in both coding and non-coding regions

R. Hall
L. Stern

Dept. of Computer Science and Software Engineering
The University of Melbourne, Melbourne, Victoria 3010 Australia,
e-mail rshall@cs.mu.oz.au

Abstract

The application of Fourier analysis to a genome can be used as an indicator of gene coding regions. We have developed a visualisation of the Fourier spectra that allows convenient whole chromosome scanning for genes and other features. The method's rapid operation suits its application as a first pass analysis. Fourier analysis indicates a strong periodicity of 3 in coding regions of several different organisms and is independent of the orientation of the gene. A bitmap display of the Fourier spectra over a sliding window gives rapid visualisation and localisation of coding regions in the chromosomes of a number of different organisms. Non-coding features, such as regions of repetitive DNA, are visualised at the same time. The method works particularly well on organisms with a skewed base composition such as the malaria parasite Plasmodium falciparum and the protozoan Leishmania major.

Keywords: Fourier analysis; Plasmodium falciparum; Repeats; genome visualisation.

1 Introduction

The frequency content of a signal can be determined by the analysis of its Fourier transform (O'Neil 1991). When DNA is viewed as a signal, a discrete Fourier transform provides a method suited to the detection of periodic arrangements of bases in a genome. Before a spectral analysis of a genomic sequence can be performed, it must be converted from a string of four component bases to a numerical array. The numerical representation determines which features of the genome are highlighted by the analysis. This translation can be performed in a number of ways.
Silverman and Linsker (1986) represent each base as the vertex of a tetrahedron in three dimensional space, so that the genome sequence is transformed into an array of three dimensional vectors. A Fourier transform is performed on each of the three sequences formed by taking one directional component of the sequence vectors. The resulting spectrum is the sum of the three Fourier transforms. Tiwari et al. (1997) use four binary strings to represent the occurrence of each base in the nucleotide sequence, summing the individual spectra to give an overall sample spectrum for the genome. Using this method, a repeating period of 3 has been found in coding regions (Fickett & Tung 1992, Tiwari et al. 1997). Fourier analysis has also been employed as one of a number of weighting factors in the determination of intron splice sites (Huestis & Saul 2001).

Copyright © 2004, Australian Computer Society, Inc. This paper appeared at The Second Asia-Pacific Bioinformatics Conference (APBC2004), Dunedin, New Zealand. Conferences in Research and Practice in Information Technology, Vol. 29. Yi-Ping Phoebe Chen, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

For a clear Fourier spectrum to be generated, a reasonably sized sample is required. Larger samples, however, give relatively poor resolution, making it difficult to locate features exactly using only the Fourier transform. This paper shows that an appropriate visual representation of the Fourier analysis can provide a guide to gene location and other periodic features in the genome.

2 Algorithm

A numerical representation of a sample of DNA is constructed for each of the four nucleotides. The four arrays are implemented using binary strings as described by Tiwari et al. (1997), where each occurrence of the relevant base is indicated by a 1 and any other base by a 0.
For example, a short sequence such as ACTGAGCTA is transformed into the four strings: 100010001 referring to the A nucleotide, 010000100 for C, 000101000 for G and 001000010 for T. A Fourier transform is performed on each array, giving a spectrum which indicates whether that particular base appears periodically in the DNA sample. The sum of the squares of the individual spectrum components is taken to produce an overall Fourier analysis for the particular sample of the genome. The algorithm was implemented in C.

An overall view of a genomic sequence is constructed by taking sequential, overlapping samples and performing a combined Fourier analysis on each. The samples are overlapped to detect periodicities spanning two adjacent samples. The spectrum from each analysis is scaled by a constant factor and converted to a column of grey-scale pixels. The resulting bitmap has a width of the genome length divided by half the sample size and a depth of half the sample size (only the first half of the Fourier transform being significant).

The line of pixels representing each Fourier sum is scaled and transformed such that a high power in the spectrum is represented by a dark pixel, while low power values are assigned light pixels. The scaling factor provides a means of enhancing lower peaks, as all peaks over an arbitrary value can be marked black. This allows spectrum powers which are only marginally above the background noise to be assigned the same visual significance as major peaks. Due to the sequential sampling, small peaks over a number of contiguous samples produce a distinct line in the bitmap. For example, using a sample of 256 bases, a continuous horizontal line at a pixel column height of 85 indicates a strong period 3 in these samples (256/85 ≈ 3).

Using a sample size of 512 bases, a one million base genome produces a grey-scale bitmap in the PGM format, 256 pixels x 3900 pixels, in about 20 seconds on a 333 MHz Sun Ultra Sparc.
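The encoding and combined spectrum described in this section can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' C implementation. The function names (`indicator_strings`, `power_spectrum`, `combined_spectrum`, `period_of_bin`) are hypothetical. The sketch reads "sum of the squares of the individual spectrum components" as the sum over the four bases of the squared DFT magnitudes, which is an assumption, and it uses a naive O(N^2) transform rather than a fast Fourier transform.

```python
import cmath

BASES = "ACGT"

def indicator_strings(seq):
    """One 0/1 list per base: 1 where that base occurs, 0 elsewhere."""
    return {b: [1 if ch == b else 0 for ch in seq] for b in BASES}

def power_spectrum(signal):
    """Squared DFT magnitude for bins 0 .. N/2 - 1 (naive O(N^2) DFT)."""
    n = len(signal)
    powers = []
    for k in range(n // 2):
        x = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(signal))
        powers.append(abs(x) ** 2)
    return powers

def combined_spectrum(seq):
    """Sum the four per-base power spectra, bin by bin (assumed reading)."""
    ind = indicator_strings(seq)
    spectra = [power_spectrum(ind[b]) for b in BASES]
    return [sum(s[k] for s in spectra) for k in range(len(seq) // 2)]

def period_of_bin(sample_size, k):
    """Period (in bases) corresponding to frequency bin k."""
    return sample_size / k

if __name__ == "__main__":
    ind = indicator_strings("ACTGAGCTA")
    for b in BASES:
        print(b, "".join(str(v) for v in ind[b]))
    # A perfectly period-3 toy sequence: the non-DC peak falls at bin N/3.
    spec = combined_spectrum("ACG" * 16)
    peak = max(range(1, len(spec)), key=lambda k: spec[k])
    print(peak, period_of_bin(48, peak))
```

Bin 0 is the constant (DC) term and is skipped when searching for the peak, since it reflects base counts rather than periodicity.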
The convenient size of the bitmap means that multiple bitmaps can be viewed and compared using standard graphics software. An X Window program is in development which will allow convenient scrolling of the bitmap image along with scales giving sequence offsets and periodicity values.
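The sliding-window bitmap construction of Section 2 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' program. The names `combined_power`, `spectrum_columns`, `to_grey` and `pgm_text` are hypothetical, as is the `black_level` parameter standing in for the paper's scaling factor. The plain-text (P2) variant of PGM and the placement of bin 0 on the bottom row are choices made here, since the source specifies neither. The naive transform is far slower than the timing reported in the paper would require.

```python
import cmath

def combined_power(window):
    """Summed power of the four base-indicator DFTs, bins 0 .. N/2 - 1."""
    n = len(window)
    out = []
    for k in range(n // 2):
        w = [cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)]
        total = 0.0
        for base in "ACGT":
            x = sum(w[t] for t in range(n) if window[t] == base)
            total += abs(x) ** 2
        out.append(total)
    return out

def spectrum_columns(seq, window=512):
    """Slide a half-overlapping window along seq; one spectrum per step."""
    step = window // 2
    return [combined_power(seq[s:s + window])
            for s in range(0, len(seq) - window + 1, step)]

def to_grey(power, black_level):
    """High power -> dark pixel; anything at or above black_level is black (0)."""
    clipped = min(power, black_level)
    return 255 - int(255 * clipped / black_level)

def pgm_text(columns, black_level):
    """Render as plain-text PGM: one pixel column per window, bin 0 at the bottom."""
    width, height = len(columns), len(columns[0])
    rows = []
    for k in range(height - 1, -1, -1):
        rows.append(" ".join(str(to_grey(col[k], black_level))
                             for col in columns))
    return f"P2\n{width} {height}\n255\n" + "\n".join(rows) + "\n"

if __name__ == "__main__":
    # Toy input: a period-3 repeat gives a black horizontal line at bin N/3.
    cols = spectrum_columns("ACG" * 64, window=48)
    with open("spectrum.pgm", "w") as fh:
        fh.write(pgm_text(cols, black_level=100.0))
```

In this sketch, lowering `black_level` marks weaker peaks black, which mirrors the peak enhancement described in the text. Bin 0 is the DC term and always renders as a dark row along the bottom edge.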