Stat. Appl. Genet. Mol. Biol. 2015; 14(2): 113–123

Yulia M. Suvorova* and Eugene V. Korotkov

Study of triplet periodicity differences inside and between genomes

Abstract: Triplet periodicity (TP) is a distinctive feature of the protein coding sequences of both prokaryotic and eukaryotic genomes. In this work, we explored the TP difference inside and between 45 prokaryotic genomes. We constructed two hypotheses of TP distribution on a set of coding sequences and generated artificial datasets that correspond to the hypotheses. We found that TP is more similar inside a genome than between genomes and that the TP distribution inside a real genome dataset corresponds to the hypothesis which implies that a common TP pattern exists for the majority of sequences inside a genome. Additionally, we performed gene classification based on TP matrixes. This classification showed that TP allows identification of the genome to which a given gene belongs with more than 85% accuracy.

Keywords: gene classification; genomes comparison; protein coding genes; triplet periodicity.

DOI 10.1515/sagmb-2013-0063

1 Introduction

It is well known that biological sequences contain periodicities of different period lengths (Trifonov and Sussman, 1980; Trifonov, 1998) and that this periodicity can be explicit or hidden (Korotkov et al., 1999, 2003). The best-known type is the triplet periodicity (abbreviated as TP) of genes encoding proteins (Fickett and Tung, 1992; Konopka, 1994; Li, 1997; Trifonov, 1999; Gao et al., 2005), along with periodicity with longer periods divisible by three (Korotkov et al., 1999, 2003). TP is a distinguishing property of protein coding sequences and is absent in non-coding regions and introns (Fickett, 1982; Eskesen et al., 2004). TP is characterized by an unequal nucleotide distribution in the different codon positions. The TP property is found in all living organisms, ranging from bacteria to mammals.
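The unequal nucleotide distribution across codon positions can be made concrete with a small count table. The sketch below is only an illustration, assuming a plain 4 × 3 table of nucleotide counts per codon position for a sequence that starts in frame; the names `tp_matrix` and `column_frequencies` and the toy sequence are hypothetical, and the TP matrixes used in this work may be defined differently.

```python
NUCLEOTIDES = "ACGT"


def tp_matrix(seq, period=3):
    """Count each nucleotide at each position of the period.

    Returns a dict mapping nucleotide -> list of `period` counts.
    Assumes the sequence starts in frame; characters other than
    A, C, G and T are skipped but still advance the position.
    """
    counts = {nt: [0] * period for nt in NUCLEOTIDES}
    for i, nt in enumerate(seq.upper()):
        if nt in counts:
            counts[nt][i % period] += 1
    return counts


def column_frequencies(matrix):
    """Convert counts to per-position frequencies (each column sums to 1)."""
    period = len(next(iter(matrix.values())))
    totals = [sum(matrix[nt][pos] for nt in matrix) for pos in range(period)]
    return {
        nt: [row[pos] / totals[pos] if totals[pos] else 0.0
             for pos in range(period)]
        for nt, row in matrix.items()
    }


# Toy in-frame sequence (hypothetical, for illustration only).
toy = "ATGGCTGCAGCTGCGGCC"
m = tp_matrix(toy)
f = column_frequencies(m)

for nt in NUCLEOTIDES:
    print(nt, m[nt], [round(x, 2) for x in f[nt]])
```

In this toy sequence G dominates the first codon position and C the second, while the third position is mixed; a sequence without TP would instead show roughly the same nucleotide frequencies in all three columns.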
*Corresponding author: Yulia M. Suvorova, Bioinformatics Laboratory, Centre of Bioengineering of the Russian Academy of Sciences, 117312, Prospect 60-tya Oktyabrya, Moscow, Russian Federation, e-mail: suvorovay@gmail.com
Eugene V. Korotkov: Bioinformatics Laboratory, Centre of Bioengineering of the Russian Academy of Sciences, 117312, Prospect 60-tya Oktyabrya, Moscow, Russian Federation; and Department of Applied Mathematics, National Nuclear Investigational University (MIFI), 115522, Kashirskoe Shosse, 31, Moscow, Russian Federation

Different methods have been developed for TP detection in a sequence: Fourier analysis (Makeev and Tumanyan, 1996; Tiwari et al., 1997; Yan et al., 1998), wavelet decomposition (Mena-Chalco et al., 2008), frequency distribution of distances (FDD) for triplets (López-Villaseñor et al., 2004) and information decomposition (Korotkov et al., 2003). The presence of TP in a coding sequence is associated with several factors: first of all, the triplet structure of the genetic code; codon and amino acid bias (Antezana and Kreitman, 1999; Eskesen et al., 2004; Zoltowski, 2007); origin from ancient RNA (the so-called RNY) (Shepherd, 1981; Eskesen et al., 2004); the gene expression process (Trotta, 2011); and same-phase triplet clustering (Sánchez and López-Villaseñor, 2006). The TP feature has been used in the computational analysis of nucleotide sequences: for coding sequence identification (Fickett, 1982; Bernaola-Galván et al., 2000; Yin and Yau, 2007; Chen and Ji, 2012), mutation detection (Frenkel and Korotkov, 2009; Suvorova et al., 2012) and even the study of evolution (Tsonis et al., 1991). The set of TP matrixes of a genome is determined by the genes present in the genome. The TP of a gene is determined by the genetic code, the preferred synonymous codons as well as the amino acid composition of