HUMAN MUTATION 28(5), 451–458, 2007

RESEARCH ARTICLE

CAG and CTG Repeat Polymorphism in Exons of Human Genes Shows Distinct Features at the Expandable Loci

Matylda Rozanska, Krzysztof Sobczak, Anna Jasinska, Marek Napierala, Danuta Kaczynska, Anna Czerny, Magdalena Koziel, Piotr Kozlowski, Marta Olejniczak, and Wlodzimierz J. Krzyzosiak*

Laboratory of Cancer Genetics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland

Communicated by Nobuyoshi Shimizu

Although the trinucleotide repeats are present in the exons of numerous human genes, the allele distribution is not well known, and the factors responsible for their intergenic and intragenic variability are not well understood. We have analyzed the length and sequence variation within the most commonly occurring CAG and CTG repeats in a large number of human genes selected to contain the longest reported repeat tracts. Our study revealed that in genes other than those implicated in the Triplet Repeat Expansion Diseases (TREDs), the very long and highly polymorphic repeats are rather infrequent. The length of pure repeat tract in the most frequent allele was found to correlate well with the rate of the repeat length polymorphism, and CAA triplets were shown to be the most frequent CAG repeat interruptions. As both the CAG and CAA triplets code for glutamine, our results may suggest that the selective pressure disfavors the long uninterrupted CAG repeats in genes and transcripts but not the long normal polyglutamine tracts in proteins. This may indicate that hairpin structures formed in ssDNA and RNA by long pure CAG repeats would be selected against as they may impede normal cellular processes. Hum Mutat 28(5), 451–458, 2007. © 2007 Wiley-Liss, Inc.
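The distinction drawn above between a pure CAG tract and the glutamine run it encodes can be illustrated with a minimal Python sketch. It is not the authors' analysis pipeline: the function names and the example allele are hypothetical, and the sequence is assumed to be given in the reading frame of the repeat.

```python
# Both CAG and CAA encode glutamine (Q) in the standard genetic code.
GLN_CODONS = {"CAG", "CAA"}


def codons(seq):
    """Split an in-frame sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def longest_run(items, predicate):
    """Length of the longest stretch of consecutive items satisfying predicate."""
    best = current = 0
    for item in items:
        current = current + 1 if predicate(item) else 0
        best = max(best, current)
    return best


def tract_summary(seq):
    """Compare the longest pure CAG tract with the longest encoded glutamine run."""
    cds = codons(seq)
    return {
        "longest_pure_cag": longest_run(cds, lambda c: c == "CAG"),
        "longest_polyq": longest_run(cds, lambda c: c in GLN_CODONS),
    }


# Hypothetical allele (not taken from the paper): 8 CAG, one CAA interruption, 4 CAG.
example_allele = "CAG" * 8 + "CAA" + "CAG" * 4
print(tract_summary(example_allele))
# {'longest_pure_cag': 8, 'longest_polyq': 13}
```

In this made-up allele a single CAA interruption shortens the longest pure CAG tract to 8 repeats while leaving the encoded polyglutamine run at 13 residues, which is the kind of difference between the nucleic acid and protein levels that the abstract refers to.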
KEY WORDS: microsatellite genotyping; human genetic variation; repeat instability

INTRODUCTION

Microsatellites are tandemly repeated tracts of DNA composed of 1–6 base pair (bp)-long motifs, which occupy about 3% of the human genome [Subramanian et al., 2003]. Bioinformatic surveys of microsatellite abundance and density in the genome showed that they are roughly equally distributed across all chromosomes and that repeated trimers and hexamers are more frequent in exons than in introns, which implies that they have a biological function [Toth et al., 2000; Subramanian et al., 2003].

A number of studies were performed to reveal the function of triplet repeats in the genome. The main questions addressed were: How many genes contain trinucleotide repeat tracts and to what functional classes do these genes belong? What is the distribution of triplet repeats in the different functional regions of genes? Which amino acids do they encode? In answering these questions, numerous repeat-containing genes were identified. For example, more than 2,000 human genes were shown to harbor trinucleotide motifs repeated four or more times [Subramanian et al., 2003]. Of these, 171 genes contained repeat tracts composed of 10 or more repeated triplets. In another survey, 619 human mRNAs were identified that contained trinucleotide motifs repeated at least six times [Jasinska et al., 2003]. When their distribution in different mRNA regions was examined, it became apparent that they are strongly overrepresented in the 5′ untranslated region (5′UTR) as compared to the 3′UTR, which suggested a function in the regulation of translation initiation. The highest number of trinucleotide repeat tracts was found in the translated sequence, where they give rise to homopolymeric runs of amino acids in proteins.
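Surveys of this kind rest on scanning sequences for a trinucleotide motif repeated in tandem at least a given number of times. The following Python sketch shows one simple way to do this with a regular-expression backreference. It is an illustration only, not the method used in the cited surveys: the function name, the default threshold of six repeats, and the example sequence are assumptions.

```python
import re


def find_triplet_tracts(seq, min_repeats=6):
    """Return (start, motif, repeat_count) for each pure trinucleotide tract.

    Only uninterrupted tracts are reported, and the motif is given in the
    phase in which the scan first meets it (e.g. AGC rather than CAG).
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    seq = seq.upper()
    # ([ACGT]{3}) captures a candidate motif; the backreference \1 then
    # requires at least (min_repeats - 1) further tandem copies of it.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_repeats - 1))
    tracts = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip mononucleotide runs such as AAAAAA...
        tracts.append((match.start(), motif, len(match.group(0)) // 3))
    return tracts


# Hypothetical sequence (not from the paper): a 7-repeat CAG tract and a 4-repeat CTG tract.
example = "ATGGCC" + "CAG" * 7 + "CCGTTT" + "CTG" * 4 + "TAA"
print(find_triplet_tracts(example, min_repeats=6))  # [(6, 'CAG', 7)]
print(find_triplet_tracts(example, min_repeats=4))  # [(6, 'CAG', 7), (33, 'CTG', 4)]
```

Lowering `min_repeats` from six to four brings in the shorter CTG tract, which mirrors how the number of repeat-containing genes reported above depends on the length threshold chosen.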
Bioinformatic surveys of protein sequence databases revealed that about 20% of human proteins contain at least one tract composed of four or more identical amino acid residues [Karlin et al., 2002]. The most frequent amino acids in these stretches were glutamic acid (19.8%), leucine (19.0%), proline (18.2%), and alanine (16.9%). The biological function of mono-amino acid runs in proteins, in spite of many studies, remains poorly understood.

Published online 16 January 2007 in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/humu.20466
The Supplementary Material referred to in this article can be accessed at http://www.interscience.wiley.com/jpages/1059-7794/suppmat.
Received 9 July 2006; accepted revised manuscript 19 November 2006.
Grant sponsor: Sixth Research Framework Programme of the European Union, Project RIGHT; Grant number: LSHB-CT-2004 005276; Grant sponsor: State Committee for Scientific Research; Grant numbers: PBZ-KBN-124/P05/2004, PBZ-MNiI-2/1/2005; and 2PO5A08826.
*Correspondence to: Wlodzimierz J. Krzyzosiak, Laboratory of Cancer Genetics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14 St., 61-704 Poznan, Poland. E-mail: wlodkrzy@ibch.poznan.pl

An important feature of microsatellites is their enormous mutability, which leads to frequent length polymorphism of these sequences in a population. Owing to this property, they have become very useful markers in genetic mapping and population genetics [Weissenbach et al., 1992; Gyapay et al., 1994; Ellegren, 2004]. Microsatellites are also significant components of human genetic