Clustering of Protein Domains in the Human Genome

Lianne R. Mayor 1, Keiran P. Fleming 2, Arne Müller 2,3, David J. Balding 1 and Michael J.E. Sternberg 2*

1 Department of Epidemiology and Public Health, Imperial College, St Mary’s Campus, London W2 1PG, UK
2 Department of Biological Sciences and Centre for Bioinformatics, Imperial College, South Kensington Campus, London SW7 2AY, UK
3 Biomolecular Modelling Laboratory, Cancer Research UK, 44 Lincoln’s Inn Fields, London WC2A 3PX, UK

*Corresponding author

We present a systematic study of the clustering of genes within the human genome based on homology inferred from both sequence and structural similarity. The 3D-Genomics automated proteome annotation pipeline (www.sbg.bio.ic.ac.uk/3dgenomics) was utilised to infer homology for each protein domain in the genome, for the 26 superfamilies most highly represented in the Structural Classification Of Proteins (SCOP) database. This approach enabled us to identify homologues that could not be detected by sequence-based methods alone. For each superfamily, we investigated the distribution, both within and among chromosomes, of genes encoding at least one domain within the superfamily. The results indicate a diversity of clustering behaviours: some superfamilies showed no evidence of any clustering, and others displayed significant clustering either within or among chromosomes, or both. Removal of tandem repeats reduced the levels of clustering observed, but some superfamilies still displayed highly significant clustering. Thus, our study suggests that either the process of gene duplication, or the evolution of the resulting clusters, differs between structural superfamilies.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: tandem repeats; gene clustering; protein domains; genome evolution; bioinformatics

Introduction

Until recently, it was often supposed that there was no pattern to gene location in higher eukaryotic organisms.
However, several studies have challenged this view, indicating that gene location in higher eukaryotes may not be totally random after all. There is evidence for clustering of the Hox, haemoglobin and immunoglobulin genes, 1 and clustering of co-expressed genes has been described for Caenorhabditis elegans 1,2 (due to the presence of bacteria-like operons) 3 and Drosophila melanogaster. 4,5

Several reports have also demonstrated the apparent clustering of genes within the human genome. 6–9 A detailed analysis of chromosome 22 illustrated a number of duplicated regions. 10 Significant clustering has been found for those genes expressed in most tissues 6 (i.e. housekeeping genes) following examination of SAGE 11 (serial analysis of gene expression) data for 14 different tissue types. Genes expressed in a tissue-specific manner showed no clustering behaviour. A second study 8 focused on genes involved in the same metabolic pathway, as defined in KEGG. 12 Several completely sequenced eukaryotic genomes were studied and the data supported a high level of clustering in Homo sapiens, even more than that found in C. elegans and other simpler eukaryotes.

Eukaryotic gene clusters containing gene duplicates could be produced via a number of different mechanisms: tandem duplication of single genes, segmental duplication, polyploidization and replicative translocation, to name a few. 13 Research has shown that within higher eukaryotic genomes there are multiple copies of gene family members for many functional proteins, 14,15 with 40–50% of the genomes of C. elegans and D. melanogaster, for example, consisting of duplicated genes. 16 By far the most common fate following gene duplication is for one of the duplicates to be silenced. 17

Our own genome has undergone many duplication events throughout its evolution.
L.R.M. and K.P.F. contributed equally to this work.
Present address: Arne Müller, Aventis Pharma, Drug Safety Evaluation, 13 quai Jules Guesde, 94403 Vitry-sur-Seine Cedex, France.
E-mail address of the corresponding author: m.sternberg@imperial.ac.uk
doi:10.1016/j.jmb.2004.05.036
J. Mol. Biol. (2004) 340, 991–1004

Relatively large inter- and intra-chromosomal segmental duplications have been observed for all of the nuclear chromosomes, 18 accounting for approximately 5% of the genome. 19–22 These duplication