Combination of Several Bioinformatics Approaches for the Identification of New Putative Glycosyltransferases in Arabidopsis

Sara Fasmer Hansen,† Emmanuel Bettler,‡ Michaela Wimmerová,§ Anne Imberty,† Olivier Lerouxel,† and Christelle Breton*,†

CERMAV-CNRS (affiliated to ICMG), University of Grenoble, F-38041 Grenoble, France, CNRS, UMR 5086, IBCP, University of Lyon, F-69367 Lyon, France, and Department of Biochemistry and NCBR, Faculty of Sciences, Masaryk University, 61137 Brno, Czech Republic

Received September 24, 2008

Approximately 450 glycosyltransferase (GT) sequences, organized into 40 sequence-based families, have already been identified in the Arabidopsis genome, but the vast majority of these gene products remain biochemically uncharacterized open reading frames. Given the complexity of the cell wall carbohydrate network, it can be inferred that some of the biosynthetic genes have not yet been identified by classical bioinformatics approaches. With the aim of identifying new plant GT genes, we designed a bioinformatic strategy based on several remote homology detection methods that act at the 1D, 2D, and 3D levels. Together, these methods led to the identification of more than 150 candidate protein sequences. Among them, 20 are considered putative glycosyltransferases that should be investigated further, since known GT signatures were clearly identified.

Keywords: fold recognition • glycosyltransferase • DUF266 • DUF246 • profile HMMs • structural overlap

Introduction

Glycosyltransferases (GTs; EC 2.4.x.y) constitute a large family of enzymes that are involved in the biosynthesis of oligosaccharides, polysaccharides, and glycoconjugates. 1 These molecules of enormous diversity mediate a wide range of functions, from structure and storage to signaling. Particularly abundant are the GTs that transfer a sugar residue from an activated nucleotide sugar donor to specific acceptor molecules, forming glycosidic bonds.
Acceptors cover various chemical classes of compounds, including carbohydrates, proteins, lipids, DNA, and numerous small molecules such as antibiotics, flavonols, steroids, and so forth.

In eukaryotes, most of the glycosylation reactions that generate the diversity of oligosaccharide structures occur in the Golgi apparatus. In plants, building the complex polysaccharide network found in the cell wall requires a large set of GTs because of the diversity of the linkage types connecting the monosaccharides within these polymers. However, relatively few of these genes have been identified, and the molecular basis of cell wall biosynthesis remains poorly understood. 2 While cellulose (and also callose) is assembled at the plasma membrane and deposited directly into the wall, 3 other matrix polysaccharides are synthesized in the Golgi and delivered to the wall in secretory vesicles. 4 Golgi-resident GTs are typically type II membrane proteins with a large C-terminal globular catalytic domain facing the luminal side. 5,6 However, other protein architectures have been described for GTs, including type I membrane proteins (located in the endoplasmic reticulum) and integral membrane proteins (e.g., cellulose synthases). 2 Plant UGTs, which perform the numerous glycosylation reactions of secondary metabolites, are cytosolic enzymes, in contrast to their mammalian counterparts, which are ER-localized type I membrane proteins. 7

GTs have been classified into families on the basis of amino acid sequence similarities; the classification is available at http://www.cazy.org/. 8 At the present time (September 2008), the CAZy database comprises more than 34,000 confirmed and putative GT sequences that have been divided into 90 families (denoted as GTx). However, the number of families is continually increasing with the discovery and biochemical characterization of new GT genes. Approximately 450 Arabidopsis thaliana sequences have already been listed in this database, spread across 40 GT families.
Since fewer than 20% of these genes are annotated, identifying the precise function of every putative plant GT represents a huge task. By comparison, the human genome has only 240 GT genes, of which more than 80% are annotated. The large number of GT sequences found in plant genomes can be explained by the complex structure of the plant cell wall, the large number of glycosylated secondary metabolites, and also by genome duplication events. 9

Structural information is currently available for 30 GT families. In contrast to glycosylhydrolases, which adopt a large variety of folds, including all α, all β, or mixed α/β structures, GT folds have been observed to consist mainly of α/β/α sandwiches, 10 similar or very close to the Rossmann domain, a classical structural motif consisting of a six-stranded parallel β-sheet with 321456 topology and found in many nucleotide-binding proteins. 11 Until recently, only two structural super-

* To whom correspondence should be addressed. Christelle Breton, Molecular Glycobiology, CERMAV-CNRS, BP53, F-38041 Grenoble cedex 9, France. Tel: +33 4 7603 7635. Fax: +33 4 7654 7203. E-mail: breton@cermav.cnrs.fr.
† University of Grenoble.
‡ University of Lyon.
§ Masaryk University.

10.1021/pr800808m © 2009 American Chemical Society. Journal of Proteome Research 2009, 8, 743–753. Published on Web 12/16/2008.