Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation

Patrick Ng 1,3, Chia-Lin Wei 1,3, Wing-Kin Sung 1, Kuo Ping Chiu 1, Leonard Lipovich 1, Chin Chin Ang 1, Sanjay Gupta 1, Atif Shahab 2, Azmi Ridwan 2, Chee Hong Wong 2, Edison T Liu 1 & Yijun Ruan 1

We have developed a DNA tag sequencing and mapping strategy called gene identification signature (GIS) analysis, in which 5′ and 3′ signatures of full-length cDNAs are accurately extracted into paired-end ditags (PETs) that are concatenated for efficient sequencing and mapped to genome sequences to demarcate the transcription boundaries of every gene. GIS analysis is potentially 30-fold more efficient than standard cDNA sequencing approaches for transcriptome characterization. We demonstrated this approach with 116,252 PET sequences derived from mouse embryonic stem cells. Initial analysis of this dataset identified hundreds of previously uncharacterized transcripts, including alternative transcripts of known genes. We also uncovered several intergenically spliced and unusual fusion transcripts, one of which was confirmed as a trans-splicing event and was differentially expressed. The concept of paired-end ditagging described here for transcriptome analysis can also be applied to whole-genome analysis of cis-regulatory and other DNA elements and represents an important technological advance for genome annotation.

With the completion of sequencing of the human 1–3 and other mammalian genomes 4,5, scientists have turned their attention to the annotation of genomes for functional content, including gene-coding transcription units and cis-acting regulatory and epigenetic elements that modulate gene expression 6. Current approaches to genome annotation include the use of cDNA 7 and microarray data 8,9 as well as ab initio computer predictions 10,11 and comparison of different vertebrate genomes to identify evolutionarily conserved regions 12,13.
Despite considerable success, there are limitations to the current transcript-targeted approaches. Fundamentally, there is no method that can rapidly, efficiently and accurately characterize entire transcriptomes across a large number of cell samples and biological conditions (reviewed in ref. 14). The full-length cDNA (flcDNA) sequencing approach 15,16 provides substantial information, but it is labor-intensive and too costly for the in-depth analysis of multiple transcriptomes. cDNA short-tag strategies, such as serial analysis of gene expression (SAGE) 17,18 and massively parallel signature sequencing (MPSS) 19, can be used to efficiently quantify known transcripts but provide only limited information about transcript structure.

To address these problems, we developed an approach that combines the efficiency of short-tag methods with the accuracy provided by flcDNA characterization, to exploit the information contained in assembled genome sequences. The core concept is to obtain only linked 5′ and 3′ short tag sequences for each transcript, map these terminal ‘signatures’ to the genome and thereby infer the complete transcription units by the genome sequence encompassed between these 5′ and 3′ signatures.

RESULTS

Construction of GIS paired-end ditags

As an interim procedure we developed the 5′ LongSAGE and 3′ LongSAGE protocols that extracted 20-base-pair (bp) 5′ and 3′ terminal tags separately 20. With this new capability, we proceeded to design a cloning strategy that would covalently link the 5′ and 3′ signatures of each full-length transcript into a ditag structure (Fig. 1). Such PETs representing individual transcripts would then be concatenated for cloning and high-throughput sequencing. A quality single-pass sequencing read (~700 bp) would, on average, enable the analysis of about 15 such PET sequences. The PET sequences were then mapped directly to the genome to define the transcription start sites and polyadenylation sites of individual transcripts.
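The core idea of inferring a transcription unit from a pair of mapped terminal signatures can be illustrated with a minimal Python sketch. It is not the authors' mapping pipeline: the `TagHit` record, the `infer_transcription_unit` function, the 0-based half-open coordinates and the `max_span` cutoff are all hypothetical choices made for illustration. The sketch assumes that each signature has already been aligned to a single genomic location.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TagHit:
    """Genomic location of one mapped signature (hypothetical record)."""
    chrom: str
    strand: str  # "+" or "-"
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # 0-based, exclusive (assumed convention)


def infer_transcription_unit(
    five: TagHit,
    three: TagHit,
    max_span: int = 1_000_000,  # illustrative cutoff, not taken from the text
) -> Optional[Tuple[str, str, int, int]]:
    """Return (chrom, strand, start, end) spanned by a 5'/3' signature pair.

    Returns None when the two hits cannot plausibly delimit one transcript:
    different chromosome, different strand, wrong 5'->3' order, or a span
    longer than max_span.
    """
    if five.chrom != three.chrom or five.strand != three.strand:
        return None
    if five.strand == "+":
        # On the plus strand the 5' signature lies at the lower coordinate.
        ordered = five.start < three.start
        start, end = five.start, three.end
    else:
        # On the minus strand the 5' signature lies at the higher coordinate.
        ordered = three.start < five.start
        start, end = three.start, five.end
    if not ordered or end - start > max_span:
        return None
    return (five.chrom, five.strand, start, end)


if __name__ == "__main__":
    unit = infer_transcription_unit(
        TagHit("chr1", "+", 1000, 1018),
        TagHit("chr1", "+", 5000, 5016),
    )
    print(unit)
```

In this sketch, a pair that fails the consistency checks is simply discarded. A real analysis would also need a policy for signatures that map to several genomic locations, which is outside the scope of this example.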
To demonstrate this strategy, we generated 116,252 PETs that represented 63,467 nonredundant PET sequences from the E14 mouse embryonic stem cell line.

PUBLISHED ONLINE 9 JANUARY 2005; DOI:10.1038/NMETH733
1 Genome Institute of Singapore, 60 Biopolis Street, Genome #02-01, Singapore 138672. 2 Bioinformatics Institute, 30 Biopolis Street, Matrix #08-01, Singapore 138671. 3 These authors contributed equally to this work. Correspondence should be addressed to Y.R. (ruanyj@gis.a-star.edu.sg) and E.T.L. (liue@gis.a-star.edu.sg).
NATURE METHODS | VOL.2 NO.2 | FEBRUARY 2005 | 105 ARTICLES

Quality and mapping specificity of ditags

A typical PET structure should contain an 18-nucleotide (nt) 5′ signature (positions 1–18) and an 18-nt 3′ signature (positions 19–36) including a residual AA dinucleotide derived from the mRNA poly(A) tail that indicates ditag orientation (Supplementary Fig. 1 online). The PET sequences were mapped to the mouse genome assembly (mm3; http://hgdownload.cse.ucsc.edu/goldenPath/mmFeb2003/chromosomes) by a suffix tree–derived alignment algorithm (W.-K.S. et al., unpublished data). When mapped correctly to the genome sequences, nucleotides 1–18 in a ditag sequence should be aligned with the 5′ boundary and nucleotides 19–34 with the 3′ boundary of the corresponding