De novo Transcriptome Assembly of the halophyte Suaeda fru)cosa Joann Diray-Arce 1 , Mark Clement 2 , Bilquees Gul 3 , Ajmal Khan 3 and Brent L. Nielsen 1 1 Department of Microbiology and Molecular Biology, Brigham Young University, Provo, Utah 2 Department of Computer Science, Brigham Young University, Provo, Utah 3 Institute of Sustainable Halophyte Utilization, University of Karachi, Pakistan Abstract Major efforts to improve salt tolerance in agricultural crops have been a6empted to address problems with increasing soil salinity due to irriga:on. Conven:onal crops have been used to breed salt tolerance through gene:c engineering, in which genes for salt tolerance were introduced directly to plants. Most of these tradi:onal cash crops can only endure rela:vely low concentra:ons of salt; therefore, it may be reasonable to study and domes:cate a na:ve, salt tolerant plant. We sequenced a cDNA library using Illumina plaForm from shoots and roots of a perennial saltGtolerant plant Suaeda fru)cosa. The reads were quality assessed and filtered using soJwares FastX toolkit, Sickle and Trimmoma:c then digitally normalized to yield a total of 99,577,045 good quality pairedGend reads. We assembled the transcriptome of this species using OasesGVelvet and different reconstruc:on modes of Trinity. Our op:miza:on included CAP3, CDHIT and BLASTGisofuse using the generated assembly to further reconstruct the con:gs, removing redundancy and keeping the longest transcript to reduce number of replicated transcripts without affec:ng the percentage of reads mapping back to the assembly. The op:mum assembly created by Trinity is annotated based on molecular func:on, biological processes and cellular components. Acknowledgements I would like to sincerely thank Dr. Brent Nielsen for his mentorship, supervision and support with the project. I would like to acknowledge Dr. Bilquees Gul and Dr. Ajmal Khan and The Institute of Sustainable Halophyte Utilization for their expertise and assistance in the halophyte project, Dr. Mark Clement, Paul Bodily, Stanley Fujimoto and Justin Page for their help with computational analysis, to Nielsen Lab members, and to the US State Department for providing resources for the project. Conclusion Transcriptome analysis has covered a considerable proportion of the Suaeda fruticosa transcriptome. The optimization methods of clustering the assemblies suggest better ways of removing errors and redundancy while reducing the number of transcripts without affecting the percentage of the reads mapping back to the assembly. These data provide a genetic resource for the discovery of potential genes involved in salt tolerance in this species and may serve as a useful source as a reference sequence for closely related taxa. We have compared results and assessed the contigs based on predicted protein coding regions and on the percentage of reads mapped back to the assembly using GSNAP. The transcripts have been annotated according to their different molecular and biological categories. Our results show the most preferred de-novo assembly of the transcriptome of a non- conventional crop plants comparing no salt and optimal salt concentration growth conditions. This project aims to study the adaptation of halophytes to salt tolerance and identify genes that are involved in this response mechanism. Methods and Materials Plant Growth and Harvest: Seeds of S. fru)cosa were collected in Karachi, Pakistan, surface sterilized and germinated in the BYU greenhouse at a thermoperiod of 20 C: 30 C (night:day) and a 12Gh photoperiod. References Brown C Titus, Howe A, Zhang Q, Pyrkosz AB, and Brom TH. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arxiv.org 2012:1203.4802. Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 2010, 11(1):94. Glenn EP, Brown JJ, Blumwald E: Salt Tolerance and Crop Potential of Halophytes. Critical Rev Plant Sci 1999, 18(2):227-255. Glenn E, Brown JJ, O'Leary JW: Irrigating Crops with Seawater. In: Scientific American 1998. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q et al: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech 2011, 29(7):644-652. Hameed A, Hussain T, Gulzar S, Aziz I, Gul B, Khan M: Salt tolerance of a cash crop halophyte Suaeda fruticosa: biochemical responses to salt and exogenous chemical treatments. Acta Physiol Plant 2012. Khan M, Ungar IA, Showalter AM: The effect of salinity on the growth, water status, and ion content of a leaf succulent perennial halophyte, Suaeda fruticosa (L.) Forssk. Journal of Arid Environments, 45(1):73-84. Schulz, M.H.; D.R. Zerbino, M. Vingron & E. Birney. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012, 28 (8): 1086-1092. Zerbino D.R., Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829 Results Plant Growth RNA Isola:on cDNA library prep Illumina RNA sequencing Bioinforma:c Analyses Raw reads from Illumina Trimmoma:c FastX toolkit and Sickle Digital Normaliza:on Assemblers (OasesGVelvet, Trinity) Op:miza:on of Assembly (CAP3, CDHIT, BLASTGisofuse) GSNAP (alignment of reads) Downstream analysis (BLAST, Annota:on, Differen:al Expression) Table 2 shows the results of the clustering methods comparison of CAP3, CDHITGEST and BLASTGisofuse. CAP3 follows an overlap consensus alignment of the transcripts to link con:gs and correct assembly errors. CDHITGEST clusters similar DNA sequences according to a userGdefined similarity threshold and BLASTGisofuse selects the best hit according to EGvalue threshold of e G10 using local BLAST and then keeps the longest transcript to be used for downstream analyses. These methods will reduce the number of transcripts without affec:ng the efficiency of the assemblers. To op:mize the assembly procedures, further clustering and removing redundancy is performed using CAP3, CDHITGEST and BLASTGisofuse soJware on the top assemblies that have mapped back the most reads. We have selected OasesG45, TrinityGMerge and Trinity with three modes to use these different methods. Figure 6 shows the N50 value (computed by sor:ng the con:gs from largest to smallest then determining the minimum set whose sizes total 50% of the assembly) of both the transcripts produced by the assemblies and their open reading frames. Figure 5 shows the percentage of reads aligning back to the assembly. The most preferred assembly should have most of the reads aligning back to the assembled transcriptome produced by Oases kGmers 41 and 45 and Trinity assemblies. GSNAP (Genomic ShortGread Nucleo:de Alignment Program) was used for this alignment method. Figure 7 shows the percentage of transcripts aligning back to the reads using clustering methods of CAP3, CDHITGEST and BLASTG isofuse. CDHITGEST and our designed BLASTGisofuse have produced lower number of transcripts but similar percentage of transcripts aligned back to the assembly. These indicate that these methods have successfully removed redundancy and errors while reducing the number of transcripts that would be beneficial for annota:on and iden:fica:on. Reads PreparaBon Libraries Number of reads Total reads Raw Reads R000 95,248,764 335,271,656 S000 75,414,804 R300 84,162,958 S300 80,445,130 FastX Toolkit/ Trimmoma:c R000 68,444,064 292,898,120 S000 68,872,348 R300 79,313,812 S300 76,267,896 Sickle Trimmed All 283,587,292 Digital Normaliza:on All 99,577,045 78146 98333 108112 98456 67491 58984 68678 68669 47894 53907 44324 32902 31855 24187 20628 11289 2153 794 102 1 185867 220370 185978 212365 450588 298116 319830 273824 364339 303897 219714 152655 176964 111448 87188 77672 57426 40534 33508 17286 3801 1509 207 1 934896 974952 935336 1004011 G200000 0 200000 400000 600000 800000 1000000 1200000 Number of transcripts and ORFs Transcripts and open reading frames of each assembly open reading frames transcripts 53.43 61.92 72.91 72.61 67.10 62.69 66.43 64.75 58.04 59.41 54.85 49.28 46.75 39.97 35.91 24.40 8.38 9.09 6.97 0.01 76.64 76.83 76.67 42.41 G10.00 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00 Percent Percentage of aligned reads at each kmer/assembler 661 1387 1519 1669 714 912 1329 1755 912 1719 1671 1328 1520 1150 1206 875 526 465 409 320 861 1109 862 940 828 1017 1047 1116 855 816 1161 1266 1076 1272 1254 1134 1155 1023 915 690 507 474 414 318 810 876 862 825 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Base pairs N50 values of transcript and ORF of each assembly transcript N50 ORF N50 Assembly # transcripts Length (bp) N50 (bp) # ORF N50 (bp) CAP3 Oases 45 26254 36253975 2016 14571 1194 Trinity Merge 68358 84290012 1711 33610 983 Trinity BuVerfly 10709 19147399 2404 9169 1179 Trinity Pasafly 8905 17640047 2719 8388 1185 Trinity Cufffly 10716 19169675 2407 9179 1179 CDHITZEST Oases 45 206999 188766870 1464 64581 1011 Trinity Merge 756976 453278702 862 145570 750 Trinity BuVerfly 752400 419039072 744 132413 726 Trinity Pasafly 730138 438462086 871 137499 759 Trinity Cufffly 752699 419284426 744 132497 726 BLASTZisofuse Oases 45 75215 64845846 1366 21690 1071 Trinity Merge 737535 426074382 796 133830 783 Trinity BuVerfly 487402 228360490 535 62314 687 Trinity Pasafly 484413 230501366 549 63718 675 Trinity Cufffly 487406 228396046 535 62333 687 Total annotaBon of GO MF: 55647 8251: other binding 7713: unknown funcBon 6623: transferase acBvity Total annotaBon of GO BP:130816 37439: cellular processes 27593: metabolic processes 8197: response to stress Total annotaBon of GO CC: 58501 9925: nucleus 8712: intracellular components 8700: cytoplasmic components 46.60 45.33 15.60 18.02 15.63 72.26 75.91 76.33 76.45 76.36 72.61 74.45 69.40 70.75 69.43 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00 Oases 45 Trinity Merge Trinity Bu6erfly Trinity Pasafly Trinity Cufffly Oases 45 Trinity Merge Trinity Bu6erfly Trinity Pasafly Trinity Cufffly Oases 45 Trinity Merge Trinity Bu6erfly Trinity Pasafly Trinity Cufffly Percent CAP3 CDHITZEST BLASTZisofuse Percentage of aligned reads a]er clustering Figure 2: Schema:c diagram for RNA sequencing procedures Figure 1a and 1b: Halophyte species Suaeda fru)cosa and three independent biological replicates of shoots and roots used for sequencing A B A B C Figure 3a: SoJware used for processing de novo assembly and downstream analysis Figure 3b shows the plot of total reads vs kept read pairs aJer digital normaliza:on Figure 3c shows the kGmer distribu:on plot of raw and normalized data of the RNAGseq libraries Table 1: Sta:s:cs of reads aJer using reads and adapter trimmer and digital normaliza:on method Figure 4 summarizes the number of transcripts and open reading frames (ORF) produced by each assembly. Trinity assemblies produced higher number of transcripts and open reading frames compared to velvetGoases assemblies. Optimization methods of clustering and redundancy removal Gene annotation using BLAST2Go A. Molecular Function B. Biological Processes other binding 15% unknown molecular funcBons 14% transferase acBvity 12% other enzyme acBvity 10% hydrolase acBvity 9% nucleoBde binding 7% DNA or RNA binding 7% kinase acBvity 7% protein binding 5% transporter acBvity 5% transcripBon factor acBvity 3% nucleic acid binding 3% other molecular funcBons 2% structural molecule acBvity 1% receptor binding or acBvity 0% nucleus 17% other intracellular components 15% other cytoplasmic components 15% chloroplast 10% other membranes 9% mitochondria 6% plasma membrane 6% extracellular 5% plasBd 4% cytosol 3% unknown cellular components 3% other cellular components 2% ribosome 2% Golgi apparatus 2% cell wall 1% ER 1% other cellular processes 29% other metabolic processes 21% response to stress 6% unknown biological processes 6% developmental processes 6% response to abioBc or bioBc sBmulus 6% protein metabolism 5% transport 5% cell organizaBon and biogenesis 5% other biological processes 4% signal transducBon 3% transcripBon,DNAZ dependent 2% DNA or RNA metabolism 1% electron transport or energy pathways 1% C. Cellular Components Figure 8: Gene annota:ons of transcripts from the op:mized deGnovo assembly were performed and categorized according to (a) Molecular Func:on (b) Biological Processes and (c) Cellular components using BLAST2Go with NCBI nonGredundant database set at threshold of e G10 . Bioinformatics Preparation and Analyses View publication stats View publication stats