Construction of mathematical model for high-level expression of foreign genes in pPIC9 vector and its verification

Bingli Wu a,b,1, Lei Cha a,1, Zepeng Du b,c, Xiaomin Ying a, Hua Li a, Liyan Xu d, Xiaofei Zheng e, Enmin Li b,*, Wuju Li a,*

a Center of Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
b Department of Biochemistry and Molecular Biology, Medical College of Shantou University, Shantou 515041, China
c Department of Biology, College of Science, Shantou University, Shantou 515041, China
d Department of Pathology, Medical College of Shantou University, Shantou 515041, China
e Beijing Institute of Radiation Medical Sciences, Beijing 100850, China

Received 16 December 2006
Available online 12 January 2007

Abstract

In this report, we introduce a mathematical model for high-level expression of foreign genes in the pPIC9 vector. First, we collected 40 heterologous genes expressed in the pPIC9 vector and classified them into a high-level expression group (expression level >100 mg/L, 12 genes) and a low-level expression group (expression level <100 mg/L, 28 genes). Then, the Naive Bayes method was used to construct the model, with the RNA secondary structure profile of the 3′-end of the foreign genes as features. The classification accuracy from leave-one-out cross-validation was 100%. Finally, another five genes collected from the literature were used to test the model, and four of them were predicted correctly. In addition, the model was verified by expressing the human neutrophil gelatinase-associated lipocalin (NGAL) gene, which reached an expression level of more than 100 mg/L. Therefore, we propose that the model can be used to predict the expression level of heterologous genes before experiments and to optimize experimental designs to obtain high-level expression.
Furthermore, we have developed a web server for evaluating and designing high-level expression of foreign genes, which is accessible at http://ppic9.med.stu.edu.cn/ppic9.
© 2007 Elsevier Inc. All rights reserved.

Keywords: RNA secondary structure; Protein expression; Pichia pastoris

Abbreviations: P. pastoris, Pichia pastoris; NGAL, neutrophil gelatinase-associated lipocalin; MFEM, minimum free energy matrix; UTR, untranslated region.
* Corresponding authors. Fax: +86 010 68213039 (W. Li), +86 754 8900247 (E. Li). E-mail addresses: mnli@stu.edu.cn (E. Li), liwj@nic.bmi.ac.cn (W. Li).
1 These authors contributed equally to this work.
0006-291X/$ - see front matter. doi:10.1016/j.bbrc.2007.01.002

Compared with the Escherichia coli expression system, the methylotrophic yeast Pichia pastoris expression system has many advantages [1–4]. For example, the expressed protein can be glycosylated and folded correctly, and these characteristics are especially important for recombinant cytokines used in patients. The P. pastoris expression system has been used extensively since it was developed, and many heterologous proteins have been expressed [1]. However, we still cannot predict the expression level of heterologous genes before experiments, because many factors affect the expression level [4]. These factors include the copy number of the expression cassette, the secondary structure of the mRNA 5′- and 3′-untranslated regions (UTR), the translational start codon (AUG) context, the A + T composition of the cDNA, the nature of the secretion signal, medium and growth conditions, fermentation parameters, vectors, and so on. Together they make the expression of heterologous proteins in P. pastoris very complicated, and it is very difficult to analyze the expression level quantitatively.

Many papers have shown that mRNA structure correlates with protein level [5–9]. But all of them only
www.elsevier.com/locate/ybbrc Biochemical and Biophysical Research Communications 354 (2007) 498–504