Hindawi Publishing Corporation
BioMed Research International
Volume 2013, Article ID 248648, 9 pages
http://dx.doi.org/10.1155/2013/248648

Research Article

Multiclass Prediction with Partial Least Square Regression for Gene Expression Data: Applications in Breast Cancer Intrinsic Taxonomy

Chi-Cheng Huang,1,2,3,4 Shih-Hsin Tu,4,5 Ching-Shui Huang,4,5 Heng-Hui Lien,3,5 Liang-Chuan Lai,6 and Eric Y. Chuang1

1 Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei 10617, Taiwan
2 Cathay General Hospital SiJhih, New Taipei, Taiwan
3 School of Medicine, Fu-Jen Catholic University, New Taipei, Taiwan
4 School of Medicine, Taipei Medical University, Taipei, Taiwan
5 Department of Surgery, Cathay General Hospital, Taipei, Taiwan
6 Graduate Institute of Physiology, National Taiwan University, Taipei City, Taiwan

Correspondence should be addressed to Eric Y. Chuang; chuangey@ntu.edu.tw

Received 24 October 2013; Accepted 23 November 2013

Academic Editor: Koichi Handa

Copyright © 2013 Chi-Cheng Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multiclass prediction remains an obstacle for high-throughput data analysis such as microarray gene expression profiles. Despite recent advancements in machine learning and bioinformatics, most classification tools were limited to the applications of binary responses. Our aim was to apply partial least square (PLS) regression for breast cancer intrinsic taxonomy, of which five distinct molecular subtypes were identified. The PAM50 signature genes were used as predictive variables in PLS analysis, and the latent gene component scores were used in binary logistic regression for each molecular subtype.
The 139 prototypical arrays for PAM50 development were used as the training dataset, and three independent microarray studies of Han Chinese origin were used for independent validation (n = 535). The agreement between PAM50 centroid-based single sample prediction (SSP) and PLS-regression was excellent (weighted Kappa: 0.988) within the training samples but deteriorated substantially in independent samples, which could be attributed to the much larger number of samples left unclassified by PLS-regression. If these unclassified samples were removed, the agreement between PAM50 SSP and PLS-regression improved markedly (weighted Kappa: 0.829, as opposed to 0.541 when unclassified samples were analyzed). Our study ascertained the feasibility of PLS-regression in multiclass prediction, and distinct clinical presentations and prognostic discrepancies were observed across breast cancer molecular subtypes.

1. Introduction

Multiclass prediction remains a challenge for high-throughput bioinformatics such as the analysis of microarray gene expression data. Numerous machine learning algorithms are readily available for high-throughput data analysis, most of which, however, are limited to classification or prediction with only two classes. This difficulty arises not only from the vast amount of data produced by high-throughput microarray or sequencing experiments but also from the highly correlated and nonstochastic nature of genetic/gene expression data. In real-world applications, dichotomous classifications such as cancer/normal, alive/dead, and responsive/resistant status are the ones most often encountered, and many machine learning algorithms and bioinformatics tools perform quite well on them, with sufficient discriminative power [1–3]. One way to tackle the n (experimental samples) < p (genomic/gene expression features) problem inherent in high-throughput microarray or sequencing techniques is to