Hindawi Publishing Corporation
BioMed Research International
Volume 2013, Article ID 248648, 9 pages
http://dx.doi.org/10.1155/2013/248648
Research Article
Multiclass Prediction with Partial Least Square
Regression for Gene Expression Data: Applications in
Breast Cancer Intrinsic Taxonomy
Chi-Cheng Huang,1,2,3,4 Shih-Hsin Tu,4,5 Ching-Shui Huang,4,5 Heng-Hui Lien,3,5 Liang-Chuan Lai,6 and Eric Y. Chuang1
1 Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei 10617, Taiwan
2 Cathay General Hospital SiJhih, New Taipei, Taiwan
3 School of Medicine, Fu-Jen Catholic University, New Taipei, Taiwan
4 School of Medicine, Taipei Medical University, Taipei, Taiwan
5 Department of Surgery, Cathay General Hospital, Taipei, Taiwan
6 Graduate Institute of Physiology, National Taiwan University, Taipei City, Taiwan
Correspondence should be addressed to Eric Y. Chuang; chuangey@ntu.edu.tw
Received 24 October 2013; Accepted 23 November 2013
Academic Editor: Koichi Handa
Copyright © 2013 Chi-Cheng Huang et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Multiclass prediction remains an obstacle for high-throughput data analysis such as microarray gene expression profiles. Despite
recent advancements in machine learning and bioinformatics, most classification tools are limited to applications with binary
responses. Our aim was to apply partial least square (PLS) regression to breast cancer intrinsic taxonomy, in which five distinct
molecular subtypes have been identified. The PAM50 signature genes were used as predictive variables in PLS analysis, and the latent
gene component scores were used in binary logistic regression for each molecular subtype. The 139 prototypical arrays used for PAM50
development served as the training dataset, and three independent microarray studies of Han Chinese origin were used for
independent validation (n = 535). The agreement between PAM50 centroid-based single sample prediction (SSP) and PLS-regression
was excellent (weighted Kappa: 0.988) within the training samples but deteriorated substantially in independent samples,
which could be attributed to the many more samples left unclassified by PLS-regression. If these unclassified samples were removed, the
agreement between PAM50 SSP and PLS-regression improved enormously (weighted Kappa: 0.829 as opposed to 0.541 when
unclassified samples were analyzed). Our study ascertained the feasibility of PLS-regression in multi-class prediction, and distinct
clinical presentations and prognostic discrepancies were observed across breast cancer molecular subtypes.
1. Introduction
Multi-class prediction remains a challenge for high-throughput
bioinformatics such as analysis of microarray gene
expression data. Numerous machine learning algorithms are
readily available for high-throughput data analysis, most
of which, however, are limited to classification
or prediction with only two classes. This difficulty
arises not only from the vast amount of data produced
by high-throughput microarray or sequencing experiments
but also from the highly correlated and nonstochastic nature of
genetic/gene expression data. In real-world applications,
dichotomous classifications such as cancer/normal, alive/dead,
and responsive/resistant status are the most commonly encountered,
and many machine learning algorithms and bioinformatics
tools perform quite well with sufficient discriminative power
[1–3].
One way to tackle the n (experimental samples) <
p (genomic/gene expression features) problem inherent in
high-throughput microarray or sequencing techniques is to