Mining the optimal clustering of people's characteristics of health care choices

Chieh-Yu Liu a,b,*, Jih-Shin Liu c

a Department of Nursing, National Taipei College of Nursing, 365, Min-der Rd., Bei-Tou District, Taipei City 112, Taiwan, ROC
b Department of Health Care Management, National Taipei College of Nursing, 365, Min-der Rd., Bei-Tou District, Taipei City 112, Taiwan, ROC
c Division of Biostatistics and Bioinformatics, National Health Research Institutes (NHRI), 35, Keyan Rd., Zhunan Town, Mioli County 350, Taiwan, ROC

Article info

Keywords: Health care; k-Means cluster analysis; v-Fold cross-validation; Multiple Correspondence Analysis

Abstract

Asian countries have had a multi-choice health care environment for many years. In Taiwan, people's use of multiple kinds of health care has placed a much heavier financial burden on the National Health Insurance Program (NHIP) in recent years, so investigating the characteristics of people who use multiple health care resources has gained increasing importance for health authorities. In this study, we investigated the socioeconomic and demographic characteristics that underlie people's choice of health care, using a population-representative database. We propose a novel methodology that incorporates k-means cluster analysis with v-fold cross-validation into Multiple Correspondence Analysis (MCA). This methodology helps to find the optimal attribute clustering of multiple health care utilization. By using it, researchers can not only avoid the ambiguities in identifying clusters that arise from traditional hierarchical cluster analysis (HCA), but also provide more solid, evidence-based analysis for health policy making.

Crown Copyright © 2010 Published by Elsevier Ltd. All rights reserved.

1. Introduction

Asian countries have long had a multi-choice health care environment.
Although the prevalence of complementary and alternative medicine (CAM) use, and the reasons people use it, have been extensively studied in Western countries in recent decades (Astin, 1998; Barnes, Powell-Griner, McFann, & Nahin, 2004; Eisenberg et al., 1993; Honda & Jacobson, 2005), the topic has received much less attention in Asian countries. In Taiwan, the financial burden on the National Health Insurance Program (NHIP) has grown much heavier in recent years. Investigating the characteristics of people who use more NHI resources has therefore become an important health policy topic, and investigating the socioeconomic and demographic factors underlying people's choices among different types of health care is gaining increasing importance.

Liu, Chen, Weng, and Liu (2006) previously used Multiple Correspondence Analysis (MCA) and traditional hierarchical cluster analysis (HCA) to identify the attribute clusters of people's characteristics among different health care choices. Although this methodology has been popularized by social science researchers (Guinot et al., 2001; Milani, Cortinovis, Rainisio, Fognini, & Marubini, 1983; Minoura, 1987; Roux et al., 1995), it has two drawbacks: (1) as the number of study subjects increases, it becomes more difficult to identify clusters from the tree diagram derived from HCA; and (2) researchers identify the resulting clusters from traditional HCA according to their own subjective interpretation. These drawbacks cast doubt on the validity of the method.

Correspondence analysis (CA) has been widely applied to the analysis of questionnaire data, especially data from multiple-choice questions (Greenacre, 1984). CA can be divided into Simple Correspondence Analysis (SCA) and Multiple Correspondence Analysis (MCA). SCA is a descriptive or exploratory technique designed to analyze two-way tables containing some measure of correspondence between the rows and columns.
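As a rough illustration of how SCA treats a two-way table, the sketch below follows the standard textbook formulation (singular value decomposition of the standardized residuals of the contingency table). It is not taken from this study: the function name `simple_correspondence_analysis`, the counts, and the suggested meaning of the rows and columns are all hypothetical.

```python
import numpy as np

def simple_correspondence_analysis(table):
    """Sketch of SCA on a two-way contingency table (textbook SVD formulation)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                 # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)               # row masses
    c = P.sum(axis=0)               # column masses
    expected = np.outer(r, c)       # frequencies expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the row and column categories
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    inertia = sv ** 2               # principal inertia carried by each dimension
    return row_coords, col_coords, inertia

# Hypothetical counts: rows could be three respondent groups,
# columns three types of health care (illustrative numbers only).
counts = np.array([[30, 10, 5],
                   [12, 25, 8],
                   [6, 9, 20]])
row_coords, col_coords, inertia = simple_correspondence_analysis(counts)
print("share of inertia per dimension:", inertia / inertia.sum())
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives the kind of graphical display of associations that correspondence analysis is typically used for. The total inertia equals the Pearson chi-square statistic of the table divided by the sample size, which is a convenient sanity check on the decomposition.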
The SCA results provide information similar in nature to that produced by factor analysis techniques, allowing one to explore the structure of the categorical variables included in the table or space. MCA is the generalization of SCA to more than two variables. Interpretation of MCA results is similar to that of SCA results, but the increased complexity of the data requires an added dimension to any analytical process.

Expert Systems with Applications 38 (2011) 1400–1404 (journal homepage: www.elsevier.com/locate/eswa)
0957-4174/$ - see front matter. doi:10.1016/j.eswa.2010.07.044
* Corresponding author at: Department of Nursing, National Taipei College of Nursing, 365, Min-der Rd., Bei-Tou District, Taipei City 112, Taiwan, ROC. Tel.: +886 2 2822 7101x3312; fax: +886 2 2821 3233. E-mail address: chiehyu@ntcn.edu.tw (C.-Y. Liu).

The purpose of this paper is to propose a novel methodology that incorporates k-means cluster analysis with v-fold cross-validation into MCA to determine the optimal clustering of people's characteristics among different health care choices. The algorithm consists of three parts: (1) obtaining a set of coordinate values that summarize respondents' demographic and socioeconomic characteristics associated with their health care choices, so that associations between these characteristics and health care choices can be displayed graphically; (2) using the k-means cluster method (Everitt, 1993) to