Clinical charge profiles prediction for patients diagnosed with chronic diseases using Multi-level Support Vector Machine

Wei Zhong a,⇑, Rick Chow a, Jieyue He b

a Division of Mathematics and Computer Science, University of South Carolina Upstate, SC 29303, USA
b School of Computer Science and Engineering, Southeast University, Nanjing 210096, China

Keywords: Support Vector Machine; Classification problem; Multi-level clustering algorithm; Chronic disease and parallel algorithm

Abstract

This research utilizes the national Healthcare Cost & Utilization Project (HCUP-3) databases to construct Support Vector Machine (SVM) classifiers to predict clinical charge profiles, including hospital charges and length of stay (LOS), for patients diagnosed with heart and circulatory disease, diabetes and cancer, respectively. Clinical charge profile predictions can provide relevant clinical knowledge for healthcare policy makers to effectively manage healthcare services and costs at the national, state, and local levels. Despite its solid mathematical foundation and promising experimental results, SVM is not favorable for large-scale data mining tasks since its training time complexity is at least quadratic in the number of samples. Furthermore, traditional SVM classification algorithms cannot build an effective SVM when different data distribution patterns are intermingled in a large dataset. To enhance SVM training for large, complex and noisy healthcare datasets, we propose the Multi-level Support Vector Machine (MLSVM), which organizes the dataset as clusters in a tree to produce better partitions for more effective SVM classification. The MLSVM model utilizes multiple SVMs, each of which efficiently learns the local data distribution patterns in one cluster. A decision fusion algorithm is used to generate an effective global decision that incorporates local SVM decisions at different levels of the tree.
Consequently, MLSVM can handle complex and often conflicting data distributions in large datasets more effectively than single-SVM based approaches and multiple SVM systems. Both the combined 5 × 2-fold cross-validation F test and the independent test show that the classification performance of MLSVM is much superior to that of CVM, ACSVM and CSVM based on three popular performance evaluation metrics. In this work, CSVM and MLSVM are parallelized to speed up the slow SVM training process for very large and complex datasets. Running time analysis shows that MLSVM can accelerate SVM's training process noticeably when the parallel algorithm is employed.

© 2011 Elsevier Ltd. All rights reserved.

Expert Systems with Applications 39 (2012) 1474–1483. doi:10.1016/j.eswa.2011.08.036
⇑ Corresponding author. Tel.: +1 864 503 5785. E-mail address: wzhong@uscupstate.edu (W. Zhong).

1. Introduction

Chronic diseases are among the leading causes of disability and death in the United States. This project focuses on the three most prevalent chronic diseases: heart and circulatory disease, diabetes and cancer. Chronic diseases account for 70% of deaths and approximately 78% of total healthcare spending. Despite dramatic improvements in therapies and treatments, the rate of chronic diseases has risen dramatically. The rising rate of chronic diseases is a crucial but frequently ignored contributor to rising medical expenditures.

Current strategies to address the escalating costs in healthcare for chronic diseases are based on small and localized data sets. Healthcare models developed from such localized data sets are used by individual healthcare systems to compare costs and to apply cost avoidance/reduction protocols. Typically, only local benchmarks are used in these models, reducing their applicability to the larger and more general population (Breault, Goodall, & Fos, 2002). These localized approaches for predicting comprehensive costs and outcomes within a single healthcare system often fail to produce valid and robust results at the national level. In contrast, this research utilizes the national Healthcare Cost & Utilization Project (HCUP-3) databases (http://www.ahrq.gov/data/hcup/#hcup) to construct Support Vector Machine (SVM) (Vapnik, 1998) classifiers to predict clinical charge profiles, including hospital charges and length of stay (LOS), for patients diagnosed with heart disease, diabetes and cancer, respectively. Prediction results generated from this research can provide relevant clinical knowledge for healthcare policy makers to effectively manage healthcare services and costs at the national, state and local levels.

SVM (Vapnik, 1998) has shown superior classification performance in various bioinformatics applications as compared to other classifiers. Despite its solid mathematical foundation and promising experimental results, SVM is not favorable for large-scale data