Income Prediction via Support Vector Machine

Alina Lazar
Computer Science and Information Systems Department
Youngstown State University
Youngstown, OH 44555
alazar@cis.ysu.edu

Abstract

Principal component analysis and support vector machine methods are employed to generate and evaluate income prediction data based on the Current Population Survey provided by the U.S. Census Bureau. A detailed statistical study targeted at relevant feature selection is found to increase efficiency and even improve classification accuracy. A systematic study is performed on the influence of this statistical narrowing on the grid parameter search, training time, accuracy, and number of support vectors. Accuracy values as high as 84% on a test population are obtained with a reduced set of parameters, while the computational time is reduced by 60%. Tailoring computational methods around specific real data sets is critical in designing powerful algorithms.

Keywords: support vector machine, principal component analysis, classification, accuracy, ROC curve.

1 Introduction

Supervised learning methods based on statistical learning theory, for classification and regression, provide good generalization and classification accuracy on real data. However, their inherent trade-off is their computational expense. Recently, support vector machines (SVM) [1]-[4] have become a popular learning tool because they translate the input data into a larger feature space in which the instances become linearly separable, while the kernel trick keeps this mapping computationally efficient. In SVM methods the input data are recoded through a kernel, which can be regarded as a similarity measure; the kernel implicitly corresponds to a mapping function Φ. Even though the mathematics behind the SVM is straightforward, finding the best choices for the kernel function and its parameters can be challenging when the method is applied to real data sets.
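The kernel-as-similarity idea can be made concrete with a minimal sketch (not part of the original study): the Gaussian kernel k(x, z) = exp(−γ‖x − z‖²) returns a value in (0, 1] that plays the role of Φ(x)·Φ(z) without Φ ever being computed explicitly. The γ value below is illustrative only.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel: similarity of x and z, in (0, 1]."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

a = np.array([1.0, 2.0])
b = np.array([1.0, 2.0])
c = np.array([5.0, -1.0])

# Identical points have similarity exactly 1;
# distant points have similarity approaching 0.
print(rbf_kernel(a, b))  # 1.0
print(rbf_kernel(a, c))  # ~3.7e-06
```

Because only these pairwise similarities enter the SVM optimization, the feature space Φ maps into may be very high-dimensional (infinite-dimensional for the RBF kernel) at no extra cost.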
Usually, the recommended kernel function for nonlinear problems is the Gaussian radial basis function, because it behaves like the sigmoid kernel for certain parameter values and requires fewer parameters than a polynomial kernel. The kernel parameter γ and the parameter C, which controls the trade-off between the complexity of the decision function and the minimization of the training error, can be determined by running a two-dimensional grid search: values for the parameter pairs (C, γ) are generated over a predefined interval with a fixed step, the performance of each combination is computed, and the best-performing pair is selected. However, due to memory limitations and the quadratic growth of the kernel matrix with the number of training examples, it is not practical to grid search for the SVM's parameters on data sets with more than about 10^3 instances. Moreover, the non-sparse property of the solution leads to a very slow evaluation process. Thus, for larger data sets only a randomly selected subset of the training instances is used for the grid search.

A supplementary data reduction [5] can be performed on the variables, or features, of the data set. Redundant or highly correlated features can be replaced with a smaller number of uncorrelated features that capture most of the information. This can be done by applying Principal Component Analysis (PCA) before running the SVM algorithm.

The experiments presented in this paper used the Current Population Survey (CPS) database provided by the U.S. Census Bureau [6]. The CPS has been conducted for more than 50 years and collects information about the social, demographic, and economic characteristics of the U.S. labor force aged 16 and older. The data collected each month is used to produce reports about employment, unemployment, and earnings; it also includes statistics about various social factors, from voting to smoking.
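The grid-generation step described above can be sketched as follows. A logarithmic (powers-of-two) grid is a common convention for (C, γ) searches; the exact ranges and step below are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

# Candidate values on a base-2 logarithmic grid with a fixed step
# (exponent step of 2); the ranges here are illustrative only.
C_values = 2.0 ** np.arange(-5, 16, 2)       # 2^-5, 2^-3, ..., 2^15
gamma_values = 2.0 ** np.arange(-15, 4, 2)   # 2^-15, 2^-13, ..., 2^3

# Every (C, gamma) pair is a candidate; each would be scored by
# cross-validated accuracy on the (sub)sampled training set.
grid = [(C, g) for C in C_values for g in gamma_values]
print(len(grid))  # 11 * 10 = 110 candidate pairs
```

Since each of the 110 candidates requires training a full SVM, the quadratic cost of the kernel matrix is paid repeatedly, which is why the grid search is restricted to a random subsample of the training data.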
Government policymakers and legislators use the statistics generated from the CPS data as indicators of the economic and social situation and for the planning and evaluation of many government programs. The data is publicly available free of charge, a fact that has encouraged its use in various social and economic studies [7], [8]. Given the large number of variables included and the resulting size of the data set, it is far-fetched to believe that its entire value has been fully exploited. This has motivated its active use by the machine learning and knowledge discovery communities as a platform for testing various data mining methods, including neural networks, nearest neighbor, decision trees, and lately support vector machines.
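The PCA-based feature reduction described earlier can be sketched with synthetic data (the CPS data itself is not reproduced here). Two pairs of nearly duplicated features collapse onto two principal components that carry almost all of the variance; the data shapes and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 4 raw features, but two pairs are highly correlated,
# so roughly 2 uncorrelated components capture almost all the variance.
base = rng.normal(size=(500, 2))
noise = 0.01 * rng.normal(size=(500, 4))
X = np.column_stack([base[:, 0], base[:, 0],
                     base[:, 1], base[:, 1]]) + noise

# PCA: eigendecomposition of the covariance matrix of centered data.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # returned in ascending order
order = np.argsort(eigvals)[::-1]         # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
k = 2
X_reduced = Xc @ eigvecs[:, :k]           # project onto top-k components

print("variance explained by top 2 components:", explained[:2].sum())
```

Feeding `X_reduced` instead of `X` to the SVM halves the number of features here; on real survey data the same projection both shrinks the kernel computation and removes redundant, correlated inputs.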