Performance Improvement of Support Vector Machine (SVM) with Information Gain on Categorization of Indonesian News Documents

Adhy Rizaldy 1, Heru Agus Santoso 2
1,2 Researchers at the Computer Science Faculty of Dian Nuswantoro University, Jln. Imam Bonjol No. 207, Semarang
Correspondence: adhyc4n@gmail.com, heru.agus.santoso@dsn.udinus.ac.id

Abstract

The number of news articles keeps growing, which makes it difficult to group news under the appropriate label. It is therefore necessary to address the problem of grouping news by category, such as business news, political news, and sports news. The categorization of news documents belongs to the text classification domain, a Machine Learning topic that addresses this problem. Various algorithms have been used in previous studies, such as Bayesian techniques, k-Nearest Neighbor, Neural Networks, and Support Vector Machine (SVM). This study provides an understanding of the SVM method for news categorization on an Indonesian news dataset that contains several news categories. A problem in text classification is the number of features, which affects classification performance with SVM. Using Information Gain as feature selection improves accuracy compared with using no feature selection. Our model gives a satisfying result of 98.057% accuracy on Indonesian news classification, an improvement of 2.9 points over the 95.11% obtained by SVM without feature selection.

Keywords: news document classification, text categorization, SVM classification, Information Gain, Indonesia news classification

I. INTRODUCTION

Online news, a main source of daily information in our country, has been increasing significantly. News portals such as kompas.com and detik.com have become favorites and are accessed by citizens continuously. This has led them to add subdomains as new categories to reach different kinds of readers, for example hot.detik.com, Kompasiana of Kompas, Wolipop of detik.com, and many others.
This categorization can become wrong over time because of the huge number of articles published. We found several such cases, for example in the entertainment news of kompas.com: many articles from 2008 about the economy and government figures were placed in the 'entertainment.kompas.com' domain. In another case on detik.com, some news from 2010 about the 'olahraga' (sports) topic was placed in the 'nasional' (national) category. One factor behind this problem could be human error. To minimize it, media stakeholders need a technique to manage archived news files well. Research in text classification, specifically document classification, has been done to address this.

Many approaches have been used for the document classification problem on Indonesian news. Jaafar and colleagues classified Indonesian and Malaysian news from two web portals of each country with a kNN-based technique [1]. However, this nearest-neighbor algorithm suffers in performance when the training data is quite large, and it is not optimal when the training data is not large enough [1]. Asy'arie and a colleague applied Naive Bayes (NB) to 250 Indonesian news articles [2]. NB is somewhat sensitive to the amount of training document data and showed a performance problem [2].

Some papers have implemented Support Vector Machine as the main classifier. Lilliana and a colleague [3] performed classification with SVM on 180 news articles from kompas.com; their method produced 85% average accuracy. SVM was also used for Indonesian news document classification in Khodra's research [4], which handled 10,404 articles for categorization with satisfactory results. They compared several approaches to automatic multi-label classification. With SVMs as the binary classifier, this research obtained an F-measure of 78%. Across these papers [3, 4], the dataset sizes are quite different, and the reported result is lower for the larger dataset, although both used SVM classifiers. From this gap we conclude that a general problem of text classification is the high dimensionality of the data.
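The dimensionality issue is what feature selection addresses. As a minimal illustration, and not the authors' implementation, the following Python sketch scores terms by Information Gain (the entropy of the class labels minus the expected entropy after splitting on whether a term is present) and keeps the top-scoring terms. The toy corpus, the labels, and the function names are hypothetical; documents are assumed to be already tokenized into sets of words, and the SVM training step is omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) * H(C | t present) + P(not t) * H(C | t absent)]."""
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    total = len(labels)
    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:  # an empty split contributes nothing
            conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

def select_top_terms(docs, labels, k):
    """Return the k terms with the highest Information Gain (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]

# Hypothetical toy corpus: each document is a set of tokens (illustration only).
docs = [
    {"harga", "saham", "naik"},
    {"saham", "bank", "turun"},
    {"gol", "liga", "menang"},
    {"liga", "pemain", "gol"},
]
labels = ["bisnis", "bisnis", "olahraga", "olahraga"]

top_terms = select_top_terms(docs, labels, 3)
```

In a fuller pipeline, only the selected terms would be kept as features before the classifier is trained, which reduces the dimensionality of the input.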
A large number of features derived from a text dataset can degrade the result. To fix this problem, an implementation of SVM for automatic text classification needs dimensionality reduction. In this work we discuss feature selection as one way of reducing the dimensionality of massive data to improve a machine learning method. Several feature selection methods have been implemented. Information Gain was used in [5][6] with significantly

2017 International Seminar on Application for Technology of Information and Communication (iSemantic) 227