International Journal of Computer Applications (0975 – 8887), Volume 127, No. 16, October 2015

The Impact of Transformed Features in Automating the Swahili Document Classification

Thomas Tesha
College of Informatics and Virtual Education (CIVE), University of Dodoma (UDOM), P.O. Box 490, Dodoma, Tanzania

ABSTRACT
This paper reports experimental results aimed at identifying transformation techniques that improve features for the automated classification of Swahili documents, that is, at raising the classification rate by enhancing separability and accuracy. The experiments covered Relative Frequency (RF), Power Transformation (PT), and Relative Frequency with Power Transformation (RFPT); term weighting with TF-IDF and the absolute features (AF) were also studied. Feature dimensionality was reduced with the statistical technique of Principal Component Analysis. For learning algorithms, the Support Vector Machine (SVM) for classification and k-NN were used, and the effect of the features on classifier performance was evaluated with the micro-averaged F-measure. The extensive experimental results demonstrate that, unlike with k-NN, the RFPT features worked better with the SVM classifiers in improving the classification rate by enhancing document separability and accuracy in the automation of Swahili document classification.

Keywords
Machine learning algorithm, Support Vector Machine, Swahili, Swahili document classification

1. INTRODUCTION
Automated text classification has been one of the major current research areas in natural language processing. It is defined as the task of automatically assigning a set of documents to appropriate categories according to their content or topic [1], [15]. The basic process is learning a classification scheme from training examples to build a model that is then used to classify unseen textual documents [1].
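The basic process described above (learning a model from labelled training examples, then assigning categories to unseen documents) can be sketched minimally in pure Python. The toy Swahili snippets, the two categories, and the nearest-centroid decision rule below are all illustrative assumptions, not the paper's actual data or method:

```python
from collections import Counter
import math

# Hypothetical toy corpus of (document text, category) pairs;
# the texts and labels are illustrative only.
train = [
    ("mpira wa miguu timu ligi", "michezo"),          # sports
    ("timu ilishinda mechi ya mpira", "michezo"),
    ("serikali bunge sheria mpya", "siasa"),          # politics
    ("rais alihutubia bunge kuhusu sheria", "siasa"),
]

def bow(text):
    """Bag-of-words feature vector as a term -> count mapping."""
    return Counter(text.split())

def centroid(vectors):
    """Average several sparse count vectors into one centroid."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Training": build one centroid per category from the examples.
by_label = {}
for text, label in train:
    by_label.setdefault(label, []).append(bow(text))
model = {label: centroid(vs) for label, vs in by_label.items()}

def classify(text):
    """Assign an unseen document to the most similar category."""
    v = bow(text)
    return max(model, key=lambda label: cosine(v, model[label]))

print(classify("mechi ya mpira wa miguu"))  # -> michezo
```

This nearest-centroid rule stands in for the SVM and k-NN classifiers studied later; it only illustrates the train-then-predict workflow.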
With the current state of technology (the Internet and its supporting IT infrastructure, both software and hardware), massive data in the form of text, audio, video, electronic pictures, etc. (big data) are generated. The generation of this digital data is not itself a problem; the question that arises is how to organize and manage these piles of documents for easy retrieval and analysis. Swahili documents are among the many digital documents carrying a wealth of information, not only on the Internet but across all areas of technology, including libraries, and if retrieved easily and precisely they can help solve various problems [2]. Social media networks such as Facebook, Twitter, LinkedIn, and Skype also generate massive amounts of Swahili data all the time, as do offices, governments, organizations, and research activities. These digital documents can be harnessed and automatically organized using automated classifiers [2]. Studies on automated text classification [3-5] have been carried out extensively in languages other than Swahili, and approaches to improving features for automated text classification have been developed for those languages. In this work, the focus is an experimental study of the feature transformation techniques that can improve the classification rate in the automated classification of Swahili documents (text). The transformation techniques involved were RF, PT, and RFPT; experiments were also run with TF-IDF term weighting. For the machine learning algorithms, the Support Vector Machine for classification (SVM) and k-nearest neighbour (k-NN) were adopted. To understand this work and the method involved, familiarity with the SVM for classification is required, and a brief introduction follows; the subsequent two sections give brief introductions to kernels and k-NN.
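The feature transformations named above can be sketched on a toy term-count vector. RF here is taken to mean term count divided by document length, PT to mean raising counts to a fractional power (the square root is assumed below as a common choice; the paper's exact exponent is not given in this excerpt), RFPT to mean their composition, and TF-IDF is the standard weighting:

```python
import math
from collections import Counter

# Toy term-count vectors for a small hypothetical corpus;
# the terms and counts are illustrative, not the paper's data.
docs = [
    Counter({"habari": 4, "mpira": 1}),
    Counter({"habari": 1, "siasa": 3}),
]

def rf(counts):
    """Relative Frequency: each term count divided by document length."""
    n = sum(counts.values())
    return {t: c / n for t, c in counts.items()}

def pt(counts, p=0.5):
    """Power Transformation: raise each raw count to a power p.
    p = 0.5 (square root) is an assumed, common choice."""
    return {t: c ** p for t, c in counts.items()}

def rfpt(counts, p=0.5):
    """Relative Frequency followed by Power Transformation."""
    return {t: v ** p for t, v in rf(counts).items()}

def tfidf(counts, corpus):
    """Standard TF-IDF weighting over the toy corpus."""
    n_docs = len(corpus)
    def idf(t):
        df = sum(1 for d in corpus if t in d)
        return math.log(n_docs / df)
    return {t: c * idf(t) for t, c in counts.items()}

print(rf(docs[0]))           # {'habari': 0.8, 'mpira': 0.2}
print(rfpt(docs[0]))         # square roots of the relative frequencies
print(tfidf(docs[0], docs))  # 'habari' appears in every doc, so its idf is 0
```

Absolute features (AF) correspond to using the raw counts unchanged; these transforms rescale them so that document length and very frequent terms dominate less.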
2. SUPPORT VECTOR MACHINE FOR CLASSIFICATION
The SVM [14-16] is a learning algorithm that was at its inception introduced largely by [16] and his colleagues and extended to maturity by a number of researchers. SVMs have proven to work successfully in many applications, including text classification, optical character recognition, and image processing. They separate a given set of binary-labelled training data with a hyperplane that is maximally distant from the data (known as 'the maximal margin hyperplane'), a technique termed the optimal margin classifier. Although the method builds a hyperplane, it is not restricted to linearly separable data: in the nonlinear case, the nonlinearity is handled by the kernel technique, which maps the input space into a high-dimensional feature space that is linearly separable. Kernels are discussed briefly in the following section. The hyperplane found by the SVM in feature space corresponds to a nonlinear decision boundary in the input space.

Consider a binary classification problem with input vectors x = (x_1, ..., x_n), where for Swahili documents each x_i represents a document in a sample of labelled documents {(x_1, y_1), ..., (x_m, y_m)}, and each input x_i corresponds to a label y_i in {-1, +1}. Furthermore, by assuming all data points lie at distance at least 1 from the separating hyperplane, where for the support vectors the inequality becomes an equality (the data points touch the margin; see (1) and (2)), it is easy to show that the margin is given by 2/||w||:

w . x_i + b >= +1  if y_i = +1 ........ (1)
w . x_i + b <= -1  if y_i = -1 ........ (2)
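Constraints (1) and (2) and the resulting margin 2/||w|| can be checked numerically on a tiny example. The 2-D points and the hyperplane parameters w and b below are chosen by hand for illustration, not learned by an SVM solver:

```python
import math

# Tiny hypothetical linearly separable set of (point, label) pairs.
data = [((2.0, 2.0), +1), ((3.0, 3.0), +1),
        ((1.0, 1.0), -1), ((0.0, 0.0), -1)]

w = (1.0, 1.0)   # normal vector of the separating hyperplane w.x + b = 0
b = -3.0

def decision(x):
    """Signed distance proxy: w . x + b."""
    return w[0] * x[0] + w[1] * x[1] + b

# Constraints (1) and (2) combine to y_i (w . x_i + b) >= 1 for all i,
# with equality exactly on the support vectors.
for x, y in data:
    assert y * decision(x) >= 1 - 1e-9

# Support vectors are the points that touch the margin (equality holds).
support = [x for x, y in data if abs(y * decision(x) - 1) < 1e-9]

# Margin between the two bounding hyperplanes: 2 / ||w||.
margin = 2 / math.sqrt(w[0] ** 2 + w[1] ** 2)

print(support)  # (2,2) and (1,1) lie on the margin
print(margin)   # 2 / sqrt(2) = sqrt(2)
```

Maximizing this margin is equivalent to minimizing ||w|| subject to the constraints, which is the optimization an SVM solver actually performs.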