International Journal of Computing and Digital Systems
ISSN (2210-142X)
Int. J. Com. Dig. Sys. 9, No. 6 (Nov-2020)
E-mails: hemrajsingh.gheeseewan@umail.uom.ac.mu, s.pudaruth@uom.ac.mu
http://journals.uob.edu.bh

Categorisation of Computer Science Research Papers using Supervised Machine Learning Techniques

Hemrajsingh Gheeseewan 1 and Sameerchand Pudaruth 1
1 Department of Information and Communication Technologies, University of Mauritius, Mauritius

Received 21 Mar. 2020, Revised 14 May. 2020, Accepted 1 Aug. 2020, Published 1 Nov. 2020

Abstract: In this modern era of bleeding-edge technologies, information creation, sharing and consumption are rising at an exponential rate. In the same vein, there has been a continued increase in the amount of research being published worldwide, and a large proportion of it is in the computer science field. There is an urgent need to provide some level of order in this huge jungle of data. Thus, in this article, we have used eight supervised machine learning techniques to classify computer science research papers. Machine learning techniques, namely logistic regression, multinomial naive Bayes, Gaussian naive Bayes, support vector machines, k-nearest neighbours, decision tree, random forest and deep learning neural networks, were trained to classify research papers into appropriate categories. For this purpose, a labelled dataset of 69776 papers was downloaded from arXiv and these were classified into 35 categories. The best f1-score of 0.60 was obtained by the logistic regression classifier. It was also the fastest machine learning classifier. The best f1-score from the deep learning network was 0.59. Using only the list of references for classification produced an f1-score of 0.57, but the training and testing time was significantly lower. This shows that it is possible to use only references to classify computer science research papers. The f1-score for abstracts only was 0.52.
Computer science papers often do not fall into neat categories. They are often multi-topical. Thus, in the future, we intend to perform multi-label classification on the same dataset.

Keywords: Document Classification, Computer Science, Machine Learning, Logistic Regression, Deep Learning

1. INTRODUCTION

The continued and relentless digitisation of society has led to a massive increase in the volume of data being produced, and this volume is increasing exponentially year after year. Such data can generally be categorised as either structured or unstructured. Structured data have a fixed format and are usually stored in electronic databases. Such databases can be easily queried to get relevant information. Because of their predictable structure, structured data can be retrieved with simple and straightforward search algorithms [1]. On the other hand, unstructured data can be generated by humans, for example through text files, emails, social media posts, satellite footage and surveillance footage, but also by sensors [2][3].

White et al. showed that the number of academic publications worldwide almost doubled from 1.3 million to 2.3 million between 2004 and 2014 [4]. The United States of America (USA) and China are at the top of the list with 19% and 17% of the world’s total, respectively [4]. Publications in the field of Computer Science are ranked fifth and account for 8.9% of all research publications. Hänig et al. argued that new text mining techniques must be developed to extract intelligence, share information and deliver value from unstructured data, as these data cannot be analysed, visualised or sorted in the same way as structured data [5].

Text document classification is the procedure of allocating textual documents to one or more classes or categories using a model constructed from training data.
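
As an illustration of this definition, the following minimal sketch builds a model from labelled training documents and uses it to assign a category to an unseen document. It is not the implementation used in this paper: it is a small multinomial naive Bayes classifier written in plain Python, and the tokenizer, the function names (tokenize, train, predict), the smoothing value and the four toy documents are all assumptions made for the example. The labels cs.LG and cs.CR are used only as examples of arXiv-style category names.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(documents, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with add-alpha smoothing."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    model = {"alpha": alpha, "vocabulary": vocabulary, "classes": {}}
    for label, n_docs in class_doc_counts.items():
        model["classes"][label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "word_counts": word_counts[label],
            "total_words": sum(word_counts[label].values()),
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    alpha = model["alpha"]
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, stats in model["classes"].items():
        score = stats["log_prior"]
        denominator = stats["total_words"] + alpha * vocab_size
        for token in tokenize(text):
            if token not in model["vocabulary"]:
                continue  # ignore words never seen during training
            score += math.log((stats["word_counts"][token] + alpha) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data (hypothetical, not taken from the paper's dataset).
train_docs = [
    "neural network training with gradient descent",
    "deep learning model for image recognition",
    "public key encryption and secure protocols",
    "attack on block cipher encryption scheme",
]
train_labels = ["cs.LG", "cs.LG", "cs.CR", "cs.CR"]

model = train(train_docs, train_labels)
prediction = predict(model, "gradient descent for deep neural model")
print(prediction)
```

With realistic data, the same two steps (fit on labelled documents, then predict on unseen ones) apply unchanged; only the feature extraction and the classifier would typically be replaced by more robust library implementations.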
Many supervised machine learning approaches exist, including logistic regression (LR), k-nearest neighbour (KNN), support vector machines (SVM), decision tree (DT), random forests (RF), naive Bayes (NB), artificial neural networks (ANN) and deep learning networks (DNN). Using a dataset of 69776 computer science research papers, which were downloaded from arXiv, we were able to classify them into thirty-five categories with an f1-score of 0.60. Logistic regression was found to be the best classifier, followed closely by deep learning networks.

The rest of this paper is organised as follows. Section 2 presents a background on document classification and machine learning classifiers. The literature review is described in Section 3. Section 4 presents the methodology. Section 5 describes how the classification systems have been implemented, evaluated and tested. The paper comes to its conclusion in Section 6.

http://dx.doi.org/10.12785/ijcds/0906014