International Journal of Advanced and Applied Sciences, 8(2) 2021, Pages: 77-84
Contents lists available at Science-Gate
Journal homepage: http://www.science-gate.com/IJAAS.html

Content analytics based on random forest classification technique: An empirical evaluation using online news dataset

Puteri N. E. Nohuddin 1, *, Wan M. U. Noormanshah 1, Zuraini Zainol 2
1 Institute of IR4.0, National University of Malaysia, Bangi, Malaysia
2 Department of Computer Science, Faculty of Science and Defence Technology, National Defence University of Malaysia, Kuala Lumpur, Malaysia

ARTICLE INFO
Article history: Received 21 June 2020; Received in revised form 1 October 2020; Accepted 7 October 2020

ABSTRACT
This paper presents a study that applies a document classification technique to categorize a set of random online documents. The technique aims to assign one or more classes or categories to a document, making it easier to manage and sort. The paper describes an experiment on the proposed method for classifying documents effectively using the decision tree technique. The proposed research framework is Document Analysis based on the Random Forest Algorithm (DARFA). The framework consists of five components: (i) Document dataset, (ii) Data Preprocessing, (iii) Document Term Matrix, (iv) Random Forest classification, and (v) Visualization. The proposed classification method can analyze the content of document datasets and classify documents according to their text content. The framework uses algorithms that include TF-IDF and Random Forest. The outcome of this study can enhance document management procedures such as managing documents in daily business operations, consolidating inventory systems, organizing files in databases, and categorizing document folders.
Keywords: Classification; Random forest; Document term matrix; Term frequency–inverse document frequency

© 2020 The Authors. Published by IASE. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

* Corresponding Author. Email Address: puteri.ivi@ukm.edu.my (P. N. E. Nohuddin)
https://doi.org/10.21833/ijaas.2021.02.011
Corresponding author's ORCID profile: https://orcid.org/0000-0003-0627-5630
2313-626X/© 2020 The Authors. Published by IASE.

1. Introduction

The invention of many advanced computer technologies gives more people more tools to generate and share information than ever before. Data growth easily goes unnoticed because most of it happens behind the scenes. The era of digital data explosion has thus produced a large volume of data. It is also reported that 80%-90% of future data growth will be in the form of unstructured text databases that may potentially contain interesting patterns and trends (Zainol et al., 2018). According to Google, it processed about 20 petabytes of data per day, and the volume continued to accumulate steadily each year up to 2018. The size of data has increased up to 2.5 quintillion bytes (Dean and Ghemawat, 2008). Nevertheless, one of the big challenges in handling big data is processing these raw data into interesting and useful information and insight. Data can be categorized in many forms, such as structured (e.g., databases), semi-structured (e.g., markup language XML, open standard JSON, NoSQL, etc.), and unstructured (e.g., text files, email, social media data, websites, etc.). In general, data is processed and cleaned so that it can be analyzed, measured, and visualized as information for a specific purpose. Valuable and nontrivial knowledge is then derived from the significant information.
Knowledge discovery in databases (KDD) is a systematic process of mining interesting patterns and knowledge in a massive dataset. KDD consists of seven (7) main steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation (Han et al., 2011). One of the core KDD activities is data mining, which performs the extraction of interesting knowledge patterns. Data mining (DM) embraces several different techniques and algorithms, including regression, link analysis, and segmentation (Dean and Ghemawat, 2008); an example of DM techniques can be found in Nohuddin et al. (2018). Association rules, clustering, prediction, and classification are important techniques in DM. These techniques are divided into two (2) forms: supervised learning and unsupervised learning. Both types cover functions capable of discovering different hidden patterns in large datasets.
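
The distinction between the two forms can be illustrated with a minimal Python sketch. It is not the method proposed in this paper: a nearest-centroid classifier stands in for supervised learning and a small k-means-style routine stands in for unsupervised learning, both chosen only for brevity. The toy corpus, the labels, and all function names (`bag`, `cosine`, `train_centroids`, `classify`, `cluster`) are hypothetical and introduced for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the labels are used only in the supervised case.
docs = [
    "election vote parliament minister",
    "minister policy vote government",
    "football goal match league",
    "match league striker goal",
]
labels = ["politics", "politics", "sports", "sports"]


def bag(text):
    """Turn a text into a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(docs, labels):
    """Supervised: sum the term counts of the documents of each known class."""
    centroids = {}
    for text, label in zip(docs, labels):
        centroids.setdefault(label, Counter()).update(bag(text))
    return centroids


def classify(text, centroids):
    """Assign an unseen document to the class with the most similar centroid."""
    vec = bag(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def cluster(docs, k=2, rounds=5):
    """Unsupervised: group documents into k clusters without using any labels."""
    vecs = [bag(d) for d in docs]
    # Deterministic seeding: the first document, then the document that is
    # least similar to the seeds chosen so far.
    seeds = [vecs[0]]
    while len(seeds) < k:
        seeds.append(min(vecs, key=lambda v: max(cosine(v, s) for s in seeds)))
    centroids = [Counter(s) for s in seeds]
    assignment = []
    for _ in range(rounds):
        # Assign each document to its most similar centroid.
        assignment = [
            max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vecs
        ]
        # Recompute each centroid from its members; keep the old one if empty.
        updated = [Counter() for _ in range(k)]
        for v, i in zip(vecs, assignment):
            updated[i].update(v)
        centroids = [updated[i] if updated[i] else centroids[i] for i in range(k)]
    return assignment


model = train_centroids(docs, labels)
print(classify("vote for the new minister", model))  # politics
print(cluster(docs))  # [0, 0, 1, 1]
```

In the supervised case the class names come from the training labels, whereas the clustering routine only returns anonymous group indices that an analyst must interpret afterwards.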