International Journal of Advanced and Applied Sciences, 8(2) 2021, Pages: 77-84
Contents lists available at Science-Gate
International Journal of Advanced and Applied Sciences
Journal homepage: http://www.science-gate.com/IJAAS.html
Content analytics based on random forest classification technique: An
empirical evaluation using online news dataset
Puteri N. E. Nohuddin 1,*, Wan M. U. Noormanshah 1, Zuraini Zainol 2
1 Institute of IR4.0, National University of Malaysia, Bangi, Malaysia
2 Department of Computer Science, Faculty of Science and Defence Technology, National Defence University of Malaysia, Kuala Lumpur, Malaysia
ARTICLE INFO
Article history:
Received 21 June 2020
Received in revised form 1 October 2020
Accepted 7 October 2020
ABSTRACT
This paper presents a study that applies a document classification
technique to categorize a set of random online documents. The technique
aims to assign one or more classes or categories to a document, making it
easier to manage and sort. This paper describes an experiment on the
proposed method for classifying documents effectively using the decision
tree technique. The proposed research framework is Document Analysis
based on the Random Forest Algorithm (DARFA). The framework consists of
five components: (i) Document dataset, (ii) Data Preprocessing,
(iii) Document Term Matrix, (iv) Random Forest classification, and
(v) Visualization. The proposed classification method can analyze the
content of document datasets and classify documents according to their
text content. The framework uses algorithms that include TF-IDF and the
Random Forest algorithm. The outcome of this study can enhance document
management procedures such as managing documents in daily business
operations, consolidating inventory systems, organizing files in
databases, and categorizing document folders.
Keywords:
Classification
Random forest
Document term matrix
Term frequency–inverse document frequency
© 2020 The Authors. Published by IASE. This is an open access article under the CC
BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
The invention of many advanced computer
technologies gives more people more tools
to generate and share information than ever before.
Data growth easily goes unnoticed because most of it
happens behind the scenes. Thus, the era of the digital
data explosion has produced a large volume of data.
It is also reported that 80%-90% of future data growth
will be in the form of unstructured text databases that
may potentially contain interesting patterns and
trends (Zainol et al., 2018). Google reportedly
managed about 20 petabytes of data per day, and this
volume continued to accumulate steadily each year up
to 2018. The size of data has increased to 2.5
quintillion bytes (Dean and Ghemawat, 2008).
Nevertheless, one of the big challenges in handling big
data is processing these raw data into
interesting and useful information and insight. Data
can be categorized in many forms such as structured
(e.g., databases), semi-structured (e.g., markup
language XML, open standard JSON, NoSQL, etc.), and
unstructured (e.g., text files, email, social media data,
websites, etc.).
* Corresponding Author.
Email Address: puteri.ivi@ukm.edu.my (P. N. E. Nohuddin)
https://doi.org/10.21833/ijaas.2021.02.011
Corresponding author's ORCID profile:
https://orcid.org/0000-0003-0627-5630
2313-626X/© 2020 The Authors. Published by IASE.
This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/)
In general, data is processed and cleaned so that it can be
analyzed, measured, and visualized as information for a
specific purpose. Valuable and nontrivial knowledge is
then derived from the significant information.
Knowledge discovery in databases (KDD) is a
systematic process of mining interesting patterns
and knowledge in a massive dataset. KDD consists of
seven (7) main steps, which are: data cleaning, data
integration, data selection, data transformation, data
mining, pattern evaluation, and knowledge
representation (Han et al., 2011). One of the core
KDD activities is data mining, which performs the
extraction of interesting knowledge patterns. Data
mining (DM) embraces several different techniques
and algorithms that are fitted to the data; examples
of DM techniques can be found in Nohuddin
et al. (2018). These include regression, link analysis,
and segmentation (Dean and Ghemawat, 2008).
Association rules, clustering, prediction, and
classification are important techniques in DM. These
techniques fall into two (2) forms:
supervised learning and unsupervised learning. Both
forms cover functions capable of discovering
different hidden patterns in large datasets.
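The distinction between the two forms can be illustrated with a minimal Python sketch. It is not the method evaluated in this paper (which relies on TF-IDF and Random Forest); it only assumes a toy word-overlap classifier for the supervised case and a greedy shared-word grouping for the unsupervised case. The documents, labels, stopword list, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical stopword list, used only for this illustration.
STOPWORDS = {"the", "a", "as", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Supervised learning: every training document carries a known label.
labeled_docs = [
    ("the team won the match", "sports"),
    ("the striker scored a goal", "sports"),
    ("the bank raised interest rates", "business"),
    ("shares fell as the market closed", "business"),
]

def train(docs):
    """Build one word-count profile per class from the labeled documents."""
    profiles = {}
    for text, label in docs:
        profiles.setdefault(label, Counter()).update(tokenize(text))
    return profiles

def classify(profiles, text):
    """Return the class whose profile overlaps most with the words of the text."""
    words = tokenize(text)
    scores = {label: sum(profile[w] for w in words)
              for label, profile in profiles.items()}
    return max(scores, key=scores.get)

# Unsupervised learning: no labels, documents are grouped by shared words.
unlabeled_docs = ["team match today", "match goal", "bank rates", "market bank"]

def cluster(docs):
    """Greedy grouping: a document joins the first group it shares a word with."""
    clusters = []  # each entry is (vocabulary set, list of document indices)
    for index, text in enumerate(docs):
        words = set(tokenize(text))
        for vocabulary, members in clusters:
            if vocabulary & words:
                vocabulary.update(words)
                members.append(index)
                break
        else:
            clusters.append((words, [index]))
    return [members for _, members in clusters]

profiles = train(labeled_docs)
print(classify(profiles, "bank market"))  # label predicted from learned profiles
print(cluster(unlabeled_docs))            # groups of document indices, no labels used
```

In the supervised case the class names come from the training labels, whereas in the unsupervised case the groups are discovered from the documents alone and carry no names.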