Sumathi Pawar, Manjula Gururaj Rao, Karuna Pandith, Text document categorisation using random forest and C4.5 decision tree classifier, International Journal of Computational Systems Engineering, Vol. 7, No. 2-4, 11 Aug 2023, pp. 211-220. https://www.inderscienceonline.com/doi/abs/10.1504/IJCSYSE.2023.132924

Text Document Categorization Using Random Forest and C4.5 Decision Tree Classifier

Dr Sumathi Pawar 1*, Dr Karuna Pandith 2, Dr Manjula Gururaj Rao 3
1,2,3 Information Science and Engineering, NMAMIT, Nitte University, Karkala, Karnataka, India, 574110
*Corresponding Author: pawarsumathi@gmail.com

Abstract: Documentation is a significant and rapidly developing field, because the time available for preparing documentation is limited. Applications of text classification include language and item identification, document indexing, populating hierarchical catalogues of web resources, and word sense disambiguation. Numerous texts serve as documentation, and categorization strategies have been developed to handle them more efficiently. The proposed system focuses on categorizing text documents using the ensemble learning technique of Random Forest and the C4.5 decision tree classifier. The system's processes include constructing decision tree text classifiers, training the constructed models as part of the implementation, dimensionality reduction, tf/idf indexing of the documents, clustering the terms using Brown clustering, and running the testing dataset through the classifiers to categorize the documents. The Orange tool and Python libraries are used to implement the system. It is found that the Random Forest approach increases efficiency due to proper construction of the text classifiers.

Keywords: Dimensionality Reduction, KE Approach, Indexing, Machine Learning, Tf/Idf, Ensemble Learning.
Acknowledgements: I acknowledge Dr Karuna Pandith and Dr Manjula Gururaj Rao for their valuable inputs. Dr Karuna Pandith analyzed the mathematical equations to be used, and Dr Manjula Gururaj Rao helped in implementing the C4.5 algorithm.

1. INTRODUCTION

The activity of assigning texts to pre-defined categories is called text categorization (TC). Knowledge extraction is the basis of classification. Depending on whether a single label or multiple labels can be applied to a document, classification is either single-label or multi-label. Classification procedures can also be used to rank documents, so that only the top-ranked documents are classified rather than the complete collection [1]. Other uses of text categorization include automatic document organisation and indexing, text filtering, resolving ambiguities in natural language, speech categorization, multimedia document categorization, language identification and automatic essay grading [2][3].

In the KE (Knowledge Engineering) approach, rules need to be framed, one rule per category. The first approach uses rules of the form DNF formula -> category, where the condition is written in disjunctive normal form (DNF). The rules are framed by a knowledge engineer in association with a domain engineer, so this is a manual approach. The system instead requires an automatic approach that builds a classifier, or learner, which is the Machine Learning (ML) approach. In the ML approach, training, testing and validation sets are available.
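
The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not the system described in this paper: the rule WHEAT_RULE, the category it stands for, and the helper names tokenize, matches_dnf and split_dataset are all hypothetical, chosen only to show what a hand-written DNF -> category rule looks like and how labeled documents might be divided into training, validation and testing sets for the ML approach. The 60/20/20 split ratio is likewise an assumption.

```python
import random

# Hypothetical hand-written KE rule for an assumed "wheat" category.
# A DNF formula is an OR of clauses; each clause is an AND of literals.
# A literal is (term, required): required=True means the term must occur,
# required=False means the term must be absent.
WHEAT_RULE = [
    [("wheat", True), ("farm", True)],
    [("wheat", True), ("commodity", True)],
    [("bushels", True), ("export", True)],
    [("wheat", True), ("winter", True), ("soft", False)],
]


def tokenize(text):
    """Lowercase, split on whitespace and strip simple punctuation."""
    return {word.strip(".,;:!?()\"'") for word in text.lower().split()}


def matches_dnf(rule, text):
    """Return True if at least one clause has all of its literals satisfied."""
    terms = tokenize(text)
    return any(
        all((term in terms) == required for term, required in clause)
        for clause in rule
    )


def split_dataset(items, train=0.6, validation=0.2, seed=0):
    """Shuffle a copy of items and cut it into train/validation/test parts.

    The fractions are illustrative assumptions; the remainder after the
    training and validation parts becomes the testing set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


if __name__ == "__main__":
    print(matches_dnf(WHEAT_RULE, "Wheat farm output rose."))
    train_set, val_set, test_set = split_dataset(range(10))
    print(len(train_set), len(val_set), len(test_set))
```

In the KE approach every such rule has to be written and maintained by hand for each category, whereas in the ML approach a learner induces the classifier from the training set, the validation set is used for tuning, and the testing set is held back for the final evaluation.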