IJSRSET18495 | Received : 01 July 2018 | Accepted : 10 July 2018 | July-August-2018 [ 4 (9) : 51-58]
© 2018 IJSRSET | Volume 4 | Issue 9 | Print ISSN: 2395-1990 | Online ISSN : 2394-4099
Themed Section : Engineering and Technology
Document Categorization by using Weighted J48 Classifier
Sonali Suskar, Dr. S. D. Babar
Department of Computer Engineering, SIT College of Engineering, Lonavala, Maharashtra, India
ABSTRACT
In the field of information retrieval, text categorization is currently a key research area. Text categorization
selects entries from a set of predefined categories and assigns them to a document. Learning in a high-dimensional
data space is a challenge for text categorization methods: high-dimensional features can incur heavy computational
overhead and can degrade classifier performance because of irrelevant and redundant features. To mitigate the
"curse of dimensionality" and to speed up classifier training, it is important to perform feature reduction to
reduce the number of features. This paper introduces a Bayesian classification approach and a WeightedJ48
classifier for automatic text categorization using class-specific features. For text classification, the proposed
strategy selects a specific feature subset for every class. The presented system reconstructs the PDF in the raw
data space from the class-specific PDF in the low-dimensional feature space and builds the Bayes classification
rule using Baggenstoss's PDF Projection Theorem. A notable advantage of this methodology is that many feature
selection criteria can be used with it. The WeightedJ48 classifier saves time and memory. The proposed system also
uses term weighting for pre-processing. These methods improve classification accuracy, the feature selection
process, and overall system performance.
Keywords: Text categorization, class-specific features, Feature selection, PDF projection and estimation, dimension
reduction, WeightedJ48, Term weighting.
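The class-specific Bayes rule summarized in the abstract can be sketched as follows. This uses the usual statement of the PDF Projection Theorem; the symbols x (raw document vector), T_j (feature transform for class j), z_j (class-specific features), H_j (class hypothesis), H_{0,j} (reference hypothesis for class j) and P(H_j) (class prior) are notation assumed here for illustration and are not defined in the text above.

```latex
\[
  z_j = T_j(x), \qquad
  \hat{p}(x \mid H_j)
    = \frac{p(x \mid H_{0,j})}{p(z_j \mid H_{0,j})}\;
      \hat{p}(z_j \mid H_j)
\]
\[
  \hat{\jmath} = \arg\max_{j}\; \hat{p}(x \mid H_j)\, P(H_j)
\]
```

The first line projects the low-dimensional density estimate for class j back to the raw data space, so that likelihoods computed from different feature subsets become comparable. The second line is the resulting Bayes decision rule.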
I. INTRODUCTION
As the volume of data on the Internet and within
companies grows, there is a strong need for methods
that can filter and manage such large amounts of
information.
The main applications are sorting free-text files into
predefined categories, categorizing emails and files
into folder trees, topic labelling, specialized
processing operations, structured search, and browsing
or searching for files that match long-term interests
or dynamic, task-dependent interests. In many contexts
professionals are employed to classify new items, but
this procedure is time-consuming and expensive, which
limits its applicability. Consequently, there is
growing interest in research and development of
methods for automatic text categorization. Various
classification and machine-learning techniques have
been developed for text categorization, such as
rule-learning algorithms, nearest neighbour
classifiers, Support Vector Machines, and decision
trees.
Text categorization (TC), also described as text
classification, automatically assigns documents to a
predefined set of categories. This process is used in
many systems, including automated indexing of
scientific articles based on predefined thesauri of
technical terms, filing patents into patent
directories, selective dissemination of data to data
consumers, hierarchical catalogues for automated