Asian Journal of Computer Science And Information Technology 5:11 (2015) 62 – 66.
Journal Homepage: http://innovativejournal.in/ajcsit/index.php/ajcsit
AN ENUMERATIVE FRAMEWORK FOR EXTRACTION OF BAG-OF-WORDS FROM
LEGAL DOCUMENTS
Basaveswar Rao. B, B.V.Rama Krishna, Gangadhara Rao. K, Chandan. K
Dept. Of Computer Science & Engineering, Acharya Nagarjuna University-522510, India
ARTICLE INFO
Corresponding Author: B.V.Rama Krishna, Dept. Of Computer Science &
Engineering, Acharya Nagarjuna University-522510, India
Key Words: Stop-Words, Stemming, Porter Stemmer, Bag-of-Words,
Judgments, Word-Frequency.
DOI: http://dx.doi.org/10.15520/ajcsit.v5i11.35
ABSTRACT
In this paper an enumerative framework is developed for the extraction of
Bag-of-Words from legal documents. For this purpose, 100 judgments of the
Supreme Court of India related to dowry cases are considered. The case notes
of these judgments are taken as text input, and a set of Bag-of-Words is
extracted from them. A novel algorithm is presented and implemented for this
purpose. To filter insignificant words out of the Bag-of-Words, a threshold
value is applied to the word frequencies. The resulting Bag-of-Words may be
used in Data Mining applications for knowledge discovery from judgments.
©2015, AJCSIT, All Rights Reserved.
1. INTRODUCTION
Text mining is an emerging research area because most
information is now available in the form of electronic
documents. These documents are found in digital libraries,
online chats, e-mails, social media and as fields in downloaded
PDF/Word documents. When analyzed, they can have considerable
influence in areas such as marketing, finance, medicine and
law. Both the public and private sectors use these data
repositories as a source of data for Data Mining and Text
Mining techniques. New knowledge discovery and Information
Retrieval techniques are needed to make better use of
these resources.
Data mining tasks such as association rule mining,
generalization, classification, clustering and outlier
analysis can be applied to text documents during the
Text Mining process [19]. Natural Language Processing (NLP)
plays a key role in text mining, as it supports a wide range
of services such as syntactic parsing, linguistic analysis,
word stemming, multi-word phrase grouping, synonym
normalization, part-of-speech tagging, word sense
disambiguation, anaphora resolution and role determination.
NLP techniques increase the effectiveness of text mining on
natural language documents [1]. Machine Learning supports
both supervised and unsupervised learning techniques [7],
which perform well and with good accuracy in text mining.
Some popular machine learning techniques used in text mining
are Self-Organizing Maps (SOM), Support Vector Machines (SVM),
Bayesian Networks, Boosting algorithms, Latent Variable models
and Helmholtz Machines [2]. Machine Learning provides
classification, clustering, filtering, extraction, retrieval
and data mining services for text processing [18]. Information
Retrieval Systems (IRS) employ artificial intelligence
mechanisms not only to retrieve information but also to aid
decision support [14]. IRS draw on tautology, Boolean algebra
and fuzzy logic to extract knowledge from information
effectively.
Information Retrieval in legal documents has been a
key area of research over the past decade. Classification,
clustering and other data mining techniques are used, and
all of these studies rely on document representation.
Information Retrieval Systems employed to retrieve
information from legal documents are called
‘Legal IR Systems’ [9]. With the advent of knowledge
engineering, Artificial Intelligence and Case-Based
Reasoning have been tailored to design a new generation of
Legal IRS. The Knowledge Discovery in Databases (KDD)
approach to legal documents is a staged extraction of
knowledge from data repositories. The essential stages in
legal data mining are preprocessing, extraction,
transformation, loading, rule mining, classification,
clustering and visualization. New Data Mining techniques and
innovative procedures are needed to make better use of these resources.
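To make the preprocessing and extraction stages concrete, the following
minimal Python sketch builds a bag-of-words from raw text by tokenizing,
dropping stop-words and keeping only the words whose frequency meets a
threshold. It is an illustration only, not the algorithm presented in this
paper: the function name extract_bag_of_words, the short stop-word list,
the threshold value and the sample text are all assumptions, and stemming
is omitted.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and",
              "is", "was", "by", "for", "on"}


def extract_bag_of_words(text, min_freq=2):
    """Return {word: count} for non-stop-words seen at least min_freq times."""
    # Lowercase the text and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count every token that is not a stop-word.
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Apply the frequency threshold to drop insignificant words.
    return {word: n for word, n in counts.items() if n >= min_freq}


# Hypothetical case-note fragment, invented for this example.
sample = (
    "The appellant demanded dowry. The demand for dowry was proved "
    "by the prosecution, and the appellant was convicted."
)
bag = extract_bag_of_words(sample, min_freq=2)
print(bag)
```

Because no stemming is applied, "demanded" and "demand" are counted as
different words and both fall below the threshold; a stemmer such as the
Porter stemmer would merge such variants before counting.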
A ‘Bag of Words’ (BoW) is used to associate user
queries with the documents retrieved. It is also used
in Machine Learning applications over text documents.
Further, BoW reduces the time complexity of analysis
and increases its accuracy. Classification and ranking
of judgments based on a user query is the goal of
judicial search engines. In this process, BoW extraction
from legal documents is an essential phase [10], and it
provides a basis for applying Data Mining techniques to
the data structures created. Not much research has been
done in this direction on Indian legal documents.
The main goal of this paper is to provide a