INTERNATIONAL RESEARCH JOURNAL OF ENGINEERING AND TECHNOLOGY (IRJET) E-ISSN: 2395-0056
VOLUME: 06 ISSUE: 11 | NOV 2019 WWW.IRJET.NET P-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2374
Determining Document Relevance using Keyword Extraction
Vinay Patil¹, Sachin Jadhav², Pawan Lokapur³, Akash Mhatre⁴, Prof. Tushar Ghorpade⁵
¹,²,³,⁴ Student, Dept. of Computer Engineering, Ramrao Adik Institute of Technology
⁵ Assistant Professor, Dept. of Computer Engineering, Ramrao Adik Institute of Technology
----------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - This paper belongs to the data analysis domain
and describes a system that searches for a relevant
document in a large set of documents or, more
specifically, fetches a summary of the answer to a given
query from one of the selected relevant documents. In
organizations that hold large data sets in the form of
documents, which clients and customers need to retrieve
frequently, searching for a document manually becomes a
tedious task. An example is an educational institute such
as Mumbai University, which has a large corpus of
documents. To replace this manual search, we propose a
system that fetches the desired documents for the user
based on the query provided to the system. This is done
by extracting keywords from each document in advance, at
the time of uploading, and storing the statistics
required by the search algorithm. For keyword extraction
we use the TF-IDF algorithm. During search, the search
algorithm analyses the TF-IDF weights of the keywords to
fetch the desired set of documents for the user. There is
also a feedback mechanism through which the user can
upvote or downvote a particular document, allowing the
system to learn and improve its future searches. The
system is intended to deliver accurate results for every
user query with low processing time. The system contains
three operational elements: keyword extraction, a search
module and topic selection.
General Terms: Data Mining, Data Analysis, Keyword
Extraction.
Keywords: TF-IDF, QnA, Document Search, Artificial
Feedback, Keyword Extraction.
1. INTRODUCTION
In the current era of digitalization, data arrives in
huge amounts from many sources such as news, social
media, banking and education. Because of this unregulated
and unordered growth in data, there is a need for an
automated information retriever that helps users retrieve
relevant information by searching through piles of
unstructured data. Implementing one poses challenges,
however, such as retrieving the correct sense of the
information. Information retrieval has emerged as an
important research area in the recent past, and a study
of existing work is useful for carrying out further
research. Using keywords to predict document contents is
an accurate and fast method. Keywords can be used as
entry points into an index, which then helps to identify
files, records, texts or any unstructured data. However,
a large amount of data is not tagged with keywords, and
handing it to a human for tagging is difficult and
time-consuming. Thus, there is a need for an algorithm
that tags every document or data item with its relevant
keywords. Hence, this paper presents a keyword extraction
algorithm and proposes a faster method of searching
through the extracted keywords. TF-IDF (Term
Frequency-Inverse Document Frequency) is a simple
algorithm that serves the purpose of retrieving keywords
with high accuracy. TF-IDF is based on the number of
times a word appears in a document and how often it
appears across the whole corpus. The more often a word
occurs in a document, the higher its weight; conversely,
the more often it occurs across the corpus, the lower its
weight. The keywords extracted by this algorithm can be
used in the search algorithm that we propose. This search
algorithm uses combinations and sets to return the list
of documents that contain the search query. The paper
also discusses the different database structures used for
text extraction and a quick question-answering bot that
can be used in different domains. Finally, it briefly
discusses the issues and research challenges we faced,
along with future directions.
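The weighting and search ideas described above can be illustrated with a
minimal Python sketch. It is not the implementation used in this paper. It
assumes the common formulation weight = tf * log(N / df), where tf is the
relative frequency of a term in a document, N is the number of documents and
df is the number of documents containing the term. The stop-word list, the
function names (tokenize, tfidf_weights, search) and the ranking rule (try the
largest combinations of query keywords first, then rank by summed weight) are
hypothetical choices made only for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; the paper does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "is", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop-words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_weights(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    weights = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def search(query, weights):
    """Assumed search rule: largest keyword combinations are tried first."""
    terms = sorted(set(tokenize(query)))
    # Inverted index: term -> set of documents that contain it.
    index = {}
    for doc_id, term_weights in weights.items():
        for term in term_weights:
            index.setdefault(term, set()).add(doc_id)
    for size in range(len(terms), 0, -1):
        scores = {}
        for combo in combinations(terms, size):
            # Documents that contain every keyword of this combination.
            matched = set.intersection(*(index.get(t, set()) for t in combo))
            for doc_id in matched:
                score = sum(weights[doc_id][t] for t in combo)
                scores[doc_id] = max(scores.get(doc_id, 0.0), score)
        if scores:
            # Highest summed weight first; ties broken by document id.
            return sorted(scores, key=lambda d: (-scores[d], d))
    return []
```

In this sketch a word that is frequent in one document but rare in the corpus
receives a high weight, and a query falls back to smaller keyword combinations
only when no document contains all of its keywords.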
2. LITERATURE REVIEW
In the LDA-based paper [1], the system uses a generative
probabilistic model on Chinese corpus data and
categorizes documents under different topic names. Its
major drawbacks are that it does not remove stop-words
and that the topic names or subjects need to be known in
advance, which limits the corpus contents.
In the TF-IDF-based paper [2], a system is developed that
searches social media platforms for the latest trends,
which can then be used for advertising purposes. The
paper presents an algorithm that monitors Instagram
accounts for 50 recently posted photos and their
captions. These captions are then analysed to identify
the latest trends and fashion accessories, which are used
to market those products effectively. The limitation of
this system is that it concentrates only on the 20 most
followed users on Instagram, and hence its scope is
narrowed to those 20 users.
The modified IDF paper [3] focuses more on ambiguous word
sensing using a KNN approach, which has less worth