International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 06 Issue: 11 | Nov 2019 | www.irjet.net
© 2019, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 2374

Determining Document Relevance using Keyword Extraction

Vinay Patil 1, Sachin Jadhav 2, Pawan Lokapur 3, Akash Mhatre 4, Prof. Tushar Ghorpade 5
1,2,3,4 Student, Dept. of Computer Engineering, Ramrao Adik Institute of Technology
5 Assistant Professor, Dept. of Computer Engineering, Ramrao Adik Institute of Technology

Abstract - This paper, in the data analysis domain, describes a system that searches a large set of documents for the relevant ones and, more specifically, fetches a summary answer to a given query from one of the selected relevant documents. In organizations that hold large collections of documents which clients and customers need to retrieve frequently, searching for a document manually becomes a tedious task. An educational institution such as Mumbai University, which maintains a large corpus of documents, is one example. To replace this manual searching, we propose a system that fetches the desired documents based on the query the user provides. The documents are processed in advance, at upload time, and the statistics required by the search algorithm are stored. Keywords are extracted from the documents with the TF-IDF algorithm, and at search time the search algorithm analyses the TF-IDF weights of the keywords to fetch the desired set of documents. A feedback mechanism lets the user upvote or downvote a particular document, so that the system can learn and improve its future searches.
The system is intended to deliver accurate results for every user query with low processing time. It contains three operational elements: keyword extraction, a search module, and topic selection.

General Terms: Data Mining, Data Analysis, Keyword Extraction.

Keywords: TF-IDF, QnA, Document Search, Artificial Feedback, Keyword Extraction.

1. INTRODUCTION

In today's digitalized world, data arrives in huge amounts from many sources such as news, social media, banking, and education. Because this growth is unregulated and unordered, there is a need for an automated information retriever that helps users find relevant information in piles of unstructured data. Implementing one poses challenges, however, such as retrieving the correct sense of the information. Information retrieval has emerged as an important research area in the recent past, and a study of existing work is useful for carrying research further.

Using keywords to predict document contents is an accurate and fast method. Keywords can serve as entry points into an index, which then helps to identify files, records, texts, or any unstructured data. However, a large amount of data is not marked with keywords, and tagging it by hand is difficult and time-consuming. There is therefore a need for an algorithm that tags every document or data item with its relevant keywords. This paper presents a keyword extraction algorithm and proposes a faster method of searching through the extracted keywords.

TF-IDF (Term Frequency-Inverse Document Frequency) is a simple method that serves the purpose of retrieving keywords with high accuracy. It is based on the number of times a word appears in a document and the number of times it appears in the whole corpus.
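The TF-IDF weighting described above can be illustrated with a minimal Python sketch. The paper does not state its exact formula, so this sketch assumes the common textbook variant: term frequency is the term count divided by the document length, and inverse document frequency is the logarithm of the corpus size divided by the number of documents containing the term. The function names `tf_idf` and `top_keywords` and the sample corpus are hypothetical and are not taken from the proposed system; tokenization and stop-word removal are assumed to have been done beforehand.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    `corpus` is a list of documents, each a list of tokens.
    Assumed variant: tf = count / document length,
    idf = log(number of documents / documents containing the term).
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Return the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["exam", "timetable", "exam", "notice"],
    ["admission", "notice", "fee"],
    ["exam", "result", "notice"],
]
weights = tf_idf(corpus)
print(top_keywords(weights[0], k=2))
```

In this sample corpus the word "notice" occurs in every document, so its inverse document frequency, and hence its weight, is zero, whereas a word confined to a single document is weighted more heavily.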
The more often a word occurs in a document, the higher its weight; conversely, the more often it occurs across the corpus, the lower its weight. The keywords extracted by this algorithm can be used in the search algorithm we propose, which uses combinations and sets to produce the list of documents that contain the search query. The paper also discusses the different database structures used for text extraction and a quick question-answering bot that can be used in different domains. Finally, it briefly discusses the issues and research challenges we faced, along with future directions.

2. LITERATURE REVIEW

In the LDA-based paper [1], the system applies a generative probabilistic model to a Chinese corpus and categorizes documents under different topic names. Its major drawbacks are that it does not remove stop words and that the topic names or subjects must be known in advance, which limits the corpus contents. In the TF-IDF-based paper [2], a system is developed that searches social media platforms for the latest trends, which can then be used for advertising. The paper presents an algorithm that monitors Instagram accounts for 50 recently posted photos and their captions. The captions are analysed to identify the latest trends and fashion accessories, so that those products can be marketed effectively. The system's limitation is that it covers only the 20 most-followed users on Instagram, so its scope is narrowed to those 20 users. The modified-IDF paper [3] focuses more on ambiguous word sensing using a KNN approach, which has less worth