International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 1182
Automated Document Summarization and Classification Using Deep
Learning
Krushna Sharma¹, Avinash Gaikwad², Swapnil Patil³, Pradeep Kumar⁴, D.P. Salapurkar⁵
¹,²,³,⁴ B.E. (Computer Engineering), Sinhgad College of Engineering, Pune, Maharashtra, India
⁵ Assistant Professor, Dept. of Computer Engineering, Sinhgad College of Engineering, Pune, Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – The exponential growth of the Internet has led to
a great deal of interest in developing useful and efficient tools
and software to assist users in searching the web for relevant
documents. Document classification is generally defined as the
content-based assignment of one or more predefined
categories to documents. Document classification appears in
many applications, including email filtering and news monitoring.
It is not feasible to classify these documents manually, and
present automated classification methods have drawbacks such as
low accuracy and dependence on humans for document
tagging.
The proposed system uses deep learning methods to speed up
the classification process and recommend relevant documents.
The proposed deep learning algorithm, 'Recurrent Neural
Network with Convolutional Neural Network', helps in
constructing a robust classifier model using a variety of data
for training. This classifier can then be improved to classify
documents in a business database automatically.
Key Words: Summarization, Classification, Neural
Network, Deep Learning, Recurrent Neural Network
(RNN), Convolutional Neural Network (CNN), Recurrent
Convolutional Neural Network (RCNN).
1. INTRODUCTION
The automated categorization (or classification) of texts into
predefined categories has witnessed a booming interest in
the last 10 years, due to the increased availability of
documents in digital form and the ensuing need to organize
them. In the research community the dominant approach to
this problem is based on machine learning techniques: a
general inductive process automatically builds a classifier by
learning, from a set of pre-classified documents, the
characteristics of the categories. The advantages of this
approach over the knowledge engineering approach
(consisting in the manual definition of a classifier by domain
experts) are very good effectiveness, considerable savings
in terms of expert labor power, and straightforward
portability to different domains.
Until the late '80s the most popular approach to text
categorization (TC), at least in the "operational" (i.e., real
world applications) community, was a knowledge engineering (KE) one,
consisting in manually defining a set of rules encoding expert
knowledge on how to classify documents under the given
categories. In the '90s this approach increasingly lost
popularity (especially in the research community) in favor of
the machine learning (ML) paradigm, according to which a
general inductive process automatically builds an automatic
text classifier by learning, from a set of pre-classified
documents, the characteristics of the categories of interest.
The advantages of this approach are an accuracy comparable
to that achieved by human experts, and considerable
savings in terms of expert labor power, since no intervention
from either knowledge engineers or domain experts is
needed for the construction of the classifier or for its porting
to a different set of categories.
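The inductive process described above can be illustrated with a minimal sketch. It is not the classifier proposed in this paper: it assumes a simple multinomial Naive Bayes model with add-one smoothing, chosen only because it fits in a few lines, and the function names (tokenize, train, classify) and the toy documents in the usage note are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def train(labeled_docs):
    """Learn category characteristics from (text, category) pairs."""
    doc_counts = Counter()              # number of documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothed word likelihood.
                count = word_counts[category][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

Trained on a handful of pre-classified documents, for example two labeled "spam" and two labeled "work", classify() assigns a new text to whichever category's word statistics it matches best. Porting to a different set of categories only requires a different labeled training set, with no hand-written rules.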
The proposed system implements a Recurrent Neural Network
along with a Convolutional Neural Network to build the
classifier model. Only the summary of each document is used in
the classification phase, which speeds up the training phase
considerably.
1.1 Background and Basics
With the dramatic growth of the Internet, people are
overwhelmed by the tremendous amount of online
information and documents. This expanding availability of
documents has demanded exhaustive research in the area of
automatic text summarization. A summary is defined as "a
text that is produced from one or more texts, that conveys
important information in the original text(s), and that is no
longer than half of the original text(s) and usually,
significantly less than that". Automatic text summarization is
the task of producing a concise and fluent summary while
preserving key information content and overall meaning. In
recent years, numerous approaches have been developed for
automatic text summarization and applied widely in various
domains. For example, search engines generate snippets as
the previews of the documents. Other examples include news
websites which produce condensed descriptions of news
topics usually as headlines to facilitate browsing or
knowledge extractive approaches. Automatic text
summarization attracted attention as early as the 1950s. An
important research focus in those days was summarizing
scientific documents.
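The definition of a summary quoted above (no longer than half of the original text) can be made concrete with a small extractive sketch. This is an illustration under stated assumptions, not the summarizer used by the proposed system: it assumes sentences can be split on terminal punctuation, scores each sentence by the average frequency of its words, and uses hypothetical names (split_sentences, summarize, max_ratio).

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")


def split_sentences(text):
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_ratio=0.5):
    """Pick high-scoring sentences whose total length stays within
    max_ratio of the original word count, kept in original order."""
    sentences = split_sentences(text)
    freq = Counter(WORD.findall(text.lower()))
    budget = int(sum(freq.values()) * max_ratio)  # word budget for the summary

    def score(sentence):
        tokens = WORD.findall(sentence.lower())
        # Average word frequency, so long sentences are not favored by length.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(WORD.findall(sentences[i].lower()))
        if used + length <= budget:
            chosen.append(i)
            used += length
    # Restore document order so the summary reads naturally.
    return " ".join(sentences[i] for i in sorted(chosen))
```

The max_ratio default of 0.5 mirrors the "no longer than half" bound in the definition; a smaller value yields the "significantly less than that" case.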
Document classification or document categorization is a
problem in library science, information science and computer
science. The task is to assign a document to one or more
classes or categories. This may be done "manually" (or
"intellectually") or algorithmically. The intellectual
classification of documents has mostly been the province of
library science, while the algorithmic classification of
documents is mainly in information science and computer