International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 4464
Movie Genre Prediction from Plot Summaries by Comparing Various
Classification Algorithms
Aziz Rupawala¹, Dhruv Pujara², Mustakim Shikalgar³, Ekta Ukey⁴
¹,²,³ B.E. student, Dept. of Computer Engineering, PHCET, Maharashtra, India
⁴ Professor, Dept. of Computer Engineering, PHCET, Maharashtra, India
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - Before watching a film, viewers often read its
plot summary. A plot summary conveys not only the setting of
a film but also its genre. In this work, we classify movie
genres using plot summaries. We perform classification over
a wide range of genres, using different classification
algorithms such as SGD, Multinomial Naive Bayes, Random
Forest, and Logistic Regression, to find the best possible
result. Automating this step can save a great deal of time
and working hours.
Key Words: Movie Genre classification, Multinomial Naive
Bayes, Logistic regression, Random Forest, Stochastic
Gradient Descent
1. INTRODUCTION
Movie plot summaries reflect the genre of a movie, such as
romance, drama, or comedy, in a way that lets readers easily
grasp the category of the film. Classifying movies has been
troublesome for watch-list creators, as they had to go
through the entire movie plot and determine the genre
manually.
In the literature, several works carry out film genre
categorization using a range of sources such as audio [1],
video, and text from posters [2] and summaries [3]. It can
be inferred whether a plot summary indicates the genre of
the film to which it belongs. Thus, this approach can be
useful when training on film plots.
After prediction, these genre classification algorithms
select the best-suited results and plot a graph of the
genres with which the movie is associated.
Though the work revolves around genre classification, it is
mainly concerned with selecting the best-suited algorithms
and comparing their classification capabilities. Each
algorithm contributes differently to the work and to the
output.
2. METHODOLOGY
2.1 Corpus construction and Preprocessing
The corpus "movies_metadata.csv" for the model was obtained
from the website "https://www.kaggle.com". The corpus
consists of around 45,000 distinct items, containing movie
plots and the genres associated with the respective movies,
along with 17 redundant features. Out of the 18 genres
present in the corpus, we used 12 genres for training the
model. The remaining genres were excluded mainly because
they had too few data points for making proper predictions.
Table -1 shows the statistics of the corpus used.
Table -1: Distribution of the value counts for every genre
Genre Count
Drama 11966
Comedy 8820
Action 4489
Documentary 3415
Horror 2619
Crime 1685
Thriller 1665
Adventure 1514
Romance 1191
Animation 1124
Fantasy 704
Total 40393
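The genre selection described above can be sketched as a frequency filter. This is a minimal illustration and not the authors' code: it assumes each movie carries a single genre label, and the function names and the `min_count` threshold are hypothetical, since the paper does not state the cut-off it used.

```python
from collections import Counter


def select_genres(labels, min_count):
    """Return (genre, count) pairs with at least min_count examples,
    ordered from most to least frequent."""
    counts = Counter(labels)
    return [(genre, n) for genre, n in counts.most_common() if n >= min_count]


def filter_rows(rows, kept_genres):
    """Keep only the (plot, genre) rows whose genre was selected."""
    kept = set(kept_genres)
    return [(plot, genre) for plot, genre in rows if genre in kept]


if __name__ == "__main__":
    # Toy data only; the threshold of 2 is an assumption for illustration.
    rows = [
        ("a family falls apart", "Drama"),
        ("two friends open a bakery", "Comedy"),
        ("a widow rebuilds her life", "Drama"),
        ("a lone gunman rides into town", "Western"),
        ("a wedding goes wrong", "Comedy"),
    ]
    selected = select_genres([genre for _, genre in rows], min_count=2)
    kept_rows = filter_rows(rows, [genre for genre, _ in selected])
    print(selected)
    print(len(kept_rows))
```

Filtering on a count threshold keeps the rule explicit and easy to rerun if the corpus changes. Any actual threshold would have to be chosen against the real genre distribution.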
After data collection, we removed the redundant rows that
are not required for training our model. Furthermore, we
converted the plot texts to lower case and removed all null
values from the corpus. In addition, with the help of NLTK¹,
we discarded all abbreviations and stop words that are
irrelevant to the training of the model. We thus obtained a
cleaned corpus consisting of the relevant features required
for training our model: movie name, plot/summary, and genre.
¹ https://www.nltk.org/
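The cleaning steps described above (lowercasing, dropping null values, removing stop words) can be sketched with the Python standard library alone. This is a hedged illustration, not the authors' pipeline. The small `STOP_WORDS` set stands in for the NLTK stop-word list, the dictionary keys `title`, `plot`, and `genre` are hypothetical column names, and abbreviation removal is omitted because the paper does not specify how it was done.

```python
import re

# Assumption: a tiny illustrative stop-word set standing in for NLTK's list.
STOP_WORDS = frozenset(
    {"a", "an", "the", "and", "of", "in", "to", "is", "his", "her", "with"}
)


def clean_plot(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


def clean_corpus(rows):
    """Drop rows with a missing plot or genre and clean the remaining plots.

    rows is an iterable of dicts with the hypothetical keys
    'title', 'plot', and 'genre'.
    """
    cleaned = []
    for row in rows:
        plot = row.get("plot")
        genre = row.get("genre")
        if not plot or not genre:  # skip null or empty values
            continue
        cleaned.append(
            {"title": row.get("title"), "plot": clean_plot(plot), "genre": genre}
        )
    return cleaned


if __name__ == "__main__":
    sample = [
        {"title": "A", "plot": "The Hero returns to his Village.", "genre": "Drama"},
        {"title": "B", "plot": None, "genre": "Comedy"},
    ]
    for row in clean_corpus(sample):
        print(row)
```

Keeping only the three fields in the output mirrors the cleaned corpus described in the text: movie name, plot/summary, and genre.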