International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 03 | Mar 2020 www.irjet.net p-ISSN: 2395-0072
© 2020, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 4464

Movie Genre Prediction from Plot Summaries by Comparing Various Classification Algorithms

Aziz Rupawala 1, Dhruv Pujara 2, Mustakim Shikalgar 3, Ekta Ukey 4

1,2,3 B.E. student, Dept. of Computer Engineering, PHCET, Maharashtra, India
4 Professor, Dept. of Computer Engineering, PHCET, Maharashtra, India

Abstract - Audiences often read a film's plot summary before watching it. A plot summary conveys not only the setting of a film but also its genre. In this work, we classify movies by genre using their plot summaries, covering a wide range of genres. We compare several classification algorithms, namely Stochastic Gradient Descent (SGD), Multinomial Naive Bayes, Random Forest, and Logistic Regression, to find the best possible result. Automating this task can save a great deal of time and manual effort.

Key Words: Movie Genre classification, Multinomial Naive Bayes, Logistic regression, Random Forest, Stochastic Gradient Descent

1. INTRODUCTION

Movie plot summaries reflect the genre of a film, such as romance, drama, or comedy, in a way that lets readers easily grasp the category of the film. Classifying movies has long been a burden for watch-list creators, who had to read the entire plot of each movie and determine its genre manually. In the literature, several works carry out film genre classification using a variety of sources such as audio [1], video, and text from posters [2] and summaries [3].
A plot summary can therefore be used to infer the genre of the film to which it belongs, which makes this approach useful when training models on film plots. After making predictions, the genre classification algorithms select the best-suited results and plot a graph of the genres with which the movie is associated. Although the work revolves around genre classification, its main focus is selecting the best-suited algorithms and comparing their classification capabilities. Each algorithm contributes differently to the work and to the output.

2. METHODOLOGY

2.1 Corpus construction and Preprocessing

The corpus "movies_metadata.csv" used for the model was obtained from Kaggle (https://www.kaggle.com). The corpus consists of around 45,000 distinct items containing movie plots and the genres associated with the respective movies, along with 17 redundant features. Of the 18 genres present in the corpus, we used 12 for training the model. The main reason for eliminating the remaining genres is that they had too few data points to make proper predictions. Table -1 shows the statistics of the corpus used.

Table -1: Distribution of the value counts for every genre

Genre          Count
Drama          11966
Comedy          8820
Action          4489
Documentary     3415
Horror          2619
Crime           1685
Thriller        1665
Adventure       1514
Romance         1191
Animation       1124
Fantasy          704
Total          40393

After data collection, we removed the redundant rows that are not required for training the model. We then converted the plot texts to lowercase and removed all null values from the corpus. In addition, with the help of NLTK (see footnote 1), we discarded all abbreviations and stop words that are irrelevant to training the model. The result is a cleaned corpus consisting of the relevant features required for training: movie name, plot/summary, and genre.
1 https://www.nltk.org/
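
The preprocessing steps of Section 2.1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in this work. The field names ("title", "plot", "genre"), the `min_count` threshold, and the tiny `STOP_WORDS` set are hypothetical; the stop-word set stands in for the English list shipped with NLTK. The sketch also does not treat abbreviations separately.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for NLTK's English stop-word list
# (nltk.corpus.stopwords.words("english")), which needs a one-time download.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "on", "with"}


def clean_plot(text):
    """Lowercase a plot summary, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def preprocess(rows, min_count=2):
    """Drop rows with a missing plot or genre, clean the plots, and keep only
    genres with at least `min_count` examples (hypothetical threshold)."""
    complete = [r for r in rows if r.get("plot") and r.get("genre")]
    counts = Counter(r["genre"] for r in complete)
    return [
        {"title": r["title"], "plot": clean_plot(r["plot"]), "genre": r["genre"]}
        for r in complete
        if counts[r["genre"]] >= min_count
    ]


# Toy records (invented for illustration, not taken from the corpus).
rows = [
    {"title": "A", "plot": "The detective hunts a killer in the city.", "genre": "Crime"},
    {"title": "B", "plot": "A heist goes wrong.", "genre": "Crime"},
    {"title": "C", "plot": None, "genre": "Drama"},       # null plot: dropped
    {"title": "D", "plot": "Wizards duel.", "genre": "Fantasy"},  # rare genre: dropped
]
cleaned = preprocess(rows)
```

In this toy run, record C is removed because its plot is null and record D is removed because its genre has too few examples, mirroring the elimination of under-represented genres described above.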