2022 International Conference on Data Science, Agents & Artificial Intelligence (ICDSAAI)
979-8-3503-3384-8/22/$31.00 © 2022 IEEE

Analysis of Spam Messages Using Various Machine Learning Classifier

Nagaraj. P, Department of Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, India, nagaraj.p@klu.ac.in
Gopal. R, Department of Information Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam, Erode, India, gopalr@bitsathy.ac.in
Sunethra B, Department of Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, India, sunethraboganatham9@gmail.com
Sumathi. R, Department of Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, India, r.sumathi@klu.ac.in
Muneeswaran. V, Department of Electronics and Communication Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, India, munees.klu@gmail.com
Vignesh. K, Department of Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, India, vignesh.k@klu.ac.in

Abstract

Background: As the number of people using social media increases, the amount of data generated also increases, and that data may be safe or unsafe. In applications such as Twitter and email, we receive many emails and tweets that contain both useful and dangerous content. To stay safe from these threats, we need a filter that separates useful messages from spam and keeps users from falling into a trap. One approach to doing this is explained in this paper. The algorithm followed is the Naïve Bayes classifier. The paper also compares Naïve Bayes, KNN, and Logistic Regression on the same spam-filtering problem, and considers term frequency-inverse document frequency (TF-IDF).
Keywords: Machine Learning, Naïve Bayes Classifier, Bag of Words, K-Nearest Neighbours, Logistic Regression

I. INTRODUCTION

Spam can arrive in any form: through messages, through email, or through SMS. It is growing as the number of internet users increases. Most of the spam or otherwise useless information we receive comes through the internet, since nearly all applications now work online. Spam comes in many types: spam that attacks devices and spreads viruses, spam that tries to steal money, spam that fools users by spreading wrong information, spam that lures users with false information, and more [1]. The people who generate spam have also become sophisticated: they craft it so that simply clicking on a link, message, or email spreads malware automatically, leaving us no chance even to read it and check whether it is spam. In those situations we have no other option, and spam filtering helps by segregating spam and protecting us from danger [2]. It is therefore very helpful in avoiding the dangers carried by such data.

As technology grows day by day, it brings both boons and banes. We can do our work faster and share our views, and tasks become simpler. Email plays a major role in our lives, and today hardly anyone does not use it. Emails act as an interface through which people communicate, interact, and share their views. Most official business is conducted by email. Many industries and organizations use email services to communicate with their employees, and email usage has increased; during the pandemic, people of all ages, from children to the elderly, have been using email. Spam can affect their mailboxes, and people who are not aware of this kind of deception may believe that spam emails are useful and harmless.
For such people, spam filtering helps a great deal by making them aware that emails marked as spam are dangerous, can harm their devices, and in some cases can cause financial loss. Many people are affected by spam. Raising awareness alone is not enough, because sometimes simply clicking on a link causes losses we never anticipated [2]. By classifying such messages as spam, we can signal to the user that they are spam and that care is needed when opening them. This does not solve the problem completely, but to some extent it helps users escape these threats and dangers.

In this paper, spam detection is performed on the spam/ham dataset. The paper also gives an overview of the algorithms that can be used for classification. Three algorithms are used. The first is the Naïve Bayes classifier, which has been reported in many papers and works very well for spam filtering; it is one of the best algorithms for this task. The second is the nearest neighbours classifier, which also gave good results, though not as good as Naïve Bayes; this is explained in detail in the paper. The third is Logistic Regression, which gives a line (a decision boundary) that separates spam from non-spam text.

DOI: 10.1109/ICDSAAI55433.2022.10028952
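As a rough illustration of the first of the three classifiers named in the introduction, the following minimal sketch trains a bag-of-words Naïve Bayes classifier with add-one smoothing. It is not the authors' implementation: the toy messages are hypothetical and are not the spam/ham dataset used in the paper, and the names `tokenize`, `train_naive_bayes`, and `predict` are assumptions introduced only for this example. KNN and Logistic Regression are not shown.

```python
import math
from collections import Counter

# Hypothetical toy corpus; NOT the spam/ham dataset used in the paper.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money click this link", "spam"),
    ("claim your free prize today", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("please send the project report", "ham"),
    ("see you at the meeting tomorrow", "ham"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple bag of words)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count word frequencies per class and messages per class."""
    word_counts = {}          # label -> Counter of word frequencies
    class_counts = Counter()  # label -> number of training messages
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: fraction of training messages with this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            # Smoothed log likelihood of the word given the class.
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict("free prize link", model))
print(predict("project meeting tomorrow", model))
```

On this toy data the first message is scored as spam and the second as ham, because each contains only words seen in one class. A real evaluation would need a held-out test split and a more careful tokenizer.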