Fake and Hyper-partisan News Identification

Vlad Cristian Dumitru
University Politehnica of Bucharest
Bucharest, Romania
d.vladk@gmail.com

Traian Rebedea
University Politehnica of Bucharest
Bucharest, Romania
traian.rebedea@cs.pub.ro

ABSTRACT
Fake news is a phenomenon that has grown considerably in recent years. Its spread is facilitated by social media platforms, which have a major influence on people's lives. In this context, automated solutions are sought to detect fake news circulating online. In this paper, we develop models designed to detect fake and hyper-partisan news using various machine learning algorithms. We describe the features used and their importance in identifying fake news, and we train different classifiers on several dataset distributions in order to compare their results. In addition to the results obtained on existing datasets in the field, we also test the ability of the models to work in a real environment. To this end, we collected news URLs from Twitter, ran the models on these data, and compiled statistics on the number of fake news articles found by each classifier. We also give examples of articles from Twitter that the models flagged and that we considered fake or hyper-partisan after manually analyzing their content.

Author Keywords
Fake news, Hyper-partisan news, Linguistic features, Natural language processing, Machine learning

ACM Classification Keywords
I.2.7 Natural Language Processing

INTRODUCTION
Misinformation, hiding the truth, or presenting information in an influential manner in order to serve certain interests are tactics with roots well established throughout history. Examples include the famous Trojan horse, which proved decisive in winning a war, and the 1835 publications on the discovery of life on the Moon, which turned The Sun into one of the world's most popular newspapers [1].
Nowadays, with the development of the Internet and technology, news tends to circulate largely in the online environment. Recent studies show that online sources in some cases exceed the power of television to transmit information, while traditional newspapers have lost considerable ground to online news [2]. This happens because people's habits have changed: they find it quicker and cheaper to search for news online than to watch a news segment on TV or to buy a newspaper. Social media platforms also play an important role in this change. Their users can easily share news, comment, and talk freely about the various stories circulating in their online circle of friends. However, these advantages of social networks also make them a vulnerable point that can be exploited to disseminate fake news.

Although fake news has several subcategories, such as satire, parody, or clickbait, a general definition of the term could be the following: fake news is a way of spreading false information in order to mislead the public, damage the reputation of an entity, or obtain political or financial gain. The idea of misleading and influencing the public is also linked to the notion of hyper-partisan news, whose role is to present extremist or conspiratorial opinions containing intentional misconceptions.

Given that the phenomenon of online fake news has been growing sharply lately, especially since the 2016 US presidential election campaign, attempts are being made to find automated solutions for identifying such news articles circulating online. In most cases, these approaches build machine learning models that process linguistic indicators of the texts. In this context, our paper proposes machine learning models such as Support Vector Machines (SVM), Random Forests (RF), and Logistic Regression (LR) that detect fake and hyper-partisan news.
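The pipeline described above (text features fed to SVM, Random Forest, and Logistic Regression classifiers) can be sketched roughly as follows. This is a minimal illustration using scikit-learn with TF-IDF bag-of-words features and a tiny made-up corpus; it is not the paper's actual feature set, data, or hyperparameters.

```python
# Hedged sketch: training the three classifier families named in the text
# (SVM, Random Forest, Logistic Regression) on simple bag-of-words features.
# The toy corpus and labels below are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

train_texts = [
    "shocking secret the media will never tell you",
    "you won't believe what they are hiding from us",
    "the city council approved the new budget on tuesday",
    "researchers published the study in a peer-reviewed journal",
]
train_labels = [1, 1, 0, 0]  # 1 = fake/hyper-partisan, 0 = mainstream

# Turn raw text into unigram+bigram TF-IDF vectors.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(train_texts)

models = {
    "svm": LinearSVC(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "lr": LogisticRegression(),
}
for name, model in models.items():
    model.fit(X, train_labels)

# Classify an unseen headline with each trained model.
sample = vectorizer.transform(["shocking secret they are hiding"])
predictions = {name: int(m.predict(sample)[0]) for name, m in models.items()}
print(predictions)
```

In practice, the paper relies on richer linguistic features rather than plain bag-of-words, and the models would be evaluated with the metrics discussed later; the sketch only shows how the three classifier families share a common fit/predict interface.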
For training the models, we used several datasets from the field. To verify the correctness of the obtained models, we used various metrics showing their ability to classify the input data correctly. Although datasets may have a certain theme that influences the ability of a model to classify an item properly [3], the obtained classifiers were also tested in a real working environment. For this, we used the Twitter platform to collect news URLs found in various tweets, which were then used as input data for the models. We conducted a statistical analysis of the number of news items detected by each model depending on the value of the decision threshold used to decide whether a news item is fake. Finally, we present some examples of fake and hyper-partisan news detected by the models in the Twitter data.

In this paper, our goals are to obtain classifiers with good metrics, to identify linguistic features that can be used to differentiate a real article from a fake or hyper-partisan one, and to check whether the existing datasets can be used to obtain models capable of predicting fake news in real time.

The paper continues with a section that describes the current state of the fake news phenomenon and the systems and approaches studied so far. The following section

Proceedings of RoCHI 2019
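The threshold analysis described above can be illustrated with a small sketch: given a model's predicted probability that each collected article is fake, count how many articles are flagged at several candidate decision thresholds. The probability values below are invented for illustration and are not the paper's results.

```python
# Hedged sketch of the decision-threshold statistics: for each threshold,
# count how many articles a model would flag as fake. The probabilities
# here are made up; in the paper they would come from the classifiers.
probabilities = [0.92, 0.71, 0.55, 0.40, 0.33, 0.88, 0.15, 0.62]

def flagged_count(probs, threshold):
    """Number of articles whose fake-probability meets the threshold."""
    return sum(p >= threshold for p in probs)

for threshold in (0.5, 0.7, 0.9):
    print(f"threshold={threshold}: {flagged_count(probabilities, threshold)} flagged")
```

Raising the threshold trades recall for precision: fewer articles are flagged, but each flag is more confident, which matters when the downstream check is a costly manual review.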