Journal on Interactive Systems, 2023, 14:1, doi: 10.5753/jis.2023.3020
This work is licensed under a Creative Commons Attribution 4.0 International License.

Fake news detection: a systematic literature review of machine learning algorithms and datasets

Humberto Fernandes Villela [ Universidade FUMEC | humberto.villela@gmail.com ]
Fábio Corrêa [ Universidade FUMEC | fabiocontact@gmail.com ]
Jurema Suely de Araújo Nery Ribeiro [ Universidade FUMEC | jurema.nery@gmail.com ]
Air Rabelo [ Universidade FUMEC | air@fumec.br ]
Dárlinton Barbosa Feres Carvalho [ Universidade Federal de São João del-Rei | darlinton@acm.org ]

Abstract

Fake news (i.e., false news created with malicious intent and a high capacity for dissemination) is a problem of great interest to society today, since it has achieved unprecedented political, economic, and social impacts. Taking advantage of modern digital communication and information technologies, fake news is widely propagated through social media, where its use is intentional and challenging to identify. To mitigate the damage caused by fake news, researchers have sought to develop automated mechanisms to detect it, such as algorithms based on machine learning, together with the datasets employed in this development. This research aims to analyze the machine learning algorithms published in the literature for identifying fake news and the datasets used to train them. It is exploratory research with a qualitative approach, which uses a research protocol to identify the studies to be analyzed. The results point to the Stacking Method, Bidirectional Recurrent Neural Network (BiRNN), and Convolutional Neural Network (CNN) algorithms, which reached 99.9%, 99.8%, and 99.8% accuracy, respectively. Although these accuracy values are high, most of the studies employed datasets from controlled environments (e.g., Kaggle) or datasets without information updated in real time (from social networks).
Still, only a few studies have been applied in social network environments, where the most significant dissemination of disinformation occurs nowadays. Kaggle was the platform whose datasets were used most frequently, followed by Weibo, FNC-1, COVID-19 Fake News, and Twitter. Future research should extend beyond political news, the area that primarily drove the growth of research from 2017 onward, and should explore hybrid methods for identifying fake news.

Keywords: Algorithms, datasets, accuracy, fake news, artificial intelligence.

1 Introduction

The term fake news is on the rise because this type of news can strongly influence society, with significant political, economic, or social impacts (Zhang et al., 2016; Islam et al., 2020). Fake news consists of false news items created to spread widely, usually with malicious intent (i.e., to deceive or to create ambiguity or falsehood).

In the political context, specifically, Almeida et al. (2021) point out the impacts of fake news on the 2016 US presidential election, in which Donald Trump was elected president. In Brazil, similar impacts were attributed to the election of President Jair Bolsonaro in 2018. Because many Brazilian voters rely on social media as their primary source of news, this channel is fertile ground for the proliferation of fake news (Almeida et al., 2021).

Given the impact this type of news has had on society, researchers have been seeking ways to detect it. Using algorithms for the automated identification of fake news is a promising line of research. The quality of these algorithms is commonly verified by accuracy, which is the measure of correctness in classifying whether a news item is true or false (Chapra & Canale, 2016). That is, accuracy expresses how often the algorithm is correct when detecting fake news.
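As an illustration of this metric, the following minimal Python sketch computes accuracy as the fraction of news items whose predicted label matches the true label. The function name `accuracy`, the label strings "fake" and "real", and the five example items are assumptions made for this illustration only; they are not taken from any of the reviewed studies.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of items whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labeled item is required")
    # Count the positions where prediction and ground truth agree.
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical example: five news items, four of them classified correctly.
labels = ["fake", "real", "fake", "real", "fake"]
predictions = ["fake", "real", "real", "real", "fake"]
print(accuracy(labels, predictions))  # 0.8
```

In this hypothetical case, one fake item is misclassified as real, so four of the five items are correct and the accuracy is 80%.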
However, the quality of an algorithm is directly related to its specific purpose: it is restricted to a language or news style, that is, to the datasets used for training (Ahuja & Kumar, 2020). For this reason, Jiang et al. (2021) and Ahuja and Kumar (2020) recommend continuing research with varied datasets, for example in languages other than English, which is the most used.

Nevertheless, successive studies on fake news detection have achieved notable results. The research by Burfoot and Baldwin (2009) reached 71% accuracy in detecting fake news, while Ahmed, Traore, and Saad (2017) raised this metric to 87%, the same percentage achieved by Low et al. (2022).

Medeiros and Braga (2020) conducted research that aimed, among other goals, to identify the accuracy achieved by algorithms in detecting fake news. Across the 32 studies examined, the proposed algorithms show accuracies ranging from 73.7% to 98.0%. This result highlights the relevance of the theme and the constant search for more accurate detection of fake news, in order to minimize its impacts in the aforementioned political, economic, and social contexts, among others.

Accordingly, this research is grounded on the recommendations of Ahmed, Traore, and Saad (2017), Ahmad et al. (2020), Agarwal et al. (2020), Aslam et al. (2021), and Jiang et al. (2021) regarding the use of algorithms to identify fake news. Therefore, the objective of this investigation is to analyze the accuracy obtained and the datasets used in fake news identification algorithms.

This study intends to contribute by highlighting more successful algorithms and datasets in identifying fake news