High Dimensional Latent Space Variational AutoEncoders for Fake News Detection

Saad Sadiq, Nicolas Wagner, Mei-Ling Shyu
Department of Electrical and Computer Engineering
University of Miami
Coral Gables, FL, USA
Emails: {s.sadiq, nxw178, shyu}@miami.edu

Daniel Feaster
Department of Public Health Sciences
University of Miami - Miller School of Medicine
Miami, FL, USA
dfeaster@med.miami.edu

Abstract—With the advent of social media and cell phones, news now reaches farther and is more impactful than ever before. This comes with an exponential increase in fake news, which blurs the lines of reality and holds the power to sway public opinion. To counter the impact of fake news, several research groups have developed novel algorithms that can fact-check news as a human would. Unfortunately, natural language processing (NLP) is a complicated task because of the underlying hidden meanings in human communication. In this paper, we propose a novel method that builds a latent representation of natural language to capture its underlying hidden meanings accurately and classify fake news. Our approach connects the high-level semantic concepts in the news content with their low-level deep representations so that complex news text consisting of satire, sarcasm, and purposefully misleading content can be translated into quantifiable latent spaces. This allows us to achieve very high accuracy, surpassing the scores of all winners of the fake news challenge.

Keywords-Fake news; latent representation; VAE (Variational AutoEncoder); LSTM (Long Short Term Memory);

I. INTRODUCTION

In recent years, "fake news" has become a buzzword used as a descriptor for publications or statements. These publications contain false or highly misrepresented facts used to manipulate people's opinions or perceptions. Such articles spread quickly through social networks such as Facebook, Twitter, and Reddit, where users do not research their truthfulness [1], as seen in Figure 1.
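The latent-space idea named in the abstract and keywords rests on the Variational AutoEncoder's reparameterization step, which maps an input feature vector to a sample from a learned Gaussian latent distribution. The sketch below is only a toy numerical illustration of that step, not the authors' implementation; all dimensions, weights, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Map an input vector to the parameters (mean, log-variance)
    of a diagonal-Gaussian latent posterior."""
    mu = W_mu @ x
    logvar = W_logvar @ x
    return mu, logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, keeping the sample a deterministic
    function of (mu, logvar) so gradients can flow in a trained VAE."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Toy 8-dimensional "text feature" vector mapped into a 3-dimensional latent space.
x = rng.standard_normal(8)
W_mu = rng.standard_normal((3, 8)) * 0.1
W_logvar = rng.standard_normal((3, 8)) * 0.1

mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
print(z.shape)  # (3,)
```

In a full VAE the weights are learned by maximizing a reconstruction term plus a KL-divergence regularizer; here they are random, so the example only shows the shape of the computation.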
Since language and communication are fluid concepts, the content and meaning of information can be conveyed and interpreted in a variety of ways depending on the reader or writer. Even words that share the same dictionary definition can carry different connotations depending on the context. As noted in [2], there are further challenges in how meaning is extracted from text, such as syntax (grammar), morphology (prefixes and suffixes), and pragmatics (the domain knowledge needed to interpret isolated pieces). As described in [3], sarcasm is a form of "indirect speech" that is difficult to identify and classify because it builds on common idioms, plays on words, and paradoxes. Sarcasm plays an important role in identifying fake news, as it marks opinionated writing with a higher potential for distorting or misrepresenting factual information. Thus, it is necessary to model a statistical method wherein, rather than accounting for detail-sensitive rules, we enlist broader directives and let the sampling of clustered data points handle disambiguation [4].

Figure 1. Evolution of Fake News on Social Media

The remainder of the paper is organized as follows. Section II covers existing methods for fake news detection. The proposed framework and its underlying methods are presented in Section III, followed by the experiments and results in Section IV. Finally, Section V concludes the paper with a discussion of future directions.

II. PREVIOUS WORK

In many real-world multimedia applications such as news, blogs, and social media, large quantities of unlabeled data are generated every day, giving rise to the challenging semantic gap problem [5]–[12], i.e., reducing the gap between high-level semantic concepts and their low-level features [13]–[17]. Despite rigorous research endeavors, this remains one of the most challenging problems in the information sciences, particularly for text data.
Moreover, most data instances belong to only a few classes (i.e., the majority classes), while far fewer belong to the minority classes. The minority classes, however, often represent the unusual and interesting events with high entropy values [18]–[20]. In content-based information retrieval applications, most classifiers are modeled by exploring data statistics. Hence, they may be biased towards the majority classes