Belief Index for Fake COVID-19 Text Detection

S. Panja, A. P. James
Centre for Artificial General Intelligence and Neuromorphic Systems (NeuroAGI), Indian Institute of Information Technology and Management - Kerala (IIITM-K); Email: apj@ieee.org

Abstract—The growing number of news articles on communication platforms and social media has increased the possibility of spreading non-factual or fake information. The sheer volume of news, and the difficulty of verifying its veracity, make it practically impossible to manually fact-check each item and label it as true or false. Under such circumstances, we propose a belief index generator model that quantifies the belief to be associated with any arbitrary piece of information using text-analytic proximity measures. In the initial feature engineering step, we use a modified TF-IDF algorithm. After the generation of word embeddings, various distance measures are proposed and compared as candidate belief scores. The analysis has been carried out using 50K research articles on COVID-19 to validate truths, and The CoronaVirusFacts/DatosCoronaVirus Alliance Database to validate falsities, in arbitrary COVID-related information.

Index Terms—Fake news, Belief Index, Cosine Similarity, TF-IDF, Jaccard Similarity, COVID-19

I. INTRODUCTION

Fake news detection has so far been carried out [1] as a two-step process following preliminary pre-processing. The first step is feature engineering, which has mostly involved generating word embeddings from raw text using text vectorization techniques such as Word2Vec [2], FastText, TF-IDF, and GloVe. This is followed by text classification using models trained mainly on labelled data. Variations in the vectorization procedures and classification models applied to the task, and in their combinations, have resulted in different accuracies, even when the labelled dataset stays the same.
However, the consistent difficulty throughout has been the infeasibility of labelling an almost infinite bulk of data with 'true' or 'false' tags so that a model can objectively and correctly predict the tag of any supplied piece of text. The approach so far has been the discrete, binary labelling of text, and the missing piece in the whole process has been an effective use of the distance metric on which the classification is based. We put forward a three-step implementation in which, after the feature engineering [3] and classification phases, the similarity between the vector of the supplied text and the nearest neighbour identified during the classification step is calibrated. Experimentally, we observed that for any supplied text the nearest neighbour, i.e. the one at the least Euclidean distance, has the maximum similarity score. The solution is therefore to pick the maximum similarity score among the similarity measures computed for all pairs of the supplied text and each labelled document. Since this maximum score signifies the closeness of the given information to an already labelled one, the degree of that closeness can safely be taken as the amount of belief or disbelief (depending on the classification) to be placed in that arbitrary text. Thus, in addition to a discrete tag of truth or falsity, we quantify how strongly the information supports the tag, which largely removes the need to label every possible piece of data in order to reach a deterministic conclusion. For the creation of word embeddings we use the TF-IDF vectorizer, and for the generation of belief scores we use cosine similarity.

II. TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) VECTORIZER

Term frequency (TF) is a measure of how important [4] a term is to a document.
The tf of the i-th term in a document j is defined as:

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}   (1)

where n_{i,j} is the number of occurrences of the term in document d_j and \sum_k n_{k,j} is the number of occurrences of all terms in document d_j. The inverse document frequency (IDF) is a measure of the general importance of the term in a corpus of documents, calculated by dividing the total number of documents by the number of documents containing the term. In a large corpus this ratio explodes, so taking the logarithm dampens the effect:

idf_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|}   (2)

where |D| is the total number of documents in the corpus and |\{d_j : t_i \in d_j\}| is the number of documents containing the term t_i. Then

tfidf_{i,j} = tf_{i,j} \cdot idf_i.   (3)

Thus, for all documents in the corpus, each term is assigned a tf-idf score [5][6]. Clearly, the occurrence of a word in a document and its occurrence across the corpus are, respectively, directly and inversely proportional to its tf-idf score for that document. The score therefore reflects the specific importance [7][8] of the word for the document, and it is highest when the word is rarely, or never, used elsewhere. Consequently, we consider an array whose size is fixed to that of the entire vocabulary across all documents. Each position in the array holds the tf-idf score of a fixed word with respect to the current document. The resulting array gives the word embedding, i.e. the vector representation, of the document.
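The TF-IDF scheme of Eqs. (1)-(3), combined with the cosine-similarity belief score described in the introduction, can be sketched in plain Python. The toy corpus, tags, and helper names below are illustrative assumptions for the sketch, not the paper's actual data or implementation:

```python
import math

# Toy labelled corpus: each document carries a true/false tag.
# These documents and tags are placeholders, not drawn from the
# paper's COVID-19 corpora.
docs = [
    "masks reduce virus transmission",
    "hot water cures the virus",
    "vaccines passed clinical trials",
]
tags = [True, False, True]

def tokenize(text):
    return text.lower().split()

corpus_tokens = [tokenize(d) for d in docs]
vocab = sorted({w for toks in corpus_tokens for w in toks})

def tf(term, tokens):
    # Eq. (1): occurrences of the term / occurrences of all terms.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Eq. (2): log(|D| / number of documents containing the term).
    df = sum(1 for toks in corpus_tokens if term in toks)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_vector(tokens):
    # Eq. (3): one tf-idf score per vocabulary word -> document embedding.
    return [tf(w, tokens) * idf(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

doc_vectors = [tfidf_vector(toks) for toks in corpus_tokens]

def belief_index(text):
    # Maximum cosine similarity against all labelled documents;
    # the nearest neighbour's tag decides belief vs disbelief.
    q = tfidf_vector(tokenize(text))
    scores = [cosine(q, dv) for dv in doc_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], tags[best]

score, tag = belief_index("hot water cures everything")
print(score, tag)
```

Taking the maximum cosine similarity corresponds to the nearest-neighbour choice noted in the introduction: the labelled document closest to the query dominates the score, and the magnitude of that score serves as the degree of belief (or disbelief) in the query text.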