Arabian Journal for Science and Engineering
https://doi.org/10.1007/s13369-018-3232-0

RESEARCH ARTICLE - COMPUTER ENGINEERING AND COMPUTER SCIENCE

Multi-corpus-Based Model for Measuring the Semantic Relatedness in Short Texts (SRST)

Reem El-Deeb 1 · Aya M. Al-Zoghby 1 · Samir Elmougy 1

Received: 17 October 2017 / Accepted: 26 March 2018
© King Fahd University of Petroleum & Minerals 2018

Abstract
Semantic Relatedness (SR) defines a relation between linguistic items. These items could be words, phrases, or documents. It has many interesting applications, such as information extraction, word sense disambiguation, text summarization, and text clustering. Quantifying SR is fairly natural and self-evident for humans, whereas doing so automatically is complex, because computational methods lack the background experience and external domain concepts that humans draw on. This paper focuses on the Semantic Relatedness in Short Texts (SRST). A multi-corpus-based Vector Space Model is proposed to measure the SRST. Word synonyms and anaphoric information are used to improve the semantic representation of the document. Since the verses of the Holy Quran are a precious sample of short texts, they are used as the main case study in this paper, which measures the degree of relatedness between these verses. Experiments were conducted, and their results demonstrate the efficiency of the proposed model in improving SR measurement. The results show that recall improves to 60%, compared with 11.3% for the best previous studies.

Keywords Text similarity · Semantic similarity · Similarity measurement · The Holy Quran · Arabic language · Short texts relatedness

1 Introduction

Semantic Relatedness (SR) is a general concept that includes Semantic Similarity (SS). Any two entities having an “is-a” relationship are semantically similar. Semantically related entities are those that have any associative relationship (see footnote 1) [1].
However, much of the literature uses the two concepts (SS and SR) interchangeably; in this work, the concept of SR is the more appropriate one. The relatedness measurement process aims to quantify how close two entities are to each other: one entity may cause or complete the meaning of the other, or the two may share any other associative relationship. It is involved in various applications of Natural Language Processing (NLP) and Knowledge Engineering.

Relatedness measures can be divided into syntactic and semantic measures. Syntactic measures take into consideration the sequencing and order of words or characters when comparing linguistic units (e.g., words, sentences, paragraphs, documents). Semantic measures, on the other hand, overcome the limitations of the syntactic ones by comparing linguistic units according to their semantics. Semantic measures in general can be classified as corpus-based, knowledge-based, or hybrid. Corpus-based measures compare the linguistic units using unstructured semantic proxies, whereas knowledge-based measures use structured semantic proxies such as ontologies. Hybrid measures combine the two previous kinds [2].

✉ Reem El-Deeb
reemm_db@mans.edu.eg; reemm.db@gmail.com
Aya M. Al-Zoghby
aya_el_zoghby@mans.edu.eg; elzoghby.aya@gmail.com
Samir Elmougy
samirelmougy@yahoo.com
1 Department of Computer Science, Faculty of Computers and Information, Mansoura University, Mansoura 35516, Egypt

Footnote 1: For example, “car” is related to “bus,” “road” and “driving” but is only similar to “bus.”

Many efforts have focused on computing textual SR for use in a wide range of NLP applications [3]. For example, word similarity is used in paraphrase identification, as in [4]. Also, in text summariza-