Arabian Journal for Science and Engineering
https://doi.org/10.1007/s13369-018-3232-0
RESEARCH ARTICLE - COMPUTER ENGINEERING AND COMPUTER SCIENCE
Multi-corpus-Based Model for Measuring the Semantic Relatedness in
Short Texts (SRST)
Reem El-Deeb¹ · Aya M. Al-Zoghby¹ · Samir Elmougy¹
Received: 17 October 2017 / Accepted: 26 March 2018
© King Fahd University of Petroleum & Minerals 2018
Abstract
Semantic Relatedness (SR) defines a relation between linguistic items, which could be words, phrases, or documents.
It has many applications, such as information extraction, word sense disambiguation, text summarization, and text
clustering. Quantifying SR is fairly natural and axiomatic for humans, whereas doing so automatically is complex
because human background experience and external domain concepts are not available to computational methods.
This paper focuses on Semantic Relatedness in Short Texts (SRST). A multi-corpus-based Vector Space Model is
proposed to measure the SRST. Word synonyms and anaphoric information are used to improve the semantic
representation of the document. Since the verses of the Holy Quran are a precious sample of short texts, they are
used as the main case study in this paper, which measures the degree of relatedness between these verses. The
experimental results demonstrate the effectiveness of the proposed model in improving SR measurement: recall
improves to 60%, compared with 11.3% for the best previous studies.
Keywords Text similarity · Semantic similarity · Similarity measurement · The Holy Quran · Arabic language · Short texts
relatedness
1 Introduction
Semantic Relatedness (SR) is a general concept that includes
Semantic Similarity (SS). Any two entities having an “is-a”
relationship are semantically similar, whereas semantically
related entities are those that have any associative relationship¹ [1].
Although much of the literature uses the two concepts (SS
and SR) interchangeably, it is more appropriate in this work
to use the concept of SR.
The relatedness measurement process aims to quantify
how close two entities are to each other, for example whether
one entity causes or completes the meaning of the other, or
whether they share any other associative relationship.
Reem El-Deeb
reemm_db@mans.edu.eg; reemm.db@gmail.com
Aya M. Al-Zoghby
aya_el_zoghby@mans.edu.eg; elzoghby.aya@gmail.com
Samir Elmougy
samirelmougy@yahoo.com
¹ Department of Computer Science, Faculty of Computers and
Information, Mansoura University, Mansoura 35516, Egypt
Relatedness measurement is involved in various applications
of Natural Language Processing (NLP) and Knowledge
Engineering. Relatedness measures can be divided into
syntactic and semantic measures. Syntactic measures take
into consideration the sequencing and order of words or
characters when comparing linguistic units (e.g., words,
sentences, paragraphs, documents). Semantic measures, on
the other hand, overcome the limitations of the syntactic ones
by comparing linguistic units according to their semantics.
Semantic measures can in general be classified as corpus-based,
knowledge-based, or hybrid. Corpus-based measures compare
linguistic units using unstructured semantic proxies, while
knowledge-based measures use structured semantic proxies,
such as ontologies. Hybrid measures combine the two [2].
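The difference between a syntactic measure and a corpus-based semantic measure can be illustrated with a minimal Python sketch. It is not the model proposed in this paper: the toy corpus, the function names, and the choice of sentence-level co-occurrence counts with cosine similarity are all assumptions made only for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical toy corpus, used only to illustrate the idea.
CORPUS = [
    "the car drives on the road",
    "the bus drives on the road",
    "the driver parks the car",
    "the driver parks the bus",
    "the cat drinks milk",
]


def syntactic_similarity(a, b):
    """Syntactic measure: depends only on the order of characters."""
    return SequenceMatcher(None, a, b).ratio()


def context_vector(word, corpus):
    """Count the words that share a sentence with `word` in the corpus."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def corpus_similarity(a, b, corpus=CORPUS):
    """Corpus-based measure: compares the contexts in which words occur."""
    return cosine(context_vector(a, corpus), context_vector(b, corpus))


if __name__ == "__main__":
    for pair in [("car", "cat"), ("car", "bus")]:
        print(pair, syntactic_similarity(*pair), corpus_similarity(*pair))
```

In this toy setting the syntactic measure ranks "car" closer to "cat" than to "bus" because of the shared characters, while the corpus-based measure ranks "car" closer to "bus" because the two words occur in the same contexts.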
Many efforts have focused on computing textual SR for use
in a wide range of NLP applications [3]. One example is
the use of word similarity
in paraphrase identification as in [4]. Also, in text summariza-
¹ For example, “car” is related to “bus,” “road” and “driving” but is only
similar to “bus.”