INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 8, ISSUE 10, OCTOBER 2019 ISSN 2277-8616
IJSTR©2019
www.ijstr.org
Scientific Document Retrieval System Using
Signature Based Hash Indexing
Sourish Dhar, Sudipta Roy
Abstract: Scientific documents and magazines contain a large number of mathematical expressions and formulae along with text. The continuous growth
of such documents calls for specialized tools and techniques that can handle and analyze mathematical
expressions and formulae. Mathematical expressions and formulae are highly structured and quite different from traditional text. As a result,
conventional text retrieval systems perform poorly when retrieving scientific documents based on a mathematical expression formulated as a query.
Mathematical information retrieval is concerned with finding information in documents that include mathematics. To address the challenges posed by
mathematical formulae as compared to text, this paper aims to construct a math-aware search engine that can retrieve relevant scientific documents
based on a mathematical query. A novel signature-based hashing scheme for indexing raw mathematical web documents is proposed, which
can also take mathematical notational equivalences into account. The proposed system demonstrates better precision and stability of the ranked results
when compared with other related state-of-the-art math-aware search engines.
Index Terms: Mathematical Information Retrieval, Structure Encoded String, Signature Hashing, Formula Search Engine,
Presentation MathML
————————————————————
1 INTRODUCTION
Mathematics is a central constituent of the domain of
Science, Technology, Engineering and Mathematics (STEM).
It is needed in many spheres of research, education
and industry. Scientific documents without a single
mathematical expression (ME) or symbol are rare. In this
digital era, with more and more scientific documents being
generated, an information explosion was inevitable. To
store, manage and retrieve this vast number of scientific
documents, and thereby the mathematical expressions they
contain, novel strategies, principles and tools have been
developed in the last decade. The field of information
retrieval (IR) dates back to the early 1950s; as a result,
many IR models now exist, namely the Boolean model, the
Vector Space Model (VSM), the probabilistic model, etc.
However, the vector representation does not consider the
ordering of words in a document, which is a crucial factor
for MEs, and exact matching may retrieve too few
or too many documents [1,2]. The field of IR has been
explored exhaustively for many decades, but a distinct focus is
required for Mathematical Information Retrieval (MIR) because
conventional text retrieval systems are not suitable for
retrieving mathematical expressions [3,4]. As stated in [5],
“Mathematical Information Retrieval is concerned with finding
information in documents that include mathematics. This is
important for technical disciplines that use math frequently.
(e.g. Physics and Computer Science). Mathematical
Information Retrieval (MIR) systems are formula based search
engine. User information needs requires careful investigation
and good understanding to develop firm principles and
foundations in the area of MIR systems.” The order of the
terms in a mathematical expression (ME) is a crucial factor that
influences the semantics of the ME. At present, however, most
existing text-based MIR systems implement a bag-of-words
approach, so the order of the terms, and consequently the
structure of the ME, is lost. Furthermore, with the
aforementioned approach, most MIR systems have used an
inverted index with tf-idf ranking. Therefore, this paper
proposes an alternative indexing scheme, a signature-based
hash index for mathematical information retrieval, while
constructing a math-aware search engine: SigMa. Moreover,
we extend the concept of structure-encoded strings (SES)
to MathML documents to eliminate extraneous symbols such as
<mi>, <mo> etc. without losing the structure of an ME. This
paper constructs a math-aware search engine with an
alternative indexing approach based on signature
hashing, together with an implementation of structure-encoded
strings for mathematical expressions extended to MathML
documents. The alternative approach is
motivated by the fact that most of the systems discussed
above use a bag-of-words approach along with tf-idf
scores. The major bottleneck of this approach is the loss of
order, and thereby of the whole structure, which is a crucial
aspect of an ME. The paper is organized as follows. Related work is
discussed in Section 2. In Section 3 and its subsections, we
discuss our proposed approach. Section 4 discusses
the experimental results, and in Section 5 we conclude our
discussion.
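The loss of order under a bag-of-words representation can be illustrated with a minimal sketch. The token lists below are hypothetical, hand-written tokenizations of x^2 and 2^x, and the helper bag_of_words is an illustrative name; none of this is taken from the SigMa implementation.

```python
from collections import Counter


def bag_of_words(tokens):
    """Return the multiset of tokens, discarding their order."""
    return Counter(tokens)


# Hypothetical token streams for two different expressions (assumed tokenization).
x_squared = ["x", "^", "2"]   # x^2
two_to_x = ["2", "^", "x"]    # 2^x

# The two bags are identical even though the expressions differ.
same_bag = bag_of_words(x_squared) == bag_of_words(two_to_x)

# An order-preserving comparison still tells them apart.
same_sequence = x_squared == two_to_x

print(same_bag, same_sequence)
```

Any index built only on such bags cannot separate the two expressions, which is the motivation for an order-preserving encoding.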
2 PRELIMINARIES AND RELATED WORK
Classically, information retrieval (IR) models can be classified
into three broad categories, namely set-theoretic, algebraic and
probabilistic models [1,6].
2.1 Set Theoretic Model
Documents are modeled as sets of the terms that
they contain. The standard set-theoretic operations
are then used to derive similarities. Based on the foundations of
set theory and Boolean algebra, the Standard Boolean Model was
derived, in which connectives such as ∧, ∨ and ¬ are used to issue
the query in conjunction with the key terms [7]. Although it is
a very simple and efficient model to implement, it has
some limitations. Firstly, it fails to retrieve results with a partial
match, and secondly, general users find it very difficult to form
complex queries. For these reasons, its results tend to show
either high precision with low recall or low precision
with high recall. The strict Boolean and fuzzy-set models are
preferable to other models in terms of computational
requirements [8].
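The behaviour described above can be sketched with plain set operations. The three-document collection, the whitespace tokenization and the helper names build_index and postings are assumptions made for illustration only; they are not part of any system discussed in this paper.

```python
def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical toy collection.
docs = {
    1: "fourier transform integral",
    2: "laplace transform",
    3: "integral calculus",
}
index = build_index(docs)
all_ids = set(docs)


def postings(term):
    """Return the ids of documents containing the term (empty set if unseen)."""
    return index.get(term, set())


# transform AND integral: only exact matches, partial matches (2 and 3) are dropped.
and_result = postings("transform") & postings("integral")

# transform OR integral: every partial match is returned, without any ranking.
or_result = postings("transform") | postings("integral")

# NOT transform: complement with respect to the whole collection.
not_result = all_ids - postings("transform")

print(and_result, or_result, not_result)
```

The conjunctive query returns a small, exact result set, while the disjunctive query returns every document that shares any term, which mirrors the precision and recall trade-off noted above.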
————————————————
Sourish Dhar is currently working as Assistant Professor in the
Department of CSE at Assam University, Silchar, India, PH-9435177322.
E-mail: dharsourish@gmail.com
Sudipta Roy is currently working as Professor in the Department of
CSE at Assam University, Silchar, India, PH-9864311494. E-mail:
sudipta.it@gmail.com