INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 8, ISSUE 10, OCTOBER 2019 ISSN 2277-8616
IJSTR©2019 www.ijstr.org

Scientific Document Retrieval System Using Signature Based Hash Indexing

Sourish Dhar, Sudipta Roy

Abstract: Scientific documents and magazines contain a large number of mathematical expressions and formulae along with text. The continuous growth of such documents calls for specialized tools and techniques that can handle and analyze mathematical expressions and formulae. Mathematical expressions and formulae are highly structured and quite different from traditional text. As a result, conventional text retrieval systems perform poorly when retrieving scientific documents for a query formulated as a mathematical expression. Mathematical information retrieval is concerned with finding information in documents that include mathematics. To address the challenges posed by mathematical formulae as compared to text, this paper aims to construct a math-aware search engine that can retrieve relevant scientific documents based on a mathematical query. A novel signature-based hashing scheme for indexing raw mathematical web documents is proposed, which can also take mathematical notational equivalences into account. The proposed system demonstrates better precision and stability of the ranked results when compared with other related state-of-the-art math-aware search engines.

Index Terms: Mathematical Information Retrieval, Structure Encoded String, Signature Hashing, Formula Search Engine, Presentation MathML

1 INTRODUCTION
Mathematics is a very important constituent of the domain of Science, Technology, Engineering and Mathematics (STEM). Its need is felt in different spheres of research, education and industry. A scientific document without a single mathematical expression (ME) or symbol is rare.
In this digital era, with more and more scientific documents being generated, an information explosion was inevitable. To store, manage and retrieve this vast number of scientific documents, and thereby the mathematical expressions they contain, novel strategies, principles and tools have been developed over the last decade. The domain of information retrieval (IR) dates back to the early 1950s; as a result, many IR models now exist, namely the Boolean model, the Vector Space Model (VSM), the probabilistic model, etc. However, a vector representation does not consider the ordering of words in a document, which is a crucial factor for MEs, and exact matching may retrieve too few or too many documents [1,2]. The field of IR has been explored exhaustively for many decades, but Mathematical Information Retrieval (MIR) requires a distinct focus because conventional text retrieval systems are not suitable for retrieving mathematical expressions [3,4]. As stated in [5], "Mathematical Information Retrieval is concerned with finding information in documents that include mathematics. This is important for technical disciplines that use math frequently (e.g. Physics and Computer Science). Mathematical Information Retrieval (MIR) systems are formula based search engine. User information needs requires careful investigation and good understanding to develop firm principles and foundations in the area of MIR systems." The order of the terms in an ME is a crucial issue that influences its semantics, yet most existing text-based MIR systems implement a bag-of-words approach; as a result, the order of the terms, and consequently the structure of the ME, is lost. Furthermore, with this approach most MIR systems have used an inverted index with tf-idf ranking. Therefore, this paper proposes an alternative indexing scheme, a signature-based hash index for mathematical information retrieval, and uses it to construct a math-aware search engine: SigMa.
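The loss of order under a bag-of-words representation can be illustrated with a minimal sketch. The symbol-by-symbol tokenization and the names `bag_of_words`, `expr_a` and `expr_b` are assumptions made for this illustration only; they are not taken from any of the MIR systems discussed here.

```python
from collections import Counter


def bag_of_words(tokens):
    """Return an order-free multiset (term -> count) of the given tokens."""
    return Counter(tokens)


# Two different expressions, tokenized symbol by symbol (hypothetical tokenization).
expr_a = ["x", "^", "2", "+", "y"]  # x^2 + y
expr_b = ["y", "^", "2", "+", "x"]  # y^2 + x

# The bags are identical because both expressions use the same symbols
# the same number of times, even though the expressions differ.
same_bag = bag_of_words(expr_a) == bag_of_words(expr_b)
same_sequence = expr_a == expr_b

print(same_bag, same_sequence)  # True False
```

Under this representation the two expressions are indistinguishable to the index, which is the motivation for an indexing scheme that preserves structure.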
We also extend the concept of structure-encoded strings (SES) to MathML documents in order to eliminate extraneous markup such as <mi>, <mo>, etc. without losing the structure of an ME. In summary, this paper constructs a math-aware search engine using an alternative indexing approach based on signature hashing, together with an implementation of structure-encoded strings for mathematical expressions extended to MathML documents. The alternative approach is motivated by the fact that most of the systems discussed above use a bag-of-words approach along with tf-idf scores. The major bottleneck of that approach is the loss of order, and thereby of the whole structure, which is a crucial aspect of an ME. The paper is organized as follows. Related work is discussed in Section 2. In Section 3 and its subsections, we discuss our proposed approach. Section 4 discusses the experimental results, and in Section 5 we conclude our discussion.

2 PRELIMINARIES AND RELATED WORK
Classically, information retrieval (IR) models can be classified into three broad categories, namely set-theoretic, algebraic and probabilistic models [1,6].

2.1 Set Theoretic Model
Documents are modeled as sets of the terms they contain. The standard set-theoretic operations are then used to derive similarities. The Standard Boolean Model is derived from the foundations of set theory and Boolean algebra: connectives such as ∧, ∨, ¬, etc. are used to issue the query in conjunction with the key terms [7]. Although it is a very simple and efficient model to implement, it has some limitations. Firstly, it fails to retrieve results that match only partially, and secondly, general users find it very difficult to form complex queries. For these reasons, its performance results in either high precision and low recall or low precision and high recall. The strict Boolean and fuzzy-set models are preferable to other models in terms of computational requirements [8].
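The Boolean model described above can be sketched as set operations over an inverted index. This is only an illustrative sketch: the toy documents, the whitespace tokenization and the helper names `build_index` and `postings` are assumptions introduced here, not part of any cited system.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def postings(index, term):
    """Return the set of document ids for a term (empty set if unseen)."""
    return index.get(term, set())


# Hypothetical toy collection.
docs = {
    1: "fourier transform of a signal",
    2: "laplace transform tables",
    3: "signal processing basics",
}
index = build_index(docs)

# transform AND signal: set intersection
both = postings(index, "transform") & postings(index, "signal")
# transform OR signal: set union
either = postings(index, "transform") | postings(index, "signal")
# transform AND NOT signal: set difference
only_transform = postings(index, "transform") - postings(index, "signal")

print(both, either, only_transform)  # {1} {1, 2, 3} {2}
```

The conjunctive query returns only document 1; documents 2 and 3 each match one of the two terms but are discarded entirely, which illustrates the lack of partial matching noted above.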

Sourish Dhar is currently working as Assistant Professor in the Department of CSE at Assam University, Silchar, India, PH-9435177322. E-mail: dharsourish@gmail.com
Sudipta Roy is currently working as Professor in the Department of CSE at Assam University, Silchar, India, PH-9864311494. E-mail: sudipta.it@gmail.com