IJRET: International Journal of Research in Engineering and Technology, eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 05 Issue: 02 | Feb-2016, Available @ http://www.ijret.org

META-DOCUMENTS AND QUERY EXTENSION TO ENHANCE INFORMATION RETRIEVAL PROCESS

Mounira Chkiwa 1, Jedidi Anis 2, Faiez Gargouri 3

1 University of Sfax, Multimedia Information Systems and Advanced Computing Laboratory, Tunisia, m.chkiwa@gmail.com
2 University of Sfax, Multimedia Information Systems and Advanced Computing Laboratory, Tunisia, anis.jedidi@isimsf.rnu.tn
3 University of Sfax, Multimedia Information Systems and Advanced Computing Laboratory, Tunisia, faiez.gargouri@isimsf.rnu.tn

Abstract
In this paper, we present two facets that are indispensable to the efficiency of our information retrieval system: meta-documents and query extension. A meta-document is a structure used to annotate the documents of our web collection, and query extension is an automatic process that enriches the expression of a user query with additional terms. The two facets are indirectly related, since the terms added to a given query are taken from an ontology based on meta-documents. The cooperation between meta-documents and query extension aims to enhance the information retrieval process. We present the particularities of our proposal and the evaluation results that show its efficiency.

Keywords: Information Retrieval, Meta-Document, Annotation, Query, Semantic Extension, OWL Ontology, Semantic Proximity.

I. INTRODUCTION
In the information retrieval context, two structures must be consistent for the process to succeed: the set of keywords chosen by the user to express an information need, and the set of keywords chosen by the system to annotate each document of its collection. The matching between the two structures is extremely important, since it determines whether or not relevant results are returned.

The rest of the paper is organized as follows. The next section presents the annotation of web documents by meta-documents. Section 3 presents our method of query extension. Section 4 presents the evaluation of our contribution using standard measures; this evaluation closes the development of our information retrieval system presented in [1, 2, 3]. Section 5 presents some related work and the particularities of our approach. Finally, Section 6 concludes the paper.

II. META-DOCUMENTS
In the web context, document representation means highlighting the most meaningful parts of a web page. In our work, this operation is called annotation; it also sorts the meaningful parts it finds by importance in an independent structure called a meta-document. Matching a query representation against the set of meta-documents generally aims to find the documents most relevant to the query. This operation is generally called querying, and it follows a matching algorithm that constitutes the particularity of an information retrieval system. In this section, we highlight the automatic creation of meta-documents and the exploitation of this structure in the querying process.

A. Meta-Document Creation
In our system, the annotation process automatically generates a meta-document that describes a document. In order to keep only the significant terms that can describe a web document, we start by scanning the content of the target document and eliminating empty words. Empty words, or stop words, usually refer to the most common words in a language, such as (the, is, or, by, in, with…).
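The scan-and-eliminate step described above can be illustrated with a minimal Python sketch. It is not the implementation used in our system: the function name remove_stop_words, the tokenization rule, and the very small stop list are assumptions made only for illustration.

```python
import re

# Hypothetical, deliberately tiny stop list used only for illustration;
# a real anti-dictionary would contain several hundred terms per language.
STOP_WORDS = {"the", "is", "or", "by", "in", "with", "le", "la", "et", "de"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercase word tokens of `text` that are not stop words."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented French
    # characters are kept inside a token (assumed tokenization rule).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("The query is extended by the system"))
# ['query', 'extended', 'system']
```

Storing the stop list as a set keeps each membership test constant-time, which matters when every term of every page in the collection has to be checked.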
A stop word often has a high frequency of use across all the documents to be annotated, but also a low semantic value. Since we are interested in French and English documents, our elimination process is based on a universal anti-dictionary called "stop-words-collection" [4], which covers 671 English terms and 463 French terms to be eliminated. After running the stop word elimination process, we move on to the meta-document generation, which aims to annotate the current web page. A meta-document is composed of the source page content sorted into three lists of terms called "very important", "important" and "normal":

1. The very important list: this list contains the contents of some tags, such as <title>, <meta name="keywords" ....>, <h1> and <h2>. In addition, this list contains the 5% most frequent terms in the document.

2. The important list: this list covers the content of the following tags: <h3>, <h4>, <h5>, <b>, <strong>, <meta name="abstract" …> and <meta name="description" …>. In addition, this list contains the 10% most frequent of the remaining terms.