International Journal of Engineering Research and Development
e-ISSN: 2278-067X, p-ISSN: 2278-800X, www.ijerd.com
Volume 10, Issue 4 (April 2014), PP. 69-73

Survey Paper on Generalized Inverted Index for Keyword Search

Dhomse Kanchan 1, Prof. R. L. Paikrao 2
1 M.E., Computer, AVCOE, Sangamner
2 H.O.D., Computer Dept., AVCOE, Sangamner

ABSTRACT:- Inverted lists are normally used to index documents so that the documents matching a set of keywords can be accessed efficiently. Since inverted lists are normally large, many compression techniques have been proposed to reduce storage space and disk I/O time. This paper surveys improved techniques for document identifier assignment and presents a more convenient index structure, the Generalized Inverted Index, which merges consecutive IDs in inverted lists into intervals to save storage space. With this index structure, more efficient algorithms can be devised to implement basic keyword search operations. The performance of the Generalized Inverted Index is further improved by reordering the documents in datasets using two new scalable algorithms. Experiments on the performance and scalability of the Generalized Inverted Index on real datasets show that it requires less storage space and improves keyword search performance compared with traditional inverted indexes.

Keywords:- keyword search; DocID assignment; index compression; document reordering; union algorithm; intersection algorithm.

I. INTRODUCTION

With the large amount of new information being produced, keyword search is critical for users to access text datasets. These datasets include textual documents (web pages), XML documents, and relational tables (which can also be regarded as sets of documents). Users retrieve documents by simply typing in keywords as queries.
Current keyword search systems normally use an inverted index to retrieve documents efficiently. An inverted index is a data structure that maps each word in the dataset to a list of IDs of the documents in which the word appears. The inverted index for a document collection consists of a set of so-called inverted lists, also known as posting lists. Each inverted list corresponds to a word and stores, in ascending order, the IDs of all documents in which this word appears.

In practice, real-world datasets are so large that keyword search systems normally use various compression techniques to reduce the space cost of storing inverted indexes. Compressing an inverted index not only reduces the space cost but also leads to less disk I/O time during query processing. As a result, compression techniques have been extensively studied in recent years. Since IDs in inverted lists are sorted in ascending order, many existing techniques, such as Variable-Byte Encoding (VBE) [1] and PForDelta [2], store the differences between consecutive IDs, called d-gaps, and then encode these d-gaps using shorter binary representations.

Although a compressed inverted index is smaller than the original index, the system needs to decompress the encoded lists during query processing, which leads to extra computational cost. To solve this problem, this paper presents the Generalized Inverted Index, an extension of the traditional inverted index (denoted by InvIndex) that supports keyword search. The Generalized Inverted Index encodes consecutive IDs in each inverted list of InvIndex into intervals and adopts efficient algorithms to support keyword search directly on these interval lists.

II. BACKGROUND

A. Prior Work on DocID Assignment

The compressed size of an inverted list, and thus of the entire inverted index, is a function of the d-gaps being compressed, which in turn depend on how we assign docIDs to documents (or columns to documents in the matrix).
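As a concrete illustration of the two list representations discussed so far, d-gaps and interval lists, the following minimal Python sketch converts one sorted posting list into each form and intersects two interval lists with a simple two-pointer scan. The function names (`to_dgaps`, `to_intervals`, `intersect_intervals`) and the sample IDs are assumptions made for this example; the intersection shown is a plain merge and is not claimed to be the algorithm used by the Generalized Inverted Index.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (lower, upper) bounds


def to_dgaps(doc_ids: List[int]) -> List[int]:
    """Replace each ID by its difference from the previous ID (first ID kept as-is)."""
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def to_intervals(doc_ids: List[int]) -> List[Interval]:
    """Merge runs of consecutive IDs in a sorted list into inclusive intervals."""
    intervals: List[Interval] = []
    for doc_id in doc_ids:
        if intervals and doc_id == intervals[-1][1] + 1:
            # Extend the current run by one ID.
            intervals[-1] = (intervals[-1][0], doc_id)
        else:
            # Start a new run.
            intervals.append((doc_id, doc_id))
    return intervals


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Two-pointer intersection of two sorted, non-overlapping interval lists."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lower = max(a[i][0], b[j][0])
        upper = min(a[i][1], b[j][1])
        if lower <= upper:
            result.append((lower, upper))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


list_a = [3, 4, 5, 6, 9, 10, 15]
list_b = [5, 6, 7, 10, 15, 16]

print(to_dgaps(list_a))      # [3, 1, 1, 1, 3, 1, 5]
print(to_intervals(list_a))  # [(3, 6), (9, 10), (15, 15)]
print(intersect_intervals(to_intervals(list_a), to_intervals(list_b)))
# [(5, 6), (10, 10), (15, 15)]
```

In this sketch the intersection works on interval bounds only, so a long run of consecutive IDs costs one comparison rather than one per ID, and no decoding step is needed before the lists are combined.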
Common integer compression algorithms need fewer bits to represent a smaller integer than a larger one, but the number of bits required is typically less than linear in the value. This means that if we assign docIDs to documents such that we get many small d-gaps and a few larger d-gaps, the resulting inverted list will be more compressible than another list with the same average gap but more uniform gaps. This is the insight that has motivated the recent work on optimizing docID assignment [3]-[5]. Note that this work is related in large part to the more general topic of sparse matrix compression [6], with parallel lines of work often existing between the two fields.
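The effect of the gap distribution can be checked with a small experiment. The sketch below uses a simple variable-byte scheme (7 payload bits per byte, high bit set on every byte except the last of each value); this byte layout is one common convention and is an assumption here, not necessarily the layout used in [1]. The two gap sequences are synthetic and were chosen only so that both have the same number of gaps and the same total.

```python
from typing import List


def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, least significant group first."""
    out = bytearray()
    while True:
        low_bits = n & 0x7F
        n >>= 7
        if n:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)         # high bit clear: last byte of this value
            return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Decode a concatenation of values produced by vbyte_encode."""
    values = []
    n = 0
    shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n = 0
            shift = 0
    return values


def encoded_size(gaps: List[int]) -> int:
    """Total number of bytes needed to encode all gaps."""
    return sum(len(vbyte_encode(g)) for g in gaps)


# Two synthetic lists with 1000 gaps each and the same total (200000),
# hence the same average gap of 200.
uniform_gaps = [200] * 1000
skewed_gaps = [1] * 990 + [19901] * 10

print(sum(uniform_gaps), sum(skewed_gaps))  # 200000 200000
print(encoded_size(uniform_gaps))           # 2000
print(encoded_size(skewed_gaps))            # 1020
```

Under these assumptions the skewed list needs roughly half the bytes of the uniform one, because each of the 990 gaps of 1 fits in a single byte while every gap of 200 needs two. A docID assignment that places similar documents next to each other aims to produce this kind of skew.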