International Journal of Computer Applications (0975-8887), Volume 55, No. 5, October 2012

Dynamic and Distributed Indexing Architecture in Search Engine using Grid Computing

M. E. ElAraby, Dept. of Computer Science, High Institute for Computers and Information Systems, Al Shorouk Academy, Egypt
M. M. Sakre, Dept. of Computer Science, High Institute for Computers and Information Systems, Al Shorouk Academy, Egypt
M. Z. Rashad, Dept. of Computer Science, Faculty of Computer and Information Sciences, Mansoura University, Egypt
O. Nomir, Dept. of Computer Science, Faculty of Computer and Information Sciences, Mansoura University, Egypt

ABSTRACT
Search engines require computers with high computation resources to crawl web pages, and huge data storage to hold the billions of pages collected from the World Wide Web once these pages have been parsed and indexed. The indexer is one of the main components of the search engine and sits between the crawler and the searcher. Indexing is the process of organizing the collected data to facilitate information retrieval and minimize query time. Indexing requires huge processing and storage resources, and it has a strong effect on the performance of the search engine; this effect differs according to the structure of the index and the index construction process. Distributing the indexing process over a cluster of computers in grid computing will improve performance in two ways: by distributing the parsing load over a number of computers in a grid environment, and by distributing the indexed data, partitioned by term, over the distributed memory of a number of remote computers. Because search engine data collections change frequently, the indexer requires dynamic indexing. Merging distributed and dynamic indexing in one architecture over grid computing will therefore give better performance by utilizing the available resources, without the need for high-cost computers such as supercomputers.
General Terms
Grid Computing, Algorithms, Inverted Index, and World Wide Web.

Keywords
Indexer, World Wide Web, Search engine, Grid Computing, Web pages, Secondary index, Main index, Alchemi, Manager, and Executor.

1. INTRODUCTION
The internet is one of the main sources of information and is useful to a large number of consumers. The easiest way to deal with the internet is through a search engine. Search engines act as mediators between consumers and online information, searching and retrieving information from the World Wide Web, which contains many billions of web pages. The components of a search engine are therefore an important topic of research. A search engine has at least three main components: the crawler, the indexer, and the searcher [1], as shown in Figure 1. The role of the search engine is to gather web pages and index them so that they can be retrieved easily and quickly by user queries.

One of the main components of a search engine is the indexer, which carries out the steps that generate the indexed page collection. Indexing is the act of classifying items and providing an index in order to make them easier to retrieve. Indexing a data set enables access to large data sets and reduces access time during query processing: the way to avoid linearly scanning the texts for each query is to index the documents in advance. The crawler collects a very large number of web pages in a database, so when a search query arrives, retrieving the required page from the raw collection would take a long time. The indexing process in the search engine therefore parses and stores the data to facilitate fast and accurate information retrieval.

There are many different indexing techniques, and each one has different performance and speed. There are two general types of indexing: full-text indexing and partial-text indexing. Full-text indexing parses and stores all the words in the document, so it requires more storage and increases the index size, whereas partial-text indexing restricts the depth indexed in order to reduce the index size.
Popular search engines focus on the full-text indexing of online, natural language documents [2]. Indexing is one of the main components of the search engine and is the intermediate stage between crawling and searching, so it affects the general performance, speed, and accuracy of the search engine. The performance of the search engine and of its underlying indexing techniques is one of the factors that are critical for the usability of text retrieval systems [3].
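
The contrast drawn above between linearly scanning the texts and consulting an index built in advance, and between full-text and partial-text indexing, can be illustrated with a minimal sketch. It is not the architecture proposed in this paper: the function names (`tokenize`, `build_index`, `search`, `linear_scan`), the `max_terms` parameter used to model a restricted indexing depth, and the three sample documents are all assumptions introduced only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a deliberately simple parser)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, max_terms=None):
    """Build an inverted index mapping each term to the set of document ids containing it.

    max_terms=None indexes every term (full-text indexing).
    max_terms=k indexes only the first k terms of each document, which is
    one assumed way to model the restricted depth of partial-text indexing.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        if max_terms is not None:
            tokens = tokens[:max_terms]
        for term in tokens:
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing all query terms, using index lookups only."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def linear_scan(docs, query):
    """Baseline without an index: re-parse every document for every query."""
    terms = set(tokenize(query))
    if not terms:
        return set()
    return {doc_id for doc_id, text in docs.items()
            if terms <= set(tokenize(text))}


# Hypothetical sample collection.
docs = {
    1: "Grid computing distributes the indexing load",
    2: "The crawler collects web pages for the indexer",
    3: "Search engines index web pages in advance",
}

full_index = build_index(docs)                  # full-text
partial_index = build_index(docs, max_terms=3)  # partial-text (first 3 terms only)

print(search(full_index, "web pages"))     # same answer as linear_scan(docs, "web pages")
print(search(partial_index, "web pages"))  # terms beyond the indexed depth are not found
print(len(full_index), len(partial_index)) # the partial index holds fewer terms
```

In this sketch the full-text index answers a query with a few set lookups instead of re-reading every document, at the cost of storing every term, while the partial-text index is smaller but misses terms that lie beyond the indexed depth.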