International Journal of Scientific & Engineering Research, Volume 6, Issue 9, September-2015 1082
ISSN 2229-5518
IJSER © 2015
http://www.ijser.org
An Arabic Web search engine using grid
computing and Artificial Intelligence techniques
Mohammed Mahmoud Ibrahim Sakre
Abstract—This research is the result of several years of accumulated work and demonstrates a model for the computationally heavy components of a World Wide Web (WWW) search engine. The architecture is based on grid computing. The crawling load is distributed over a set of computers to retrieve more crawled pages in less time. The proposed architecture of the indexer distributes the indexing load over a set of computers and supports dynamic indexing to cope with the frequent changes in web content. Thus, the proposed architectures of the crawler and the indexer together support the freshness of the web pages: the dynamic indexer is responsible for identifying old pages and sending them back to the crawler to be revisited and updated. The search module is implemented with Arabic morphological analysis/generation and a synonym dictionary, which are combined to produce an intelligent Arabic Internet search module. The use of these linguistic tools is shown experimentally to have positive effects on both precision and recall, with average precision exceeding 0.92. This design is implemented for the Arabic language, but it suits any other language with language-related modifications.
Index Terms— Grid Computing, Internet Search Engine, Crawling, Indexing, Artificial Intelligence, Natural Language Processing.
—————————— ——————————
1 INTRODUCTION
Recently, the World Wide Web (WWW) has become one of the main sources of information for a large number of people. WWW search engines are the mediators between online information and people. They require computers with high computational resources to crawl web pages, and huge data storage to hold the billions of pages collected from the WWW after these pages are parsed and indexed. The proposed solution to this problem is grid computing. The term grid computing emerged in the mid-1990s and refers to a proposed distributed computing infrastructure [1, 2].
The typical design of a search engine consists of three stages, in which a Web crawler creates a collection of pages that is then indexed and searched. This model, in which the operations are executed in strict order (first crawling and indexing as pre-processing phases, then searching as a run-time phase), is shown in figure (1) [3, 4].
Crawling starts with a set of URLs: the crawler fetches their pages and parses them to extract the new URLs that exist in those pages. Each extracted URL is either a newly discovered URL, which should be visited next [4], or an old URL, for which the weight of its page should be increased. This weight affects the page rank during the searching stage.
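The crawl loop described above can be sketched as follows. This is a minimal single-machine illustration, not the paper's distributed implementation; `fetch_page` and `extract_urls` are hypothetical callables standing in for an HTTP client and an HTML link parser.

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_urls, max_pages=100):
    """Breadth-first crawl sketch: newly discovered URLs join the
    frontier to be visited next, while already-seen URLs have their
    page weight increased (used later for ranking)."""
    frontier = deque(seed_urls)
    weights = {}   # URL -> in-link weight, a page-ranking parameter
    pages = {}     # URL -> fetched page content
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in pages:
            continue
        pages[url] = fetch_page(url)
        for link in extract_urls(pages[url]):
            if link in pages or link in frontier:
                weights[link] = weights.get(link, 0) + 1  # old URL: bump weight
            else:
                frontier.append(link)                     # new URL: visit next
                weights.setdefault(link, 1)
    return pages, weights
```

In the proposed grid architecture, the frontier would instead be partitioned across a set of executor machines rather than held in one local queue.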
The indexing stage operates on the pages collected during the crawling stage. It parses the pages and generates the inverted index, as described in previous research by the author and others [5].
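A minimal sketch of such an inverted-index builder is shown below; the stop-word list is an illustrative English one, whereas the paper's system would use an Arabic stop list.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}  # illustrative stop list

def build_inverted_index(pages):
    """Map each non-stop term to the pages containing it, with a
    simple term-frequency count per page."""
    index = defaultdict(dict)  # term -> {url: frequency}
    for url, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            if term not in STOP_WORDS:
                index[term][url] = index[term].get(url, 0) + 1
    return index
```

The searcher can then answer a query by looking up each query term in this structure instead of scanning the pages themselves.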
The searching stage answers users' queries based on the non-stop words among the query terms. Freshness of the web pages is an important factor affecting the efficiency of a search engine, and there are different techniques for keeping web pages up-to-date [6]. Page ranking is the process of estimating the quality of the set of results retrieved by a search engine and presented to the user. Search engines have invested considerable effort in ranking Web objects so as to retrieve the correct and desired information contained in the databases of the WWW. Both freshness and page ranking are considered in the proposed model of this research.
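The freshness idea (the dynamic indexer identifying old pages and sending them back to the crawler for revisiting) can be sketched as follows. The one-week re-crawl window is an assumption for illustration; the paper does not fix a specific interval here.

```python
import time

RECRAWL_INTERVAL = 7 * 24 * 3600  # assumed freshness window: one week

def stale_pages(last_indexed, now=None):
    """Return the URLs whose last indexing time exceeds the freshness
    window, so the crawler can revisit them for updating."""
    now = time.time() if now is None else now
    return [url for url, ts in last_indexed.items()
            if now - ts > RECRAWL_INTERVAL]
```

In the proposed architecture, the dynamic indexer would periodically run such a check and enqueue the stale URLs into the crawler's frontier.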
The searcher uses the indexed database to find the web pages that contain an answer to the user query. The search results are ordered according to their relevance to the query, using the page-ranking parameters calculated during the execution of the crawler and the indexer, and are then presented to the user. The search modules of different search engines differ in the way they work: some use the query words exactly as keyed in by the users, others give the user the ability to use Boolean operators, while more advanced search engines perform lexical and/or morphological analysis on the keywords, like the one presented in this research for the Arabic language.
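The more advanced style of search module can be sketched as follows. The `stem` callable and the synonym dictionary are hypothetical placeholders for the paper's Arabic morphological analyzer/generator and synonym dictionary; the ranking here is a simple frequency sum, not the paper's full page-ranking scheme.

```python
def search(query, index, synonyms, stem=lambda w: w):
    """Expand the query with synonyms, normalize each term with a
    (hypothetical) morphological stemmer, then rank pages by the
    total frequency of matching terms in the inverted index."""
    terms = set()
    for word in query.lower().split():
        terms.add(stem(word))
        for syn in synonyms.get(word, []):  # synonym expansion
            terms.add(stem(syn))
    scores = {}
    for term in terms:
        for url, freq in index.get(term, {}).items():
            scores[url] = scores.get(url, 0) + freq
    return sorted(scores, key=scores.get, reverse=True)
```

Expanding the query in this way is what lets morphological variants and synonyms of the user's words contribute to both precision and recall, as reported in the abstract.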
There are a number of research groups that have been
working in the field of distributed computing. These groups
have created middleware, libraries and tools that allow the
cooperative use of geographically distributed resources
unified to act as a single powerful platform for the execution
of parallel and distributed applications. This approach to computing has been known by several names, such as metacomputing, scalable computing, global computing, Internet computing and, lately, grid computing [1], [2], [7].
The Alchemi system is an open-source software toolkit, developed at the University of Melbourne, which provides middleware for creating an enterprise grid computing environment. Alchemi consists of two main components: the manager and the executor. More than one computer runs the executor program, while only one computer runs the manager