Copyright ©2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
DOI: 10.4018/978-1-5225-2375-8.ch004
Chapter 4
Issues and Challenges in Web Crawling for Information Extraction
ABSTRACT
Computational biology and bio-inspired techniques are part of a larger revolution
that is increasing the processing, storage, and retrieval of data in a major way.
This larger revolution is driven by the generation and use of information in
all forms and in enormous quantities, and it requires the development of intelligent
systems for gathering, storing, and accessing information. This chapter describes
the concepts, design, and implementation of a distributed web crawler that runs
on a network of workstations and has been used for web information extraction.
The crawler needs to scale to at least several hundred pages per second, be resilient
against system crashes and other events, and be adaptable to various crawling
applications. Further, this chapter focuses on various ways in which appropriate
biological and bio-inspired tools can be used to implement, automatically locate,
understand, and extract online data independent of the source, and to make it
available to Semantic Web agents such as a web crawler.
Subrata Paul
Vignan Institute of Technology and Management, India
Anirban Mitra
Vignan Institute of Technology and Management, India
Swagata Dey
MIPS, MITS, Rayagada, India