978-1-61284-848-8/11/$26.00 ©2011 IEEE
Supported by the National Basic Research Program of China (973 Program)
(No. 2010CB951603)
An Integrated Framework for Retrieving and Analyzing
Geographic Information in Web Pages
Hao Lin, Longping Hu, Yingjie Hu, Jianping Wu, Bailang Yu*
Key Laboratory of Geographic Information Science, Ministry of Education,
East China Normal University, Shanghai 200062, P. R. China
* Corresponding Author. Email: blyu@geo.ecnu.edu.cn
Abstract—Most of the information stored in web pages contains
geographic context, such as place names, addresses, and coordinates.
Such geographic information is often valuable and is
worth retrieving and analyzing. Traditional search engines are limited
in their capability to extract meaningful geographic information from
masses of unstructured textual sources. Even once geographic
information has been retrieved, specific systems are still needed to show
the spatial distribution of such data. This paper presents an
integrated framework that can retrieve and analyze geographic
information in web pages. The framework integrates the following
core functions: geographic information retrieval, geocoding, spatial
analysis, and statistical analysis. We also demonstrate the
effectiveness of this framework by employing it to retrieve and
analyze the geographic information from a particular website.
Keywords-web pages; geographic information retrieval; spatial
analysis
I. INTRODUCTION
In recent years, we have witnessed a considerable increase in the
application of geographic information in disaster prevention and
control and in regional economic planning. For practical use, it is
vital to keep the relevant databases up to date. Since the
Internet is a rich source of frequently updated information
with geographic components, it is a natural target for data
mining. However, the information retrieved from the web is usually
in textual format. As a cartographic principle, geographic
information is better understood in visual format than in
textual format. Often presented as thematic maps and statistical
graphics, this visual format is commonly used to reveal spatial
patterns, inspire spatial associations, and support spatial decision
making. Because the geographic information contained in web
pages is valuable and the visual format is
indispensable, it is important to establish an integrated
framework that can retrieve geographic information from the
web, convert it into visual format for further analysis, and support
a limited set of spatial analysis functions. Three difficulties
must be overcome. The first is the accuracy of the retrieved
geographic information. The second is converting the derived
geographic location information into GIS vector data. The third
is the integration of GIS spatial analysis functions.
For retrieving geographic information, the concept of
geographical information retrieval (GIR) has been proposed as
an extension of the field of information retrieval (IR) that
features the use of an information dictionary [1]. GIR is
concerned with improving the quality of geographically specific
information retrieval, with a focus on access to unstructured
documents such as those found on the Web, and a number of
studies have been conducted to improve the performance of
geographic information retrieval [2]. One study focused on the
location names in web content, because these names often
have a metonymic meaning different from their literal geographic
sense (e.g., "Yesterday, Seoul and Peking agreed to start
diplomatic relations") [3]. Another study focused on errors
in Natural Language Processing (NLP) and how these errors
may influence the performance of GIR [4]. These semantic
analysis methods can indeed improve accuracy,
but they reduce the efficiency of searches in
practice. To avoid complicated semantic analysis
while still improving accuracy, this paper adopts the method used
by professional information collection systems, which is based on
HTML tag analysis [5]. In addition, a
geographic information dictionary is established as a
supplement, and the target web sites are limited to specific
types, such as professional sites with geographic coordinates,
yellow pages-like sites, and search engine results pages (SERP).
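As an illustration of HTML tag analysis combined with a dictionary lookup, the following is a minimal sketch rather than the implementation used in this paper. It assumes a hypothetical yellow pages-like page in which each field is wrapped in an element carrying a known class attribute ("name", "address"); the class names, the sample markup, and the small place-name dictionary are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TagFieldExtractor(HTMLParser):
    """Collect the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.current = None   # (tag, class) of the field being read, if any
        self.buffer = []      # text fragments of the field being read
        self.records = []     # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if self.current is None and cls in self.wanted:
            self.current = (tag, cls)
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # Assumes wanted elements do not nest elements of the same tag.
        if self.current is not None and tag == self.current[0]:
            text = " ".join("".join(self.buffer).split())
            self.records.append((self.current[1], text))
            self.current = None


def extract(html, wanted=("name", "address")):
    """Return (field, text) pairs for the wanted fields in `html`."""
    parser = TagFieldExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.records


def find_places(text, dictionary):
    """Return dictionary place names that occur in `text` (substring match)."""
    return sorted(name for name in dictionary if name in text)


# Hypothetical markup and dictionary, for illustration only.
SAMPLE = """
<div class="listing">
  <span class="name">Example Bookstore</span>
  <span class="address">500 Example Road, Shanghai</span>
</div>
"""
PLACE_NAMES = {"Shanghai", "Beijing"}

if __name__ == "__main__":
    for field, text in extract(SAMPLE):
        print(field, "|", text, "|", find_places(text, PLACE_NAMES))
```

Because the fields are located by their tags rather than by interpreting sentences, no semantic analysis is required; the dictionary lookup only supplements the extracted fields. Restricting the target sites to a few known layouts is what makes such fixed tag rules workable.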
Data gathered from the web sites above is usually unstructured and
textual, whereas GIS works only on well-structured and
numerically coded data supplied by a spatial database. To
convert text to a map, geocoding (the process of
associating an address record with a point on a map) is necessary.
Gerard Rushton summarized three major methods of geocoding.
The first method assigns an observation to a geographic unit; it is
referred to as address conversion and landmark conversion in this
paper. The second and third methods (interpolation and parcel
matching) attach a point coordinate value to a record [6]; they are
referred to as latitude and longitude conversion. All these methods are
implemented to handle the different types of specific web
sites. In addition, the accuracy of geocoding is determined by the
quality of the input address data. The paper by Zandbergen
discussed the geocoding quality by using different street network