978-1-61284-848-8/11/$26.00 ©2011 IEEE

Supported by the National Basic Research Program of China (973 Program) (No. 2010CB951603)

An Integrated Framework for Retrieving and Analyzing Geographic Information in Web Pages

Hao Lin, Longping Hu, Yingjie Hu, Jianping Wu, Bailang Yu*
Key Laboratory of Geographic Information Science, Ministry of Education, East China Normal University, Shanghai 200062, P. R. China
* Corresponding Author. Email: blyu@geo.ecnu.edu.cn

Abstract-Most of the information stored in web pages has a geographic context, such as place names, addresses, and coordinates. Such geographic information is often of great value and is worth retrieving and analyzing. Traditional search engines are limited in their ability to extract meaningful geographic information from large volumes of unstructured text. Even once geographic information has been retrieved, dedicated systems are still needed to show its spatial distribution. This paper presents an integrated framework that retrieves and analyzes geographic information in web pages. The framework integrates the following core functions: geographic information retrieval, geocoding, spatial analysis, and statistical analysis. We demonstrate the effectiveness of the framework by using it to retrieve and analyze the geographic information from a particular website.

Keywords-web pages; geographic information retrieval; spatial analysis

I. INTRODUCTION

In recent years, the application of geographic information in disaster prevention and control and in regional economic planning has increased considerably. For practical use, it is vital to keep the relevant databases up to date. Since the Internet is a rich source of frequently updated information with geographic components, it is a natural target for data mining. However, the information retrieved from the web is usually in textual format.
As a cartographic principle, geographic information is better understood in visual form than in textual form. Often presented as thematic maps and statistical graphics, the visual form is commonly used to reveal spatial patterns, suggest spatial associations, and support spatial decision making. Because the geographic information contained in web pages is valuable and a visual presentation is indispensable, it is important to establish an integrated framework that can retrieve geographic information from the web, convert it into visual form for further analysis, and support a limited set of spatial analysis functions. Three difficulties have to be overcome. The first is the accuracy of the retrieved geographic information. The second is converting the derived location information to GIS vector data. The third is the integration of GIS spatial analysis functions.

For retrieving geographic information, the concept of geographic information retrieval (GIR) has been proposed as an extension of the field of information retrieval (IR), characterized by the use of an information dictionary [1]. GIR is concerned with improving the quality of geographically specific information retrieval, with a focus on access to unstructured documents such as those found on the Web, and a number of studies have sought to improve its performance [2]. One study focused on location names in web content, because these names often carry a metonymic meaning different from their literal geographic sense (e.g., "Yesterday, Seoul and Peking agreed to start diplomatic relations") [3]. Another study focused on errors in Natural Language Processing (NLP) and how these errors may influence the performance of GIR [4]. These semantic analysis methods can indeed improve accuracy, but they reduce search efficiency in practice.
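The metonymy problem can be illustrated with a minimal sketch of purely dictionary-based place name matching. This is not the method of any cited work: the dictionary entries, the approximate coordinates, and the function name `find_place_names` are assumptions introduced only for illustration.

```python
import re

# Hypothetical mini-dictionary: place name -> approximate (latitude, longitude).
# Entries and coordinates are illustrative assumptions, not data from the paper.
GAZETTEER = {
    "Seoul": (37.57, 126.98),
    "Peking": (39.90, 116.40),
    "Shanghai": (31.23, 121.47),
}

# Single alternation pattern over all names, longest first so that a longer
# name is preferred over any shorter name it might contain.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(GAZETTEER, key=len, reverse=True))
    + r")\b"
)


def find_place_names(text):
    """Return (name, coordinates) for each dictionary entry found in text, in order."""
    return [(m.group(1), GAZETTEER[m.group(1)]) for m in _PATTERN.finditer(text)]


sentence = "Yesterday, Seoul and Peking agreed to start diplomatic relations"
matches = find_place_names(sentence)
for name, coords in matches:
    print(name, coords)
```

A literal lookup of this kind reports both "Seoul" and "Peking" as locations, although the sentence refers to governments rather than places. Telling the two senses apart is the task that semantic analysis takes on, at the cost of additional processing.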
To avoid complicated semantic analysis while improving accuracy, this paper adopts the approach of professional information collection systems, which is based on the analysis of HTML tags [5]. In addition, a geographic information dictionary is established as a supplement, and the target web sites are limited to specific types, such as professional sites with geographic coordinates, yellow-pages-like sites, and search engine results pages (SERPs).

Data gathered from these web sources is usually unstructured and textual, whereas GIS works only on well-structured, numerically coded data supplied by a spatial database. To convert text to a map, geocoding (the process of associating an address record with a point on a map) is necessary. Gerard Rushton summarized three major methods of geocoding. The first method assigns an observation to a geographic unit; in this paper it is referred to as address conversion and landmark conversion. The second and third methods (interpolation and parcel matching) attach a point coordinate value to a record [6]; in this paper they are referred to as latitude and longitude conversion. All of these methods are implemented to handle the different types of target web sites. In addition, the accuracy of geocoding is determined by the quality of the input address data. The paper by Zandbergen discussed the geocoding quality by using different street network