Extracting Route Directions from Web Pages

Xiao Zhang, Prasenjit Mitra, Sen Xu, Anuj R. Jaiswal, Alex Klippel, Alan MacEachren
Department of Computer Science and Engineering
College of Information Sciences and Technology
Department of Geography
The Pennsylvania State University
xiazhang@cse.psu.edu, {pmitra, ajaiswal}@ist.psu.edu, {senxu, klippel, maceachren}@psu.edu

ABSTRACT
Linguists and geographers are increasingly interested in route direction documents because they contain interesting motion descriptions and language patterns. A large number of such documents can be easily found on the Internet. A challenging task is to automatically extract meaningful route parts, i.e., destinations, origins and instructions, from route direction documents. However, no work exists on this issue. In this paper, we introduce our effort toward this goal. Based on our observation that sentences are the basic units for route parts, we extract sentences from HTML documents using both natural language knowledge and HTML tag information. Additionally, we study the sentence classification problem in route direction documents and its sequential nature. We compare and analyze several machine learning methods and study the impact of different feature sets. Based on the obtained insights, we propose to use sequence labelling models such as conditional random fields (CRFs) and maximum entropy Markov models (MEMMs), which yield high accuracy in route part extraction. The approach is evaluated on over 10,000 hand-tagged sentences in 100 documents. The experimental results show the effectiveness of our method. The above techniques have been implemented and published as the first module of the GeoCAM¹ system, which is also briefly introduced in this paper.

1. INTRODUCTION
Descriptions of motion, such as route directions, in text corpora provide important information and have fascinated researchers for a long time.
Since the 1970s, linguists and geographers have used route directions to study human spatial cognition and geo-referencing, to analyze route characteristics, and to build databases of linguistically characterized movement patterns [8].

¹ Geographic Contextualization of Accounts of Movement. http://cxs03.ist.psu.edu:8080/GeoCAMWeb/

Copyright is held by the author/owner. Twelfth International Workshop on the Web and Databases (WebDB 2009), June 28, 2009, Providence, Rhode Island, USA.

As web technology thrives, a large number of route direction documents have been generated and are available on the Internet. A business, organization or institution usually provides human-generated direction information on its web site to guide travellers arriving from different places. Such direction web pages contain both meaningful route parts and additional content irrelevant to finding one's way (e.g., advertisements, general descriptions). Although humans mostly manage to follow these route directions, manual processing does not scale to large corpora of documents. Dealing with real-world corpora requires a scalable information system that can automatically detect and extract route directions in web pages. A challenging task in building such a system is to separate meaningful route parts, namely destinations, origins and instructions (or actions), from content other than route directions.

In addition to linguistic use, such human-generated route direction information, if extracted, can be used as supplementary information for auto-generated directions, such as those from Google Maps². Human-generated directions frequently use landmarks that are more obvious than street names as decision points, for example, "turn right at the McDonald's", which are more helpful for users. They also provide additional information for correction or re-direction, for example, "if you see the school, you've gone too far".
Such information is important for users to find their way but is currently missing in auto-generated directions. Once the route directions in human-generated texts are extracted, they can be incorporated into map products to better serve users.

² maps.google.com

In route direction web pages, the destination is the location where the route ends, usually the business, organization or institution hosting the web site, e.g. "Directions to the Campus". The origin specifies the starting point of the route and helps travellers choose which set of instructions to follow in order to arrive at the destination, e.g. "From New York". Instructions are a set of actions to follow at specified landmarks or decision points such as highways or intersections, e.g. "Merge onto US-220 S toward US-322". In direction web pages, route parts are expressed in the form of a complete sentence, an independent phrase or a single word. We refer to all of them as "sentences" in the rest of the paper. The automatic route part extraction proposed in this paper has the goal to classify sentences into