Towards a framework for automatic geographic feature extraction from Twitter

Enrico Steiger, Johannes Lauer, Timothy Ellersiek, Alexander Zipf

GIScience Research Group, Institute of Geography, University of Heidelberg, Berliner Straße 48, D-69120 Heidelberg
Email: {enrico.steiger, johannes.lauer, timothy.ellersiek, zipf}@geog.uni-heidelberg.de

1. Introduction

Interactive social media platforms (Flickr, Twitter, etc.) offer a tremendous amount of volunteered, user-generated content. Together with volunteered geographic information (VGI), they provide a potentially valuable source of information that is increasingly recognized but, particularly in GIScience, not yet used to its full potential. Twitter in particular, as a location-based social network, makes it possible to sense geo-processes and to learn how individual users perceive geographic objects. A georeferenced tweet is a proxy for a real-world observation and contains spatial, temporal and semantic information. These social sensor measurements depend on the particular tweet locations and are influenced by each user's individual perception of urban space.

Although a growing body of research analyses Twitter data, a key open question is whether this noisy, biased data source forms a representative sample for the knowledge discovery of geographic information. Location information retrieved from Twitter data is spatio-temporally and semantically uncertain.

One of our main research aims is therefore to investigate whether geographic features can be detected in and extracted from tweets. Furthermore, we explore whether the inferred geometries of these features match real-world spatial objects (e.g. points of interest). In this work we propose a framework to infer geographic features from unstructured georeferenced Twitter data using semantic topic modelling and spatial clustering techniques.
For the geographic features detected and extracted from Twitter, we computed geometries and compared the results with map features from OpenStreetMap.

1.1 Related Work

A number of previous studies aim, on a macroscopic scale, to infer direct or indirect geographic information from Twitter using the provided metadata, the semantic tweet content, or geographic coordinates. Cha et al. (2010) focus on enriching georeferenced tweets by inferring the location from user profiles and, in addition, from the users' social network. Gonzalez and Chen (2012), Hiruta et al. (2012) and Lee and Hwang (2012) further develop location inference systems using the user profile location, semantically classified tweet content, or GPS coordinates from the geotag. Hong et al. (2012) develop a location-aware topic model to relate locations and words. Dalvi et al. (2012) geolocate users by matching posted tweets containing indirect spatial information to real-world spatial objects. Sengstock and Gertz (2012) introduce a framework for the unsupervised extraction of latent geographic features from georeferenced Flickr data.

2. Methods

Tweets represent a spatio-temporal signal with a semantic information layer. We extracted a semantic dimension over geographic space in order to infer geographic features at a large map scale (street level).
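
The idea of combining a semantic grouping with spatial clustering can be illustrated with a minimal sketch. It is not the implementation used in this work: the text above does not name the topic model or the clustering algorithm, so the sketch assumes a simple keyword match as a stand-in for topic modelling and a distance-threshold grouping (connected components) as a stand-in for spatial clustering. The sample records, the keyword list, the 100 m threshold and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical sample records (latitude, longitude, text); not real data.
TWEETS = [
    (49.4100, 8.7150, "coffee near the castle"),
    (49.4101, 8.7152, "castle view is great"),
    (49.4102, 8.7149, "walking up to the castle"),
    (49.4200, 8.6700, "train delayed at the station"),
    (49.4201, 8.6702, "waiting at the station again"),
    (49.4500, 8.6000, "castle"),  # isolated point, dropped as noise
]

# Assumed keyword list standing in for topics from a trained topic model.
TOPIC_KEYWORDS = ("castle", "station")


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def assign_topics(tweets, keywords):
    """Group tweet coordinates by the first keyword found in the text."""
    groups = defaultdict(list)
    for lat, lon, text in tweets:
        for kw in keywords:
            if kw in text.lower():
                groups[kw].append((lat, lon))
                break
    return groups


def cluster(points, eps_m=100.0, min_points=2):
    """Connected components under a distance threshold; small ones dropped."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if haversine_m(points[i], points[j]) <= eps_m]
            for j in near:
                unvisited.remove(j)
            component.extend(near)
            frontier.extend(near)
        if len(component) >= min_points:
            clusters.append([points[i] for i in component])
    return clusters


def centroid(points):
    """Arithmetic mean of the coordinates; adequate only for small extents."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def extract_features(tweets, keywords=TOPIC_KEYWORDS, eps_m=100.0):
    """Return {topic: [centroid, ...]} with one centroid per spatial cluster."""
    return {
        topic: [centroid(c) for c in cluster(pts, eps_m)]
        for topic, pts in assign_topics(tweets, keywords).items()
    }
```

Calling `extract_features(TWEETS)` yields one candidate location per topic for the sample records, and the isolated "castle" point is discarded because it forms a group of a single tweet. A centroid is only the simplest possible geometry; a comparison against OpenStreetMap features would more plausibly use a hull or polygon derived from each cluster.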