A Comparative Study of Geocoder Performance on Unstructured Tweet Locations

GI_Forum 2023, Issue 1
Page: 110 - 117
Short Paper
Corresponding Author: helenngonidzashe.serere@plus.ac.at
DOI: 10.1553/giscience2023_01_s110

Helen Ngonidzashe Serere 1, Umut Nefta Kanilmaz 1, Sruthi Ketineni 1, Bernd Resch 1,2

1 University of Salzburg, Austria
2 Harvard University, USA

Abstract

Geocoding is the process of converting human-readable addresses into latitude and longitude points. Whilst most geocoders tend to perform well on structured addresses, their performance drops significantly in the presence of unstructured addresses, such as locations written in informal language. In this paper, we make an extensive comparison of geocoder performance on unstructured location mentions within tweets. Using nine geocoders and a worldwide English-language Twitter dataset, we compare the geocoders’ recall, precision, consensus and bias values. As in previous similar studies, Google Maps showed the highest overall performance. However, with the exception of Google Maps, we found that geocoders which use open data have higher performance than those which do not. The open-data geocoders showed the least per-continent bias and the highest consensus with Google Maps. These results suggest the possibility of improving geocoder performance on unstructured locations by extending or enhancing the quality of openly available datasets.

Keywords: commercial geocoders, natural language, Twitter, open data, spaCy

1 Introduction

Geocoding is omnipresent in our day-to-day lives. Tourists searching for nearby restaurants, emergency responders locating victims (Singh et al., 2019) and planners seeking to understand traffic flows (Das & Purves, 2020), to give but a few examples, all rely on geocoding.
However, generating accurate results is not a simple task; it depends on various factors, including the quality of the underlying reference database and the geocoder’s robustness in dealing with natural language (Karimi, Durcik & Rasdorf, 2004). Whilst most geocoders perform well on structured addresses, their performance decreases in the presence of unstructured or partial addresses, which may also include spelling, syntax or formatting errors. This paper seeks to evaluate geocoder performance on unstructured locations, specifically those embedded within tweets. We chose to use a Twitter dataset because of the existing need to