Semi-automatic Metadata Extraction from Imagery and Cartographic Data

Laura Díaz, Cristian Martín, Michael Gould, Carlos Granell
Centre for Interactive Visualization (CeVI)
Universitat Jaume I
Castellón, Spain
laura.diaz@uji.es

Miguel Angel Manso
Department of Topographic and Cartographic Engineering
Universidad Politécnica de Madrid
Madrid, Spain
m.manso@upm.es

Abstract— Metadata are necessary to allow discovery and description of data and service resources within a Spatial Data Infrastructure; however, current manual metadata editing workflows are tedious and under-utilized. We discuss ongoing developments for semi-automatic metadata extraction from well-known imagery and cartographic data sources, being implemented within an open source software project in Spain. Internal metadata are collected automatically, and the user can then choose to add external metadata and to publish the final metadata record to catalogues. The next step will be to extract implicit metadata using Google-like methods.

Keywords-metadata, information retrieval, open source, Spatial Data Infrastructures

I. INTRODUCTION

Metadata are key to enabling optimal data fusion and discovery, and to the proper operation of Spatial Data Infrastructures (SDI) [1, 2]. Most Earth observation (EO) image processing software is equipped to read and exploit the image header file, a common type of internal metadata, to learn more about the image characteristics (size, coordinate system, pixel resolution, etc.) and thus to visualize and process the image properly. However, these internal metadata are normally not complete enough to help the human user judge whether the image is useful, covers the proper geographic area (a bounding box is not enough), has significant cloud cover, etc. It is normally the user who must manually create these external metadata, which are necessary to publish, search for, and facilitate access to that data product in an SDI.
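As an illustration of the distinction between internal and external metadata, the following minimal Python sketch derives a draft record from header values and reports which external elements a user would still have to supply. It is not the extraction code of the project described here: the `ImageHeader` fields, the `EXTERNAL_ELEMENTS` list, and the function names are all hypothetical, and the sketch assumes a north-up image whose origin is its upper-left corner.

```python
from dataclasses import dataclass


@dataclass
class ImageHeader:
    """Hypothetical internal metadata, as might be read from an image header."""
    width: int           # number of columns, in pixels
    height: int          # number of rows, in pixels
    pixel_size_x: float  # ground units per pixel along x
    pixel_size_y: float  # ground units per pixel along y (positive value)
    origin_x: float      # x coordinate of the upper-left corner
    origin_y: float      # y coordinate of the upper-left corner
    crs: str             # coordinate reference system identifier


# Illustrative (assumed) external elements that a header cannot supply.
EXTERNAL_ELEMENTS = ("abstract", "point_of_contact", "cloud_cover")


def bounding_box(header):
    """Return (min_x, min_y, max_x, max_y) for a north-up image."""
    max_x = header.origin_x + header.width * header.pixel_size_x
    min_y = header.origin_y - header.height * header.pixel_size_y
    return (header.origin_x, min_y, max_x, header.origin_y)


def draft_record(header, external=None):
    """Merge internal metadata with user-supplied external metadata.

    Returns the draft record and the list of external elements still missing.
    """
    external = dict(external or {})
    record = {
        "crs": header.crs,
        "size": (header.width, header.height),
        "resolution": (header.pixel_size_x, header.pixel_size_y),
        "bounding_box": bounding_box(header),
    }
    # Copy only the external elements the sketch knows about.
    for name in EXTERNAL_ELEMENTS:
        if name in external:
            record[name] = external[name]
    missing = [name for name in EXTERNAL_ELEMENTS if name not in external]
    return record, missing
```

In this sketch the bounding box, size, and resolution come entirely from the header, while elements such as the abstract remain flagged as missing until a person provides them.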
Creating the documentation that describes who created the data, where they can be found, what geographic places they cover, their general description or abstract, etc., is a laborious task, in part because users often possess data created by other parties, and so it can be difficult to locate the original or any knowledgeable sources for some key metadata elements. This documentation process should be automated to the extent possible, given that information technology has greatly improved since the early days of the digital libraries that gave birth to the current manual metadata creation methodology. This process is currently undertaken using simple text editors, outside of the GIS or image processing workflow.

The metadata problem is greater still among users of digital cartographic (vector) data, because GIS software managing these data has less of a tradition of including metadata in a header-like file (an exception being so-called world files, containing coordinate reference information) and is often even less capable than image processing software with regard to metadata extraction and exploitation.

A few proprietary metadata extraction solutions have appeared; however, in most cases their workflow is restricted to creation and cataloguing using client and server software from the same commercial family, whereas SDI-related initiatives such as INSPIRE [3] and GMES [4] are promoting heterogeneity and interoperability, making the availability of open source solutions all the more attractive.

Recently, Google and several multimedia information retrieval projects have demonstrated that data resources may be found without the need for tedious manual data product documentation, thanks to intelligent methods for intuitive metadata extraction from the data source. This is the direction we have chosen to follow [5].

II. METADATA CREATION PLATFORM

A large migration project from proprietary to free software, initiated in 2004 by the Valencia regional government (Generalitat Valenciana), has produced a client software product called gvSIG (www.gvsig.gva.es) [6]. What began as a simple, Java-based GIS client quickly evolved to become, at its version 1.0, a full-function SDI client, meaning that it facilitates discovery and sharing of geospatial data in addition to local geoprocessing. With this SDI-based data sharing in mind, we have designed a gvSIG extension to semi-automatically extract metadata from well-known geodata formats (GeoTIFF, Shapefile, etc.), that is, at the dataset (layer) level. The idea, as stated previously, is that GIS/SDI users are given the ability to document their new data resources at the time of creation, or at least while viewing and utilizing the data, directly within the normal workflow without the