Intelligenza Artificiale 6 (2012) 121–133
DOI 10.3233/IA-120034
IOS Press
1724-8035/12/$27.50 © 2012 – IOS Press and the authors. All rights reserved

Natural language interaction with the web of data by mining its textual side

Elena Cabrio a,*, Julien Cojan a, Alessio Palmero Aprosio b,c and Fabien Gandon a

a INRIA Sophia Antipolis, Sophia Antipolis, France
b FBK, Povo-Trento, Italy
c Università degli Studi di Milano, Milano, Italy

* Corresponding author. Elena Cabrio, INRIA Sophia Antipolis, 2004 Route des Lucioles BP93, 06902 Sophia Antipolis, France. E-mails: elena.cabrio@inria.fr (Elena Cabrio); julien.cojan@inria.fr (Julien Cojan); fabien.gandon@inria.fr (Fabien Gandon); aprosio@fbk.eu (Alessio Palmero Aprosio).

Abstract. The Semantic Web is an extension of the classical web. The data and schemas it adds coexist with the documents that were already linked and available. This not only allows interoperability, reusability and potentially unforeseen applications of open data, but also creates a unique situation in which huge collections of the same pieces of information are available on the web both as text and as structured data. An interesting example is the Wikipedia-DBpedia pair: exploiting these interlinked structured and unstructured data sources in parallel offers great potential for both Natural Language Processing and Semantic Web applications. Starting from these observations, this paper addresses the problem of enhancing interactions between non-expert users and data available on the Web. In particular, we present QAKiS, a system for open domain Question Answering over linked data that addresses the problem of question interpretation as a relation-based match, where fragments of the question are matched to binary relations of the triple store using automatically collected relational textual patterns. In the current version, the relational patterns are automatically extracted from Wikipedia, while DBpedia is the data set queried through a natural language interface.

Keywords: Question answering, linked data, Wikipedia, DBpedia

1. Web of documents and web of data: Jointly exploiting two facets of the web

In the early days of the web, this child of hypertext systems and network applications was often presented using the metaphor of a huge online library of interlinked documents. This documentary facet of the web is the oldest and most persistent perception of it, and it sometimes still hides the true nature of the web, which is in fact a huge network of computational resources, algorithms and data. Among the side effects of this biased perception, for a long time the Natural Language Processing (NLP) community saw the web only as a huge corpus (e.g. [20, 19]), where data are stored in standard HTML documents and presented to users in human-readable representations rendered through the web technical stack and architecture. But in fact one could argue that there are no documents on the web and there never were. We only have ephemeral representations transferred through networks and produced on demand by distributed programs using distributed data. What is now called the web of data can be seen as the first wave of the deployment of the Semantic Web (SW), and it makes very visible the fact that the web publishes not only linked documents, but also linked data and linked applications. The standardization of the frameworks of the semantic web aims at transforming the access to information by adding machine-readable