Semantic Routing for Effective Search in Heterogeneous and Distributed Digital Libraries

Federica Mandreoli*, Riccardo Martoglia*, Wilma Penzo†, and Simona Sassatelli*
* DII – University of Modena and Reggio Emilia, Italy, {fmandreoli,rmartoglia,sassatelli}@unimo.it
† DEIS – University of Bologna, Italy, {wpenzo}@deis.unibo.it

Abstract— Next generation Digital Libraries (DLs) will offer an entire ensemble of systems and services designed to help users to easily find and access the information they are looking for. However, much work is still required in order to achieve this vision. In this paper, we concentrate our attention on devising techniques allowing an effective routing of queries, which we think can be of the utmost importance in providing effective and efficient querying in heterogeneous and distributed DLs, identifying the best ways to navigate the available nodes and, thus, the documents (or their parts) which are most suitable to best answer the user needs. We describe a routing mechanism, which we call routing by mapping, in which the query is sent to the DL peers whose subnetworks best approximate the concepts required. To this end a distributed index mechanism is adopted, which we call Semantic Routing Index (SRI). We also present some exploratory experiments showing the effectiveness of the proposed approach.

I. INTRODUCTION

In recent years, the constant integration and enhancements in computational resources and telecommunications, along with the considerable drop in digitizing costs, have fostered development of systems which are able to electronically store, access and diffuse via the Web a large number of digital documents and multimedia data. In such a sea of electronic information, the user can easily get lost in her/his struggle to find the information (s)he requires.
For these reasons, the concept of Digital Library (DL) has become a pivotal one: just like a physical library, a DL contains a collection of documents that are at the users' disposal. The most advanced DLs becoming available today have the following features, among others: (i) documents (textual documents or even metadata on multimedia items) are not limited to free text, but are most likely also expressed in semistructured formats, such as XML associated with XML Schemas; (ii) they come from different sources, usually available on the Web, and are heterogeneous in the structures adopted for their representation but related in the content they deal with; (iii) the underlying architecture is more and more often distributed over a number of nodes (peers), each one, for instance, managing specific document collections.

Along with the documents themselves, a good next generation DL should offer an entire ensemble of systems and services designed to help users easily find and access the information they are looking for. Indeed, querying and accessing distributed and heterogeneous DL information in an effective and efficient way requires devising a whole series of techniques in several synergistic areas.

Consider for instance Figure 1 as a sample scenario of a portion of a distributed DL containing data about publications. Each peer composing the DL network ("DL Peer" in the picture) is enriched with a schema that represents the peer's domain of interest, and semantic mappings, represented as grey bold lines, are locally established between peers' schemas [1], [2], [3]. In order to query a peer in the DL, its own schema is used for query formulation, and mappings are used to reformulate the query over its immediate neighbors, then over their immediate neighbors, and so on. Thus, query answers can come from any peer in the DL that is connected through a semantic path of mappings [4].
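The step-by-step reformulation described above can be pictured as a breadth-first traversal of the mapping graph. The following Python fragment is only a minimal sketch under stated assumptions, not the implementation of any system discussed in this paper: the six peer names, the `MAPPINGS` topology, and the function names `semantic_paths` and `reachable_peers` are hypothetical and chosen purely for illustration. Actual query rewriting along each mapping is omitted.

```python
from collections import deque

# Hypothetical mapping graph (an assumption for illustration only):
# each peer lists the neighbors towards which it holds semantic mappings.
MAPPINGS = {
    "A": ["B", "C"],
    "B": ["D", "F"],
    "C": ["E"],
    "D": ["F"],
    "E": [],
    "F": [],
}


def semantic_paths(mappings, start):
    """Return every simple path of mappings that starts at `start`.

    Paths are produced in breadth-first order: immediate neighbors first,
    then their immediate neighbors, and so on.
    """
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for neighbor in mappings.get(path[-1], []):
            if neighbor in path:
                # Do not forward the query back to a peer already on this path.
                continue
            extended = path + [neighbor]
            paths.append(extended)
            queue.append(extended)
    return paths


def reachable_peers(mappings, start):
    """Return the peers that can contribute answers to a query posed on `start`."""
    return {path[-1] for path in semantic_paths(mappings, start)}


if __name__ == "__main__":
    for path in semantic_paths(MAPPINGS, "A"):
        print(" - ".join(path))
```

In this sketch a peer may be reached through more than one path, which is the situation that makes the choice of the path relevant: exhaustive propagation visits all of them, whereas a routing strategy would rank them.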
In such a setting, effectively answering a query means propagating it towards the peers which are semantically best suited for answering the user needs. However, it is not always convenient for a peer to propagate a query towards all other peers. In particular, a query posed over a given DL peer should be forwarded to the most relevant peers that offer semantically related results, first among its immediate neighbors, then among their immediate neighbors, and so on. As an example, let us consider the following query, posed on the schema of peer A: "Retrieve the titles of the scientific publications of author XY". Peer A's neighbors, peer B and peer C, are very similar with respect to the portion of the schemas involved in the query above; at the second step of query reformulation, peer E is more relevant than peer D and peer F, since it deals with scientific publications instead of magazines and newspapers. For these reasons, the answers obtained from path peer C - peer E fit the query conditions better than those from paths peer B - peer D - peer F and peer B - peer F.

In this paper, we concentrate our attention on devising techniques allowing an effective routing of queries in a distributed environment, which we think can be of the utmost importance in providing effective and efficient querying in next generation DLs, identifying the most relevant documents (and portions of documents) in their network. We describe a routing mechanism, which we call routing by mapping [5], in which the query is sent to the peers whose subnetworks best approximate the concepts required. To this end a distributed index mechanism is adopted: each peer in the DL owns a Semantic Routing Index (SRI) which summarizes the ability of its subnetworks to semantically approximate the concepts