Uniform Methodology for Evaluating Information Access Components of Digital Libraries

Giorgio Maria Di Nunzio and Nicola Ferro
Department of Information Engineering
University of Padua – Italy
{dinunzio, ferro}@dei.unipd.it

I. INTRODUCTION

The evaluation of Digital Library Management Systems (DLMSs) is a non-trivial issue that should cover different aspects, such as the DLMS architecture, the DLMS information access and extraction capabilities, the management of multimedia content, the interaction with users, and so on. In particular, with respect to the classification proposed by [1], we are interested in the evaluation aspects concerned with the technological issues of a DLMS and, more specifically, with the information access and extraction components of a DLMS, which deal with the indexing, search, and retrieval of documents in response to a user's query.

Today, the evaluation of the performance of the information access and extraction components of a DLMS is carried out in important international evaluation initiatives, such as the Text REtrieval Conference (TREC, http://trec.nist.gov/), the Cross-Language Evaluation Forum (CLEF, http://clef.isti.cnr.it/), the NII-NACSIS Test Collection for IR Systems (NTCIR, http://research.nii.ac.jp/ntcir/index-en.html), and the INitiative for the Evaluation of XML Retrieval (INEX, http://inex.is.informatik.uni-duisburg.de/). All of these initiatives are based on the Cranfield methodology, which makes use of experimental collections [2] and of measures that quantify retrieval performance. Besides the evaluation of the performance of the single systems, another type of evaluation is the statistical analysis carried out to compare the performance of different components. For this reason, a statistical methodology for judging whether measured differences can be considered statistically significant is needed [3].

However, the evaluation forums mentioned above are carried out in a fragmented way and, most of the time, individually by each participant: each participant acquires the collection and a set of tasks, performs the tasks locally on their own system, and returns the results to the organizers of the evaluation forum. The organizers make the performance figures and the statistical analyses of each participant available, and participants can then compare the results of their systems with those of the others. This view shows that the different stages of an evaluation forum are carried out and completed separately, and that the tools used to analyze and compare results usually differ from participant to participant.

Integrating and uniforming the activities among the different entities involved in the evaluation of the information access components of a DLMS would be of great benefit both for the organizers and for the participants. By "uniform" we mean standard experimental collections that make the experimental results comparable, and standard tools for the analysis of the experimental results that make the analysis and assessment of those results comparable as well. Integration is achieved by providing common tools for carrying out each step of the evaluation activities in a networked and distributed manner.
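As a minimal illustration of the kind of shared statistical analysis tool envisaged here, the following Python sketch compares two retrieval runs topic by topic with a paired t-test, one common choice for judging whether measured differences are statistically significant. The per-topic average precision values and run names are purely illustrative assumptions; the sketch is not part of DIRECT or of any of the evaluation forums cited above.

    from scipy import stats

    # Illustrative per-topic average precision values for two hypothetical runs;
    # in practice they would be computed from a standard experimental collection.
    run_a = [0.31, 0.45, 0.12, 0.58, 0.40, 0.27, 0.36, 0.50]
    run_b = [0.28, 0.41, 0.15, 0.49, 0.38, 0.22, 0.33, 0.47]

    map_a = sum(run_a) / len(run_a)  # mean average precision of run A
    map_b = sum(run_b) / len(run_b)  # mean average precision of run B

    # A paired test is used because both runs are evaluated on the same topics.
    t_statistic, p_value = stats.ttest_rel(run_a, run_b)

    print(f"MAP A = {map_a:.3f}, MAP B = {map_b:.3f}")
    print(f"paired t-test: t = {t_statistic:.3f}, p = {p_value:.3f}")
    # If p falls below a chosen significance level (e.g. 0.05), the measured
    # difference between the two runs can be considered statistically significant.

A paired test is appropriate here because both runs are evaluated on the same set of topics; other tests, such as the Wilcoxon signed-rank test, could be substituted within the same framework.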
The question of uniforming the methodology for evaluating the information access components of digital libraries opens an interesting problem that is usually faced in scientific data curation: the problem of selecting the data to be kept. So far, the format in which results are packaged in evaluation forums is useful for exchanging and transferring them, but not for describing and elaborating on them. Therefore, the following questions should be asked: what criteria should be applied when selecting data for longer-term retention? How do we know what we should keep? Who sets the selection criteria? How can selection be assessed, when, how often, and by whom? Besides these questions, there is also the problem of deciding the right format for the record [4].

An innovative system, called Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) [5], [6], has been designed and developed in the context of the CLEF 2005 evaluation campaign. The aim of this system is to address the issues introduced above by providing:
• the management of an evaluation forum: the track set-up, the harvesting of documents, and the management of the subscription of participants to tracks;
• the management of the submission of experiments, the collection of metadata about experiments, and their validation;
• the creation of document pools and the management of the relevance assessment process;
• common statistical analysis tools for both organizers and participants, in order to allow the comparison of the experiments;
• common tools for summarizing and for producing reports and graphs on the measured performance and the conducted analyses;
• a historical view of the submitted experiments, making them available online to participants for further comparisons and analyses.

Section II describes the adopted evaluation methodologies