SUPeRB: Building bibliographic resources on the computational processing of Portuguese

Luís Miguel Cabral, Diana Santos, Luís Fernando Costa
Linguateca, Oslo node, SINTEF ICT, Norway
{Luis.M.Cabral, Diana.Santos, Luis.Costa}@sintef.no

Abstract. SUPeRB is a digital library helper that aims at updating and maintaining specific publication repositories, and at assisting institutions and individual actors in publishing publication records. It gathers bibliographic data from Web pages and documents and integrates that data into a local repository of bibliographic data on a specific domain. By collecting information from these resources, SUPeRB also assists in building a bibliographic database of the entities involved in that domain, such as authors, conferences and scientific journals. The domain considered so far has been the computational processing of the Portuguese language.

1 Introduction

Since 1999, Linguateca has been offering a portal about the computational processing of Portuguese, aiming at a reasonably complete overview of the field. Linguateca's goal is to provide a place that helps researchers and developers avoid starting from scratch and keeps them informed of the work of their peers. One of the resources we maintain is a publication catalogue surveying published work in this field. From 1999 to 2003, we manually gathered approximately 750 items, including, if available, their electronic version.

Although our team routinely screens mailing lists and lists of accepted papers in calls for participation for relevant conferences, it is hard to keep this catalogue up to date. It is especially troublesome to find accurate and complete information about papers and other works, since researchers often fail to keep their publication pages up to date.
Furthermore, several barriers frequently hinder the processing of this information:
- Incomplete citations that omit the conference's full name, the volume editors, the conference edition or the place of the conference;
- Bibliographic styles that reduce authors' names to initials, making the authors hard to identify;
- Electronic versions that are not exactly the same as the published ones (at least as far as formatting is concerned).

It should be added that virtually none of the authors we survey in our catalogue uses meta-data or any kind of categorization of their own works. Usually, their publication list is a web page presenting only textual references, in some cases without links to the electronic versions. This lack of data can make it difficult to decide, from the title alone, whether an item is relevant enough to include. Furthermore, users are rarely motivated enough to help us catalogue more publications by suggesting their own publications or others that they find relevant.

In any case, with the overwhelming increase of information on the Web, it is widely agreed that digital methods are needed to help organize this disparate information and make it useful. We have therefore tried to address the need for an automated helper that supports searches and obtains bibliographic data from Web documents, evaluates their relevance for our catalogue, and organizes them accordingly. Our goal was not to provide a fully automated system, but rather to deploy a supervised approach that helps an expert create a meaningful and coherent publication list and maintain it with contributions from the particular community of interest. Our goal is thus similar to that of Feitelson [6], and not in any way an attempt to replace or compete with CiteSeer [9]. SUPeRB aims at providing the publication catalogue with organized data, which can later be updated, and at offering better means of accessing that bibliographic data.
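To illustrate the author-initials barrier listed above, the following minimal Python sketch checks whether a cited name form could abbreviate a known full name. It is not SUPeRB's actual matching procedure, which this section does not describe; the function names `normalize` and `is_compatible`, and the assumption that both strings follow a "given names, then surname" order, are ours and purely illustrative.

```python
import unicodedata


def normalize(token: str) -> str:
    """Lowercase a name token, strip diacritics and surrounding periods."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower().strip(".")


def is_compatible(cited: str, full: str) -> bool:
    """Return True if `cited` could be an abbreviated form of `full`.

    Assumption (ours): both strings are written "Given ... Surname".
    The surname must match exactly; each cited given-name token must
    match a given name of `full`, in order, either in full or as an initial.
    """
    cited_tokens = [normalize(t) for t in cited.split()]
    full_tokens = [normalize(t) for t in full.split()]
    if not cited_tokens or not full_tokens:
        return False
    if cited_tokens[-1] != full_tokens[-1]:
        return False

    given_full = full_tokens[:-1]
    pos = 0
    for tok in cited_tokens[:-1]:
        # Scan forward for the next given name that this token can stand for.
        while pos < len(given_full):
            candidate = given_full[pos]
            pos += 1
            if candidate == tok or (len(tok) == 1 and candidate.startswith(tok)):
                break
        else:
            # Ran out of given names without finding a match.
            return False
    return True


if __name__ == "__main__":
    print(is_compatible("L. M. Cabral", "Luís Miguel Cabral"))
    print(is_compatible("L. Costa", "Luís Fernando Costa"))
    print(is_compatible("L. F. Cabral", "Luís Miguel Cabral"))
```

A check of this kind can only say that two forms are compatible, not that they denote the same person: "L. Costa" is compatible with any author whose surname is Costa and whose first given name starts with L, which is why a human expert remains useful for confirming the match.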
2 SUPeRB, a (digital) library helper

SUPeRB, described in detail in [4], is a semi-automatic system whose purpose is to help search for and process bibliographic references from the Web, with a specific contextual bias, and to aid an expert in constructing and maintaining bibliographic meta-data collections from information given by several users.