Reuse of Lexicographic Data for a Multipurpose Pronunciation Database and Phonetic Transcription Generator for Regional Variants of Portuguese

Simone Ashby and José Pedro Ferreira¹
Instituto de Linguística Teórica e Computacional (ILTEC)

Among the benefits of a flexible and modular lexical database are: the facility of building new modules from existing ones, the reuse of lexicographic data to both enhance the user experience and achieve NLP aims, the time saved in accomplishing these objectives, and the economy that comes from minimizing redundancy (van der Eijk, Bloksma, and van der Kraan 1992). LUPo, or the Portuguese Unisyn Lexicon, is one of the first speech-dedicated applications to take full advantage of a collection of lexical resources as the basis for a text-to-speech system. Consisting of a pronunciation lexicon and rule system for generating accent-specific phonetic transcriptions for Portuguese, LUPo circumvents the cost of producing high-quality phonetic transcriptions by hand, while attracting a wider pan Lusophone audience to the lexical database in which it resides, and providing the research community with a vast resource of Portuguese accent data for evaluating speech applications and testing theories.

1. Introduction

This paper presents a description of LUPo's functions for online use, and the architecture and administrative layer that support this application. Implications for practical lexicography are also presented in terms of the emerging role of the multi-dimensional and lexicographically rich Portal database as a pan Lusophone resource and basis for addressing natural language processing (NLP) problems. Our presentation will showcase the LUPo system, with a focus on its setup, results for the end user, and an easy-to-use lexicographic back end for maintaining and expanding the pronunciation database.
More in-depth information about LUPo and the English Unisyn lexicon upon which it is based may be found in Ashby, Ferreira, and Barbosa (2009) and Fitt (2000), respectively.

2. Background

Unconstrained by the traditional expectations of a dictionary, lexical databases can be more dynamic in the types of functions they serve, the audiences they target, and the information they reuse. The Portal da Língua Portuguesa (Janssen 2007), hereafter referred to as the Portal, is one such collection of lexicographic resources, designed both for human consumption and for computational exploitation (Janssen 2005). The Portal's modular architecture, and the aim of extending this database to a global audience that includes the general public and research communities alike, provide an ideal background for developing and supporting functions designed to enhance the user experience and serve as inputs to NLP systems.

One of the issues in reaching out to a pan Lusophone audience, or even, say, a Brazilian one, is the difficulty of selecting pronunciations that are not so abstract as to alienate users. Traditionally, this has not been of great concern to lexicographers, including authors of pronunciation dictionaries, who avoid presenting ‘a variant of contestable status, be it regional, stylistic or social, to the extent that they wish to continue to describe the norm’ (de Caluwe and van Santen 2003: 71-82).

¹ The authors gratefully acknowledge the support of the Fundação para a Ciência e a Tecnologia (PTDC/CLE-LIN/100335/2008), and the cooperation of Susan Fitt, whose development of the original English Unisyn Lexicon is the inspiration for this work.

The problem for Brazilian Portuguese, pushing global