A modular architecture for the processing of free text

Toni Badia, Gemma Boleda, Martí Quixal, Eva Bofias
Institut Universitari de Lingüística Aplicada
Universitat Pompeu Fabra
Rambla, 30-32
Barcelona E-08002
toni.badia@trad.upf.es, {gemma.boleda,marti.quixal}@iula.upf.es, eva@RhetoricalSystems.com

To appear in the Proceedings of the Workshop on 'Modular Programming applied to Natural Language Processing' at EUROLAN 2001

Abstract

This paper describes the free text processing strategy that is being set up in our institute. The system is designed to deal with general, written Catalan texts, as they appear in, say, daily newspapers. Our strategy has been to divide the whole processing into specific subtasks, applying to each of them the best strategy available. The main advantages of the architecture we put forth are that it is highly modular and reusable, and that it permits a fully automatic processing of unrestricted text.

1 Introduction

The processing streamline that we envisage is intended to carry out the automatic analysis of real Catalan texts. From the start it has been designed in a modular way, so that the best strategy for each specific task can be chosen, and a progressive improvement of the whole processing can be obtained as new modules are available.

We are interested in the tagging of texts with linguistic information, so that the operations that are performed on them can be based not only on their surface form but also on their linguistic structure. Our aim is to achieve a linguistic tagging of running text as precise and detailed as possible, bearing in mind a wide range of possible further applications (from grammar checking to information extraction). This tagging involved initially only part-of-speech, but is being extended to morphosyntactic and strictly syntactic information. We also plan to include semantic and pragmatic information in the future.

It is, however, impossible to accomplish this complex task in one shot, since neither the resources nor the techniques are fully available at any given moment. We therefore developed a processing setting in which we could (1) start processing and extracting information from texts from the very beginning of the project; and (2) add new modules if and when they became available.

The paper is organised as follows: section 2 describes the basic architecture of the system: the text handler and the morphological and syntactic analysis modules. Section 3 describes how we took advantage of previously existing tools (created at our institute or elsewhere). Section 4 presents several modules that we plan to add to our parsing architecture to achieve a deeper analysis. Section 5 details the current state of the project. Section 6 is a comparison with other approaches. The paper ends with some conclusions.

2 Basic architecture

At the beginning of the process (see Figure 1) we have a small text-handling module that prepares the text to be tagged with linguistic information. The kernel of the processing schema is the set of modules covering the morphological and shallow syntactic analysis. In each case there is a distinction between the initial assignment of tags and the subsequent disambiguation.

In the following subsections we describe each of the modules separately.
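
To make the layout above concrete, here is a minimal Python sketch of a text handler followed by interchangeable modules, with tag assignment kept separate from disambiguation. It is an illustration under stated assumptions, not the system described in this paper: the names `Token`, `handle_text`, `assign_tags`, `disambiguate` and `run_pipeline`, the toy lexicon, and the single disambiguation rule are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Token:
    form: str
    tags: List[str] = field(default_factory=list)  # candidate tags for this form


# A module is any function from a token list to a token list, so modules
# can be added, removed or replaced without touching the others.
Module = Callable[[List[Token]], List[Token]]

# Toy lexicon (hypothetical): word form -> possible part-of-speech tags.
TOY_LEXICON: Dict[str, List[str]] = {
    "la": ["DET", "PRON"],
    "casa": ["NOUN", "VERB"],
    "és": ["VERB"],
    "gran": ["ADJ"],
}


def handle_text(text: str) -> List[Token]:
    """Text handler: lowercase and split on whitespace (simplified)."""
    return [Token(form) for form in text.lower().split()]


def assign_tags(tokens: List[Token]) -> List[Token]:
    """Initial assignment: attach every candidate tag found in the lexicon."""
    return [Token(t.form, list(TOY_LEXICON.get(t.form, ["UNKNOWN"]))) for t in tokens]


def disambiguate(tokens: List[Token]) -> List[Token]:
    """Disambiguation: toy rule resolving a possible DET followed by a possible NOUN."""
    out = [Token(t.form, list(t.tags)) for t in tokens]
    for cur, nxt in zip(out, out[1:]):
        if "DET" in cur.tags and "NOUN" in nxt.tags:
            cur.tags = ["DET"]
            nxt.tags = ["NOUN"]
    return out


def run_pipeline(text: str, modules: List[Module]) -> List[Token]:
    """Run the text handler, then each module in order."""
    tokens = handle_text(text)
    for module in modules:
        tokens = module(tokens)
    return tokens


result = run_pipeline("La casa és gran", [assign_tags, disambiguate])
for token in result:
    print(token.form, token.tags)
```

Because every module shares the same signature, a later module (for instance, a shallow syntactic tagger) could be appended to the list passed to `run_pipeline` without changing the existing ones.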