A Suite to Compile and Analyze an LSP Corpus

Rogelio Nazar, Jorge Vivaldi, Teresa Cabré
Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Pl. de la Mercè 10-12, 08002 Barcelona
E-mail: {rogelio.nazar; jorge.vivaldi; teresa.cabre}@upf.edu

Abstract

This paper presents a series of tools for the extraction of specialized corpora from the web and their subsequent analysis, mainly with statistical techniques. It is an integrated system of original as well as standard tools, with a modular design that facilitates its integration into different systems. The first part of the paper describes the original techniques, which are devoted to the categorization of documents as relevant or irrelevant to the corpus under construction, a relevant document being a specialized document of the selected technical domain. Evaluation figures are provided for this original part, but not for the second part, which involves the analysis of the corpus and is composed of algorithms that are well known in the field of Natural Language Processing, such as KWIC search, measures of vocabulary richness, and the sorting of n-grams by frequency of occurrence or by measures of statistical association, distribution or similarity.

1 Introduction

This paper presents a software system consisting of an integrated set of tools for the acquisition of a specialized corpus from the web and its subsequent exploration by means of a collection of statistical techniques. Our main aim was terminology extraction, but other users may find these techniques useful for other research interests. The system is divided into two main modules. The first is devoted to corpus compilation from the web, with facilities for the selection of documents of a given domain. The second is organized as a series of algorithms used in natural language processing.
Both modules are independent: the corpus extracted with the first module is not necessarily the one that will serve as input for the second. The program is currently implemented as a web application [1]; however, new versions as a Perl module and a cross-platform GUI application are about to be released.

[1] The URL is http://jaguar.iula.upf.edu

2 The First Module: Extraction of a Corpus from the Web

Since the web became massively used, linguists have been aware that it is an invaluable source of data. Programs that simply download massive collections of documents are now common, but the result is usually a highly noisy corpus. BootCat (Baroni & Bernardini, 2004) is a better choice because it accepts a set of seed words as input. In our particular case, however, we are interested in going further: since our aim is the study of technical terminology, we need a tool capable of gathering a high-quality collection of specialized documents of a given domain.

2.1 The Algorithm

The proposed system requires, like BootCat, a term or a collection of terms to start the downloading process. However, there are different possibilities for training the system with feedback about the desired kind of documents, for instance by providing or selecting terms or documents considered relevant or irrelevant, among other parameters such as the language or the document format (html, pdf, doc, ps, etc.). In our experience, the selection of the document format has a dramatic impact on the quality of the downloaded corpus: the probability of gathering a specialized corpus is much higher when downloading pdf or ps documents instead of html. As mentioned above, starting from only a single term, the system can retrieve a collection of documents and perform an unsupervised classification, offering clusters to be selected as representative of the desired domain.
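As an illustration, the seeding step can be sketched as follows. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and the `filetype:` restriction stands in for whatever format filter the underlying search interface provides; the paper does not specify the actual query syntax used.

```python
from itertools import combinations

def build_queries(seed_terms, file_format="pdf", tuple_size=2):
    """Combine seed terms into search queries restricted to a given
    document format (pdf or ps tend to yield more specialized
    documents than html, per the observation in the text)."""
    queries = []
    for combo in combinations(seed_terms, tuple_size):
        quoted = " ".join('"%s"' % term for term in combo)
        queries.append("%s filetype:%s" % (quoted, file_format))
    return queries

# Three seed terms yield three pairwise queries, each format-restricted.
queries = build_queries(["broca", "taladro", "herramienta"])
```

The pairwise combination of seeds follows the BootCat idea of querying with tuples of seed words rather than single terms, which narrows results toward documents where the seeds co-occur.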
After that, more documents are retrieved and ranked according to their similarity to the selected cluster. The clustering is done by building co-occurrence networks with the best-weighted terms as nodes. Nodes are weighted using Mutual Information, taking a corpus of general language as reference for the expected word frequencies. The weight of an edge between a node i and a node j is W_ij = log(F_ij / N), where F_ij is the frequency of co-occurrence of the two nodes and N is the number of contexts analyzed, a context being a segment of a parameterizable number of words in which the input term occurs. Networks are pruned by eliminating the weakest connections. Hubs of nodes in these networks indicate the existence of documents about the same topic. Figure 1 shows an example of a division of Spanish documents containing the term broca: they may be about neurology, as in área de Broca, or documents that use the term in its sense as a piece of a drill.
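The edge-weighting and pruning step can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name and the pruning threshold are assumptions, and each context is simplified to a list of candidate terms (the MI-based node weighting against a reference corpus is omitted). The edge weight follows the formula given in the text, W_ij = log(F_ij / N).

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_network(contexts, min_weight):
    """Build a term co-occurrence network from contexts (each a list of
    terms found in one segment around the input term) and prune edges
    below min_weight. Edge weight: W_ij = log(F_ij / N), where F_ij is
    the co-occurrence frequency of terms i and j and N is the number
    of contexts analyzed."""
    n = len(contexts)
    freq = Counter()
    for context in contexts:
        # Count each unordered term pair once per context.
        for i, j in combinations(sorted(set(context)), 2):
            freq[(i, j)] += 1
    # Keep only the strongest connections; since F_ij <= N, weights
    # are <= 0 and min_weight is a (negative) log-ratio threshold.
    return {edge: math.log(f / n)
            for edge, f in freq.items()
            if math.log(f / n) >= min_weight}

# Terms co-occurring in at least 2 of 3 contexts survive the pruning.
net = cooccurrence_network(
    [["broca", "area", "afasia"],
     ["broca", "area"],
     ["broca", "taladro"]],
    min_weight=math.log(2 / 3))
```

Clusters of densely connected nodes ("hubs") in the pruned network then signal groups of documents on the same topic, as with the two senses of broca in Figure 1.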