Across Languages and Cultures 20 (2) pp. 197–211 (2019)
DOI: 10.1556/084.2019.20.2.3
1585-1923 © 2019 Akadémiai Kiadó, Budapest
THE TRADE-OFF BETWEEN QUANTITY
AND QUALITY. COMPARING A LARGE
CRAWLED CORPUS AND A SMALL FOCUSED
CORPUS FOR MEDICAL TERMINOLOGY
EXTRACTION
VERONIQUE HOSTE, KLAAR VANOPSTAL,
AYLA RIGOUTS TERRYN, ELS LEFEVER
Ghent University,
Department of Translation, Interpreting and Communication
Groot-Brittanniëlaan 45, 9000 Gent, Belgium
E-mail: veronique.hoste@ugent.be
E-mail: klaar.vanopstal@ugent.be
E-mail: ayla.rigoutsterryn@ugent.be
E-mail: els.lefever@ugent.be
Abstract: We investigate the cost-effectiveness of special-purpose crawled corpora
versus more focused corpora for automatic terminology extraction (ATE). Our focus is on
medical terminology related to heart failure in two languages: English, for which we have more
web and specialized resources at our disposal, and the less-resourced Dutch. We show that,
although term density is higher in the dedicated corpora for both languages, the potential for
term extraction is higher in the crawled corpora. Furthermore,
in a set of experiments in which we evaluate both types of corpora while keeping size constant,
we observe that more Gold Standard (GS) terms are covered by the “noisy” crawled
corpus than by a dedicated corpus of the same size.
Keywords: terminology, automatic terminology extraction, corpora, medical terminology