Mining Biomedical Abstracts: What’s in a Term?

Goran Nenadić
Dept. of Computation, UMIST, Manchester
G.Nenadic@umist.ac.uk

Irena Spasić
Dept. of Chemistry, UMIST, Manchester
I.Spasic@umist.ac.uk

Sophia Ananiadou
Computer Science, University of Salford
S.Ananiadou@salford.ac.uk

Abstract

In this paper we present a study of the usage of terminology in biomedical literature, with the main aim of identifying phenomena that can be helpful for automatic term recognition in the domain. Our comparative analysis is based on the terminology used in the Genia corpus. We analyse the usage of ordinary biomedical terms as well as their variants (namely inflectional and orthographic alternatives, terms with prepositions, coordinated terms, etc.), showing the variability and dynamic nature of terms used in biomedical abstracts. Term coordination and terms containing prepositions are analysed in detail. We show that there is a discrepancy between terms used in literature and terms listed in controlled dictionaries. We also evaluate the effectiveness of incorporating different types of term variation into an automatic term recognition system.

1 Introduction

Biomedical information is crucial in research: details of clinical and/or basic research and experiments produce priceless resources for further development and applications (Pustejovsky et al., 2002). The problem, however, is the huge volume of the literature, which is constantly expanding in both size and thematic coverage. For example, the query “breast cancer treatment” submitted to PubMed (NLM, 2003a) returned nearly 70,000 abstracts in 2003, compared with 20,000 abstracts in 2001. Clearly, no domain specialist can manually examine such a large number of abstracts. An additional challenge is the rapid change of biomedical terminology and the diversity of its usage: almost every new biomedical text introduces new names and terms.
A further problem is the extensive terminological variation and the use of synonyms. For example, a study reported by Ding et al. (2002) found that, when querying Medline, target interactions contained in an abstract were “often described using a synonym of the query term”. The main source of this “terminological confusion” is that naming conventions are not completely clear or standardised, although some attempts in this direction are being made. Naming guidelines do exist for some types of biomedical concepts (e.g. the Guidelines for Human Gene Nomenclature (Lander et al., 2001)). However, domain experts also frequently introduce specific notations, acronyms, and ad-hoc and/or innovative names for new concepts, which they use either locally (within a document) or within the wider community. Even when an established name exists, authors may prefer (e.g. for traditional reasons) to use alternative names, variants or synonyms.

In this paper we present a detailed analysis of terminology usage, performed mainly on a corpus in which terms have been manually tagged. We analyse the terminology that is used in literature, rather than the terminology presented in controlled resources. After presenting the resources that we have used for our work in Section 2, in Section 3 we analyse the usage of ordinary term occurrences (i.e. term occurrences involving no structural variation), while in Section 4 we discuss more complex terminological variation (namely coordination and conjunctions of terms, terms with prepositions, acronyms, etc.). We also evaluate the effectiveness of incorporating specific types of term variation into an automatic term recognition (ATR) system, and we conclude by summarising our experiments.