Testing Concept Indexing in Crosslingual Medical Text Classification

Francisco Carrero, Jose Carlos Cortizo
Universidad Europea de Madrid
{francisco.carrero, josecarlos.cortizo}@uem.es

Jose Maria Gomez
Departamento de I+D, Optenet
jgomez@optenet.com

Abstract

MetaMap is an online application that maps text to UMLS Metathesaurus concepts, which is very useful for interoperability among different languages and systems within the biomedical domain. MetaMap Transfer (MMTx) is a Java program that makes MetaMap available to biomedical researchers in a controlled, configurable environment. Currently there is no Spanish version of MetaMap, which makes it difficult to use the UMLS Metathesaurus to extract concepts from Spanish biomedical texts. Developing a Spanish version of MetaMap would be a huge task, since a great deal of work has gone into supporting the English version over the last sixteen years.

Our ongoing research is mainly focused on using biomedical concepts for cross-lingual text classification. In this context, using concepts instead of a bag-of-words representation allows us to approach text classification tasks independently of the language. In this paper we describe our experiments on combining automatic translation techniques with biomedical ontologies to produce an English text that MMTx can process in order to extract concepts for text classification.

1. Introduction

Information overload is common nowadays in our society. This is also the case for biomedical information, which is available from a variety of sources, such as scientific papers, databases of summaries, structured or semi-structured databases, web services, and clinical records of patients. Professionals in this domain need tools that help them access and visualize the information relevant to their needs. Medline, the most important and most consulted bibliographic database in the biomedical domain, is a prime example.
Medline contains more than 16 million references, growing by between 2,000 and 4,000 references per day, with over 670,000 added in 2007 [1].

In order to improve retrieval and interoperability between biomedical resources, one of the key solutions may lie in developing common terminologies that act as a metadata layer linking elements from various resources. UMLS (Unified Medical Language System) [4] is a major repository of standard biomedical terminologies, including controlled vocabularies and resources such as MeSH, ICD-10, the Gene Ontology and SNOMED-CT, which have served well in their respective domains. This knowledge has proved useful for many applications, including decision support systems, management of patient records, information retrieval and data mining.

Several systems have been developed whose main goal is to identify concepts through text analysis of documents, with applications ranging from genomics and drug identification to specific aspects such as protein-protein interaction [5, 13]. The MetaMap system [2], developed at the National Library of Medicine (NLM), is now the standard application for identifying biomedical concepts in free-text documents and mapping them to entries in UMLS.

1.1. Project Description

In this paper we present MIRCAT (Multilingual Information Retrieval based on Concepts and Automated Translation), a cross-lingual system that retrieves biomedical documents significantly related to medical records. Given a query in Spanish submitted by a user, it first retrieves a list of medical records ordered by relevance in two steps: 1) the query is expanded using concepts from a biomedical ontology (i.e., UMLS); 2) medical records are ranked using a representation based on biomedical concepts.
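The two retrieval steps just described can be illustrated with a minimal sketch. It is not the MIRCAT implementation: the term-to-concept table, the concept identifiers (which are placeholders, not real UMLS identifiers), the record contents and the choice of cosine similarity as the ranking measure are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical Spanish term -> concept table. The identifiers are
# placeholders invented for this sketch, not real UMLS concept IDs.
TERM_TO_CONCEPTS = {
    "fiebre": {"C_FEVER"},
    "tos": {"C_COUGH"},
    "gripe": {"C_INFLUENZA", "C_FEVER", "C_COUGH"},
}


def expand_query(query):
    """Step 1: expand a Spanish query into a bag of concept identifiers."""
    concepts = Counter()
    for term in query.lower().split():
        # Terms missing from the table contribute no concepts.
        concepts.update(TERM_TO_CONCEPTS.get(term, ()))
    return concepts


def cosine(a, b):
    """Cosine similarity between two concept-count vectors."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_records(query, records):
    """Step 2: rank records (id -> concept Counter) against the expanded query."""
    q = expand_query(query)
    scored = [(rid, cosine(q, concepts)) for rid, concepts in records.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy medical records, already represented as concept counts (assumed data).
RECORDS = {
    "record_a": Counter({"C_FEVER": 2, "C_COUGH": 1}),
    "record_b": Counter({"C_FRACTURE": 3}),
    "record_c": Counter({"C_INFLUENZA": 1}),
}

if __name__ == "__main__":
    for record_id, score in rank_records("gripe", RECORDS):
        print(record_id, round(score, 3))
```

Because both the query and the records are represented as concepts rather than words, the same ranking function could in principle be applied to documents in either language once their concepts have been extracted.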
Then, the user can choose a record and the system will retrieve several ranked lists of documents: 1) Spanish news; 2) English news; 3) Spanish article abstracts; and 4) English article abstracts. This last step uses concepts to rank the documents against the selected medical record.

Throughout all the phases we need to obtain a semantic document representation, which makes it crucial to use an accurate system to extract concepts from text. Keeping in mind that we are mainly working with UMLS, 978-1-4244-2917-2/08/$25.00 ©2008 IEEE 512