Testing Concept Indexing in Crosslingual Medical Text Classification
Francisco Carrero, Jose Carlos Cortizo
Universidad Europea de Madrid
{francisco.carrero, josecarlos.cortizo}@uem.es
Jose Maria Gomez
Departamento de I+D, Optenet
jgomez@optenet.com
Abstract
MetaMap is an online application that allows mapping
text to UMLS Metathesaurus concepts, which is very use-
ful for interoperability among different languages and sys-
tems within the biomedical domain. MetaMap Transfer
(MMTx) is a Java program that makes MetaMap available
to biomedical researchers in a controlled, configurable
environment. Currently there is no Spanish version of MetaMap,
which hinders the use of the UMLS Metathesaurus to extract
concepts from Spanish biomedical texts. Developing a
Spanish version of MetaMap would be a huge task, since
there has been a lot of work supporting the English version
for the last sixteen years.
Our ongoing research is mainly focused on using
biomedical concepts for cross-lingual text classification. In
this context, using concepts instead of a bag-of-words
representation allows us to approach text classification tasks
independently of the language. In this paper we describe our
experiments on combining automatic translation techniques
with the use of biomedical ontologies to produce an English
text that can be processed by MMTx in order to extract con-
cepts for text classification.
1. Introduction
Information overload is common in today's society.
This is also the case for biomedical information, which is
available from a variety of sources, such as scientific papers,
databases of summaries, structured or semi-structured
databases, web services, and patient clinical records. In
this domain, professionals need tools that help them
access and visualize the information relevant to their
needs. Medline, the most important and most widely consulted
bibliographic database in the biomedical domain, is a prime
example. Medline contains more than 16 million references,
growing by between 2,000 and 4,000 references per day,
with over 670,000 added in 2007 [1].
To improve retrieval and interoperability across
biomedical resources, one of the key solutions may
lie in developing common terminologies that act as
a metadata layer for linking elements from various
resources. UMLS (Unified Medical Language System) [4]
is a major repository of standard biomedical
terminologies, including controlled vocabularies and resources
such as MeSH, ICD-10, the Gene Ontology, and SNOMED-CT,
which have served well in their respective domains. This
knowledge has proved useful for many applications in-
cluding decision support systems, management of patient
records, information retrieval and data mining.
Several systems have been developed whose main
goal is to identify concepts through text analysis
of documents, with applications ranging from genomics and
drug identification to specific tasks such as protein-protein
interaction [5, 13]. The MetaMap system [2], developed at
the National Library of Medicine (NLM), is currently the
standard application for identifying biomedical concepts in
free-text documents and mapping them to entries in UMLS.
1.1. Project Description
In this paper we present MIRCAT (Multilingual Infor-
mation Retrieval based on Concepts and Automated Trans-
lation), a cross-lingual system to retrieve biomedical doc-
uments significantly related to medical records. Given a
query in Spanish submitted by a user, it first retrieves a
list of medical records ordered by relevance in two steps: 1)
the query is expanded using concepts from a biomedical
ontology (i.e., UMLS); 2) medical records are ranked
using a representation based on biomedical concepts. Then,
the user can choose a record and the system will retrieve
several lists of ranked documents as follows: 1) Spanish
news; 2) English news; 3) Spanish article abstracts; and 4)
English article abstracts. This last step is done by using
concepts to rank the documents against the selected medi-
cal record.
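The two retrieval steps described above (concept-based query expansion followed by concept-based ranking) can be illustrated with a minimal sketch. This is not the MIRCAT implementation: the lexicon, the concept identifiers, the related-concept links, the expansion weight, and the use of cosine similarity are all assumptions made for illustration, and the identifiers are placeholders rather than real UMLS concept identifiers.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: surface term -> made-up concept identifier.
# Spanish and English terms share one ID, so the representation is
# language independent. These are NOT real UMLS identifiers.
LEXICON = {
    "fiebre": "C_FEVER",
    "fever": "C_FEVER",
    "tos": "C_COUGH",
    "cough": "C_COUGH",
    "neumonia": "C_PNEUMONIA",
    "pneumonia": "C_PNEUMONIA",
}

# Hypothetical related-concept links, standing in for ontology relations.
RELATED = {"C_PNEUMONIA": ["C_FEVER", "C_COUGH"]}


def extract_concepts(text):
    """Map each known whitespace-separated token to its concept ID."""
    tokens = text.lower().split()
    return Counter(LEXICON[t] for t in tokens if t in LEXICON)


def expand(concepts, weight=0.5):
    """Step 1: add related concepts to the query at a reduced weight."""
    expanded = Counter(concepts)
    for cid, count in concepts.items():
        for rel in RELATED.get(cid, []):
            expanded[rel] += weight * count
    return expanded


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_records(query, records):
    """Step 2: rank records by concept similarity to the expanded query."""
    q = expand(extract_concepts(query))
    scored = [(cosine(q, extract_concepts(text)), rid)
              for rid, text in records.items()]
    # Highest score first; ties broken by record ID for determinism.
    return [rid for _, rid in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

In this sketch a Spanish query such as "neumonia" can match an English record that mentions only "fever" and "cough", because both sides are compared in concept space and the query has been expanded with related concepts.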
Throughout all these phases we need to obtain a semantic
document representation, which makes it crucial
to use an accurate system to extract concepts from text.
Keeping in mind that we are mainly working with UMLS,
978-1-4244-2917-2/08/$25.00 ©2008 IEEE 512