Medical Document Categorization Using a Priori Knowledge

Lukasz Itert 1,2, Włodzisław Duch 2,3, and John Pestian 1

1 Department of Biomedical Informatics, 3333 Burnet Avenue, Children's Hospital Research Foundation, Cincinnati, OH 45229, USA
2 Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
3 School of Computer Engineering, Nanyang Technological University, Singapore

Abstract. A significant part of medical data remains stored as unstructured texts. Semantic search requires the introduction of markup tags. Experts use their background knowledge to categorize new documents, and knowing the category of these documents helps to disambiguate words and acronyms. A model of document similarity that includes a priori knowledge and captures the intuition of an expert is introduced. It has only a few parameters that may be evaluated using linear programming techniques. This approach, applied to the categorization of medical discharge summaries, provided a simpler and much more accurate model than alternative text categorization approaches.

1 Introduction

The dream of a semantic Internet populated with documents annotated with XML tags remains a difficult challenge. Automatic tools that convert unstructured textual data into semantically tagged documents are still elusive. In the medical domain the need to create these tools is acute because errors may be costly and, in medical vocabularies, abbreviations and acronyms are rampant. Critical differences between General English and Medical English have been analyzed in a number of publications [1]. The "Discovery System" (DS) data repository [2] at the Cincinnati Children's Hospital Medical Center (CCHMC), a large pediatric academic medical center with over 700,000 pediatric patient encounters per year, contains terabytes of medical data, mostly in the form of raw texts, stored in a complex relational database integrating many electronic hospital services.
The long-term goal of our research is to create tools that automatically annotate unstructured medical texts, adding full information about all medical concepts, resolving ambiguous terms, and expanding acronyms and abbreviations, using a variety of statistical and computational intelligence algorithms. The first step towards full semantic annotation and disambiguation of medical text requires discovery of the document topic, for example the main disease that has been treated. It is clear that a medical expert reading a given text quickly forms a hypothesis about the particular sub-domain the text belongs to and interprets the text in the light of background knowledge derived from medical studies, textbooks and individual experience. This is especially true if relatively short

W. Duch et al. (Eds.): ICANN 2005, LNCS 3696, pp. 641–646, 2005. © Springer-Verlag Berlin Heidelberg 2005