An unsupervised method for learning probabilistic first order logic models from unstructured clinical text

Rahul Jha (rahuljha@umich.edu)
Dragomir Radev (radev@umich.edu)
Computer Science and Engineering, 2260 Hayward Street, Ann Arbor, MI 48109-2121

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.

Abstract

We present a new unsupervised approach for learning probabilistic first order logic models from unstructured clinical text. We use Carroll, a system that generates a shallow semantic parse of natural language text, to extract predicates from the text. These predicates are then used to learn a simple probabilistic first order logic model of the underlying data. We describe our approach to learning these models automatically and show some preliminary results obtained by modelling a public clinical database.

1. Introduction

Generating robust prediction models from clinical data is a widespread concern in the medical community. Much of this data, however, is buried in free-text documents and is not available in a relational form that can be used for direct modelling. To build sophisticated prediction models, we need to be able to extract relational data from this text. Manually identifying the relevant patterns in the textual data and converting the data to a structured form for analysis would require a large investment of time and money. Unsupervised approaches for finding these correlations in structured data exist (Johnson, 1996; Butte & Kohane, 1999), but we found no method for detecting such patterns in unstructured text in an unsupervised way. With supervised approaches (Visweswaran et al., 2003), we believe the need to generate enough labelled data can be a limiting factor in applying these methods to a wide range of clinical data.

In this paper, we present a general approach for learning prediction models from unstructured clinical text.
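As a rough illustration of the predicate-extraction step described in the abstract, the following Python sketch turns a simple clinical statement into a first order predicate string. It is a toy stand-in for the actual shallow semantic parser: the function name, the "subject has object" pattern, and the output format are our own illustrative choices, not part of the system.

```python
import re

def extract_predicates(sentence):
    """Toy predicate extractor (illustrative only, not the real parser).
    Matches simple '<subject> has <object>' statements and emits a
    Has(subject, object) predicate string."""
    m = re.match(r"(?P<subj>[\w ]+?) has (?P<obj>[\w ]+)\.?$",
                 sentence.strip(), re.IGNORECASE)
    if not m:
        return []
    subj = m.group("subj").strip().replace(" ", "_")
    obj = m.group("obj").strip().replace(" ", "_")
    return ["Has(%s, %s)" % (subj, obj)]

print(extract_predicates("John Doe has hypertension."))
# -> ['Has(John_Doe, hypertension)']
```

A real system would of course need a full semantic parse to handle the variety of clinical language; the point here is only the shape of the output, i.e., predicates that can serve as evidence for a probabilistic model.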
Copyright 2011 by the author(s)/owner(s).

We make very few assumptions about the nature of the text; these are elaborated further in Section 5. As a result, the approach can be applied to a diverse range of clinical data.

We now present an outline of our system. After some normalization of the unstructured clinical text, we pass it through a semantic parser that produces an intermediate first order logic representation called logic forms (Moldovan & Rus, 2001). We use some heuristics to select a subset of the predicates discovered in this process. This subset of predicates is then used to create a first order probabilistic model of the information in the text, by supplying the predicates as evidence variables to a generalized model written in BLOG, a language for defining probabilistic first order models (Milch & Russell, 2007). The complete workflow is shown in Figure 1.

We use a first order probabilistic model for our system because it allows us to create a generalized model by expressing dependencies between classes of random variables. This model can then be made more robust using evidence as and when it arrives. In a propositional probabilistic language such as Bayesian networks, the number of random variables needed to define a scenario grows with the number of objects. First order probabilistic models, however, allow us to express these models concisely by abstracting over objects. For example, in BLOG, we can create a general model with the random variable classes Patient and Ailment, and the random function Has. We express a dependency between these by saying:

Has(Patient, Ailment) = 0.01

That is, every patient has every ailment with a small prior probability. Now, every time the model encounters a piece of evidence such as Has(John Doe, Hypertension), where John Doe is an object of type Patient and Hypertension is an object of type Ailment, it revises the probability values of the function Has to take it into account. Eventually, the probability values con-