An Anaphora Resolution-Based Anonymization Module

M. Poesio,* M. A. Kabadjov,* P. Goux,* U. Kruschwitz,* E. Bishop† and L. Corti†

University of Essex
* Department of Computer Science / Language and Computation Group
† UKDA
Colchester CO4 3SQ, United Kingdom

Abstract

Growing privacy and security concerns mean there is an increasing need for data to be anonymized before being publicly released. We present a module for anonymizing references, implemented as part of the SQUAD tools for specifying and testing non-proprietary means of storing and marking up data using universal (XML) standards and technologies. The tool is implemented on top of the GUITAR anaphoric resolver.

1. Introduction

Growing privacy and security concerns mean there is an increasing need for data to be anonymized before being publicly released. Providing tools to facilitate the task is one of the goals of the Smart Qualitative Data (SQUAD) project. One of the project's objectives is to use natural language processing technology (specifically, the LT-XML tools¹ developed by the University of Edinburgh's Language Technology Group) to develop and implement user-friendly tools for semi-automating the processes that prepare qualitative data for traditional digital archiving and other types of processing. The tools developed as part of the project should make it possible for Social Sciences researchers to access data such as the transcripts of interviews stored in the University of Essex's Data Archive. However, the names of the individuals who agreed to participate in the interviews need to be anonymized, if possible automatically.

In this poster we present preliminary work on an anonymization tool developed as part of the SQUAD project. Like the rest of the software developed in the project, the anonymization tool is designed to work on top of the LT-XML tools and to interface with the NITE XML TOOLKIT (NXT).
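To make the task concrete, the following is a minimal sketch of a naive, string-based approach to anonymizing names; it is not the SQUAD tool. The function name `anonymize_names`, the placeholder format, and the sample sentence are all invented for illustration. The sketch assumes that every surface form used to mention an individual has been listed by hand.

```python
import re


def anonymize_names(text, surface_forms, placeholder):
    """Replace each listed surface form of one individual with a placeholder.

    Hypothetical helper for illustration only: the caller must supply
    every form by hand, and pronouns are not handled at all.
    """
    if not surface_forms:
        return text
    # Try longer forms first so "Jane Smith" is matched before "Jane".
    forms = sorted(surface_forms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b"
    )
    return pattern.sub(placeholder, text)


# Invented sample text, not taken from any archive transcript.
sample = (
    "Jane Smith was born in Colchester. "
    "Mrs Smith later said that Jane was her childhood nickname."
)
print(anonymize_names(sample, ["Jane Smith", "Mrs Smith", "Jane"], "[PERSON_1]"))
```

In this sketch any form missing from the list (for example a nickname) would be left in the text, and a pronoun such as "her" is never linked to the individual. This is the manual effort that a resolver-based approach aims to reduce.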
The key idea is to take advantage of an existing anaphora resolution system that is also designed to interface with the LT-XML tools: the GUITAR 3.1 system (Poesio and Kabadjov, 2004; Poesio et al., 2005), which we are already using for summarization (Steinberger et al., 2005). An anonymization tool based on an anaphoric / coreference resolver could potentially simplify the task of anonymization by eliminating the need to identify all possible forms used to mention a particular individual. This experiment would also provide us with a different way of evaluating GUITAR.

In this paper, we briefly describe GUITAR, then present the anonymization algorithm, and discuss future work.

¹ http://www.ltg.ed.ac.uk/software/xml/

2. GUITAR 3.1

GUITAR is an anaphora resolution system designed to be high precision, modular, and usable as an off-the-shelf component of an NLP pipeline such as the LT-XML tools.

2.1. Input

GUITAR takes XML input in a format called MAS-XML, which it augments to produce output also in XML format. It can work with a variety of preprocessing tools, ranging from simple POS taggers to chunkers (such as LT-CHUNK) to full parsers (an interface to Charniak's parser has been implemented), provided that their output can be converted into MAS-XML format (typically by heuristic methods). These features make GUITAR very suitable for the intended application, in which it will work as a component of a preprocessing module whose output will then be manually edited for final corrections using NXT.

MAS-XML is illustrated in Figure 2, which shows the type of input GUITAR expects for a text like the one in Figure 1. At a minimum, GUITAR expects the text to have been tokenized and POS-tagged, and sentences and nominal phrases (NEs) to have been identified. The system can also take advantage of other types of information if available, e.g., about grammatical function or named entity types.

My grandpa Gaunting married when my mother was - just under ten.
So - he remarried. And my mother calls her Doris as well.

Figure 1: An example of raw text

2.2. Anaphora Resolution Algorithms

GUITAR uses an implementation of the MARS pronoun resolution algorithm (Mitkov, 1998) to resolve personal and possessive pronouns. The system resolves definite descriptions using a partial implementation of the algorithm proposed by Vieira and Poesio (2000), augmented with a statistical classifier to identify discourse-new definite descriptions (Poesio et al., 2005). Finally, GUITAR 3.1 also includes an implementation of the shallow algorithm for resolving coreference with proper names proposed by Bontcheva et al. (2002).

Whenever GUITAR identifies an anaphoric relation, it adds to its output a new <ante> element specifying a possible anchor for the anaphoric expression participating in the relation; GUITAR never deletes anything from its input. For example, an ideal result for the input in Figure 2 would be for GUITAR to recognize that my grandpa Gaunting and