Toward Automatic Character Identification in Unannotated Narrative Text

Josep Valls-Vargas¹, Santiago Ontañón¹ and Jichen Zhu²
¹Computer Science, ²Digital Media
Drexel University
Philadelphia, PA 19104
josep.vallsvargas@drexel.edu, santi@cs.drexel.edu, jichen.zhu@drexel.edu

Abstract

We present a case-based approach to character identification in natural language text in the context of our Voz system. Voz first extracts entities from the text and, for each of them, computes a feature vector using both linguistic information and external knowledge. We propose a new similarity measure called Continuous Jaccard that exploits those feature vectors to compute the similarity between a given entity and those in the case-base, and thus determine which entities are characters. We evaluate our approach by comparing it with different similarity measures and feature sets. Results show an identification accuracy of up to 93.49%, significantly higher than recent related work.

Introduction

Computational narrative systems, especially story generation systems, require the narrative world to be encoded in some form of structured knowledge representation formalism (Bringsjord and Ferrucci 1999; Ontañón and Zhu 2011). Currently, this representation is hand-authored for most computational narrative systems. This is a notoriously time-consuming task requiring expertise in both storytelling and knowledge engineering. This well-known “authorial bottleneck” problem can be alleviated by automatically extracting information, at both linguistic and narrative levels, from existing written narrative text.

Except for a few pieces of work such as (Elson 2012; Finlayson 2008), automatically extracting structure-level narrative information directly from natural language text has not received much attention.
We would like to further connect the research areas of computational narrative and Natural Language Processing (NLP) in order to develop methods that can automatically extract structure-level narrative information (e.g., Proppian functions) directly from text.

In this paper, we present our approach to automatically identifying characters from unannotated stories in natural language (English). Characters play a crucial role in stories; they make events happen and push the plot forward. Depending on the genre, people, along with anthropomorphized animals and objects, can all act as characters in a story. Being able to identify which entities in the text are characters is a necessary step toward our long-term goal of extracting structure-level narrative information such as character roles.

We present a case-based approach for character identification in the context of our system Voz. After extracting entities from the text, Voz computes a set of 193 features using both linguistic information in the text and external knowledge. We propose a new similarity measure called Continuous Jaccard to compute similarity between a given entity and those in the case-base of our system, and thus determine whether the new entity is a character or not. We evaluate our approach by comparing it with different similarity measures and feature sets. Results show an identification accuracy of 93.49%, a significant increase from recent related work.

The rest of this paper is organized as follows. We first present related work. Then we discuss our approach for case-based character identification. After presenting our dataset and feature set, we discuss our empirical evaluation. Finally, we conclude and discuss directions for future work.

Related Work

Character identification, related to named entity recognition and nominal actor detection, is a crucial step toward narrative structure extraction.
Goyal et al.’s AESOP system (2010) explored how to extract characters and their affect states from textual narrative in order to produce plot units (Lehnert 1981) for a subset of Aesop fables. The system used both domain-specific assumptions (e.g., only two characters per fable) and external knowledge (word lists and hypernym relations in WordNet) in its character identification stage. More recently, Calix et al. (2013) proposed an approach for detecting characters (called “sentient actors” in their work) in spoken stories based on features in the transcribed textual content using ConceptNet and speech patterns (e.g., pitch). Their system detects characters through supervised learning techniques and uses this information for improving document retrieval. The work presented in this paper follows this line of work, but we propose a case-based approach with an extended set of features and a new similarity measure, obtaining significantly better results.

Also relevant to this paper is work on more general narrative structure extraction. Chambers and Jurafsky (2008) proposed using unsupervised induction to learn what they called “narrative event chains” from raw newswire text. In order to learn Schankian script-like information about the narrative world, they use unsupervised learning to detect the event structures as well as the roles