Linguistic Pattern Extraction and Analysis for Classic French Plays

Francesca Frontini, Mohamed Amine Boukhaled, Jean-Gabriel Ganascia
LIP6 (Laboratoire d’Informatique de Paris 6), Université Pierre et Marie Curie and CNRS / OBVIL
{francesca.frontini, mohamed.boukhaled, jean-gabriel.ganascia}@lip6.fr

1. Introduction and approach

Great authors of fiction and theatre have the capacity to create memorable characters that come to life and become almost as real as living persons to their readers or audience. The study of characterization, namely of how this is achieved, is a well-researched topic in corpus stylistics: for instance, Mahlberg (2012) attempts to identify typical lexical patterns for memorable Dickens characters by extracting those lexical bundles that stand out (namely, are overrepresented) in comparison to a general corpus. In other works, authorship attribution methods are applied to the different characters of a play to identify whether the author has been able to provide each of them with a “distinct” voice. For instance, Vogel & Lynch (2008) compare individual Shakespeare characters against the whole play or even against all plays of the same author.

The purpose of this paper is to propose a methodology for studying the characterization of several characters in French plays of the classical period. The tools developed are meant to support textual analysis by:

1) Verifying the degree of characterization of each character with respect to the others.
2) Automatically inducing a list of linguistic features that are significant and representative for that character.

Preliminary investigations have been conducted on plays by Molière, cross-comparing four protagonists from four different plays. The proposed methodology relies on sequential data mining for the extraction of linguistic patterns, and on correspondence analysis for the comparison of pattern frequencies across characters and for the visual representation of such differences.

2. Syntactic pattern extraction and ranking

In our study, we adopt a syntagmatic approach based on a configuration quite similar to the one proposed by Quiniou, Cellier, Charnois, & Legallois (2012). The text is first segmented into a set of sentences, and each sentence is then mapped to a sequence of syntactic (POS-tag) items. For example, the sentence “J'aime ma maison où j'ai grandi.” is first mapped to a sequence of POS tags, <PRO:PER VER:pres DET:POS NOM PRO:REL PRO:PER VER:pres VER:pper SENT>; then sequential patterns of a determined length are extracted. A minimal filtering is applied, removing patterns with less than 5% support; nevertheless, sequential pattern mining is known to produce (depending on the window and gap size) a large quantity of patterns even from relatively small samples of text.

In order to identify the most relevant patterns for each of the four characters, we thus used correspondence analysis (CA), a multivariate statistical technique for data analysis developed by Benzécri (1977). CA allows us to represent both the characters and the patterns in a two-dimensional space, thus making it visually clear not only which characters are more similar to each other but also which patterns are over- or underrepresented (that is, more distinctive) for each character or group of characters. Moreover, patterns can be