“Tout ce qui n'est point vers, est prose”: Raymond Queneau’s Matrix Analysis of Language, Syntactic Stylometry, and Exploratory Programming

Mark Wolff
Hartwick College (Oneonta, New York, USA)

CONTACT
Mark Wolff
Associate Professor of French
Chair, Modern Languages
Hartwick College
Oneonta, NY 13820 USA
Email: wolffm@hartwick.edu
Phone: 607-431-4615
Web: markwolff.name

ABSTRACT
This is a report on experiments with a technique for syntactic stylometry using Raymond Queneau's matrix analysis of language, cluster analysis, and principal component analysis. Initially pursued as a method for authorship attribution, the technique is more accurate in distinguishing verse from prose according to syntax alone, with no explicit reference to semantics, phonetics or scansion. These unexpected results were produced through exploratory programming.

METHODS AND MATERIALS (cont.)
To examine syntactic structures more closely, I used a technique developed by Khmelev and Tweedie to measure text patterns with Markov chains. Given any text, one can produce a transition matrix that represents the frequencies of Markov chains of bigrams based on Queneau’s schema. Table 1 shows the transition matrix for Molière’s Tartuffe.

DISCUSSION
The results suggest that in the syntactical structure of a text, verse has a higher percentage of bigrams consisting of signifiers and biwords (which contain signifiers) than prose. Prose texts have a higher percentage of bigrams consisting of formatives. While some verse texts cluster with prose texts, all the tragedies (written in verse) remain distinct from prose. Many prose texts in the corpus contain sections of verse, which may explain why the biplot does not separate neatly into distinct groups. Georges de Scudéry’s La Comédie des comédiens, although largely composed of prose, is a short play with an entire scene written in verse (“Églogue,” II, 2).
Charles Du Fresny’s La Coquette de village appears distant from other verse texts, perhaps because it contains a substantial passage in prose (I, 1). Queneau’s schema for matrix analysis seems sensitive to mixed text types.

INTRODUCTION
Raymond Queneau, a founding member of the Oulipo, recognized the potential of computation for literary analysis. He developed a technique for measuring a text's syntax by tagging parts of speech according to two categories:
• signifiers, which include nouns, adjectives, and verbs (except avoir and être);
• formatives, which include everything else (avoir, être, pronouns, articles, conjunctions, prepositions, adverbs, interjections, etc.).
Given a word group such as a sentence, one can construct two matrices where the first matrix contains all formatives and the second all signifiers. If a word group contains two consecutive formatives or signifiers, one can use a unitary element in order to construct the matrices (see Figure 1).

CONCLUSIONS
It would seem that the Maître de Philosophie in Molière’s Bourgeois gentilhomme (II, 4) is not entirely risible when explaining the difference between verse and prose to Monsieur Jourdain. There appears to be a definite measurable difference between these two text forms, at least in French. What is remarkable about this finding is that the difference does not depend on specific word choice, meter or rhyme, even though those are the qualities readers appreciate in verse. This discovery came from exploratory programming, where a statistical technique commonly used to test authorship was applied to a purely syntactical transcription of texts. The investigation of an initial hypothesis (that authorship can be attributed to syntactical patterns) led to an entirely different conclusion through experimentation with computational techniques. One can use computers serendipitously to discover interesting things about texts.

Established techniques in stylometry typically measure word and n-gram frequencies with limited consideration of syntax.
While it is often easier to access and interpret statistically significant words in a text, an analysis of syntax alone can provide interesting and unexpected results. The analysis presented here represents what Nick Montfort calls exploratory programming, where “there's no specification or problem to be solved, but there are things to be discovered.”

REFERENCES
Beaudouin, Valérie, and François Yvon. (1996) “The Metrometer: a Tool for Analysing French Verse.” Literary and Linguistic Computing 11.1: 23-31. PDF file.
Du Fresny, Charles. (1715) La Coquette de village ou le lot supposé. Fièvre. Web. 29 June 2014.
Eder, Maciej, Mike Kestemont, and Jan Rybicki. (2013) “Stylometry with R: a Suite of Tools.” Digital Humanities: Conference Abstracts. Lincoln, NE: 2013. 487–89. PDF file.
Fièvre, P. (ed. 2007-2013). Théâtre classique. Web. 29 June 2014.
Khmelev, Dmitri V., and Fiona J. Tweedie. (2001) “Using Markov Chains for Identification of Writers.” Literary and Linguistic Computing 16.3: 299–307. Web. 30 Oct. 2013.
Molière. (1670) Le Bourgeois gentilhomme, comédie-ballet. Fièvre. Web. 29 June 2014.
Molière. (1669) Le Tartuffe, ou l’imposteur, comédie. Fièvre. Web. 29 June 2014.
Montfort, Nick. (2014) “Exploratory Programming.” Critical Code Studies Working Group. Web. 7 March 2014.
Queneau, Raymond. (1964) “L’Analyse matricielle du langage.” Etudes de linguistique appliquée 3: 37–50. Print.
Schmid, Helmut. “TreeTagger: a language independent part-of-speech tagger.” Institute for Natural Language Processing, University of Stuttgart. Web. 30 Oct. 2013.
Scudéry, Georges de. (1635) La comédie des comédiens, poème de nouvelle invention. Fièvre. Web. 29 June 2014.

Figure 1. The sentence “Le vilain chat a bien mangé la belle souris” can be represented as the product of two matrices.

     S          F          B          P
S    0.1103779  0.2567487  0.1997600  0.4331134
F    0.0000000  0.3675138  0.5159055  0.1165808
B    0.2013837  0.2776739  0.2215782  0.2993642
P    0.1322906  0.5360052  0.3020528  0.0296514
Table 1. Transition matrix for Tartuffe.
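
Each row of Table 1 sums to 1, which suggests that a cell gives the relative frequency of the column symbol following the row symbol. As a minimal sketch, assuming that reading and assuming a text has already been reduced to a string over S, F, B and P, such a matrix could be computed in Python as follows. The function name `transition_matrix` and the nested-dictionary layout are illustrative choices, not the poster's code (the original analysis was done in R), and the sample string is the Tartuffe fragment quoted on this poster with its spaces removed.

```python
from collections import Counter

SYMBOLS = "SFBP"  # signifier, formative, biword, punctuation


def transition_matrix(sequence, symbols=SYMBOLS):
    """Return row-normalised bigram frequencies as a nested dict.

    matrix[a][b] is the share of occurrences of symbol a (as the first
    element of a bigram) that are followed by symbol b. A symbol that
    never starts a bigram gets a row of zeros.
    """
    # Count every pair of adjacent symbols in the sequence.
    counts = Counter(zip(sequence, sequence[1:]))
    matrix = {}
    for a in symbols:
        row_total = sum(counts[(a, b)] for b in symbols)
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0)
            for b in symbols
        }
    return matrix


# Assumption: speaker and line boundaries are ignored, so the fragment
# is treated as one continuous string.
m = transition_matrix("FPSPFFPPFFPSB")
for a in SYMBOLS:
    print(a, [round(m[a][b], 3) for b in SYMBOLS])
```

A thirteen-symbol sample is far too short to compare with Table 1; the point is only the shape of the computation. As in Table 1, the F row carries no weight on S, because a formative followed directly by a signifier has already been merged into a biword.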
METHODS AND MATERIALS
The product of a formative and a signifier is a biword. By adopting the conventions that neither (1 x 1) nor (A x 1) + (1 x B) is allowed, one avoids uninteresting or redundant biwords. Any sentence can therefore be transformed into a sequence of pairs of words, and each pair is either a biword (B), a formative (F), or a signifier (S). According to this schema, the sentence in Figure 1 can be rendered as BSFB BS. With more complex texts it is necessary to account for punctuation (P) to allow for FPS (which would otherwise be B). These lines from Molière’s Tartuffe (I, 1)

[Cléante] Mais, Madame, après tout....
[M. Pernelle] Pour vous, Monsieur son frère,

can be represented as FPS PFFPPFFPS B.

Figure 2. Cluster analysis of 17th-century French theatre.

RESULTS
With each bigram as a distinct measurement of a text, one can analyze all the texts in the corpus as 15-dimensional vectors (FS never occurs because FS = B). Figure 3 is a biplot of a principal component analysis of the vector space. The significant rotations for PC1 are FP, BP and SP, correlated negatively with SB, FB and BB; those for PC2 are PB, SS, PS and BS, correlated negatively with BF, PF, FF and SF. A varimax rotation indicates that the dominant variable for PC1 is FP and the dominant variable for PC2 is PS.

Figure 3. Biplot of PCA of Markov chain transition matrix for 17th-century French theatre. PC1 accounts for 37% of variation and PC2 26% of variation. PC3 (not shown) accounts for 9% of variation.

After transforming a corpus of seventeenth-century French theatre texts (n=72) into sequences of F, S, B, and P with the TreeTagger parser, I used the stylo package for the R statistical programming environment to produce a cluster analysis (see Figure 2).
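
The reduction of tagged words to S, F, B and P described above can be sketched in Python. This is a minimal sketch, assuming a greedy left-to-right rule: a formative immediately followed by a signifier becomes one biword, and every other token passes through unchanged. That rule is inferred from the two worked examples on this poster, not taken from the author's code. The function name `encode` and the one-letter input tags are hypothetical, and treating the ellipsis after “tout” as two punctuation tokens is an assumption made to match the poster's rendering.

```python
def encode(tags):
    """Collapse coarse per-token tags into Queneau-style symbols.

    tags: iterable of "F" (formative), "S" (signifier) or
    "P" (punctuation), one per token.
    Returns a string over the symbols S, F, B and P.
    """
    tags = list(tags)
    out = []
    i = 0
    while i < len(tags):
        # A formative directly followed by a signifier merges into a biword.
        if tags[i] == "F" and i + 1 < len(tags) and tags[i + 1] == "S":
            out.append("B")
            i += 2
        else:
            out.append(tags[i])
            i += 1
    return "".join(out)


# "Le vilain chat a bien mangé la belle souris"
#   Le=F vilain=S chat=S a=F bien=F mangé=S la=F belle=S souris=S
figure1 = encode("FSSFFSFSS")

# "Mais, Madame, après tout.... Pour vous, Monsieur son frère"
# (hand-assigned tags; the final comma is left out, as in the poster)
tartuffe = encode("FPSPFFPPFFPSFS")

print(figure1, tartuffe)
```

With these inputs the function returns BSFBBS and FPSPFFPPFFPSB, which match the poster's renderings apart from spacing. Because every F directly followed by S is consumed, the bigram FS cannot appear in the output, which is why the texts are treated as 15-dimensional rather than 16-dimensional vectors.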