XRCE Personal Language Analytics Engine for Multilingual Author Profiling

Notebook for PAN at CLEF 2015

Scott Nowson, Julien Perez, Caroline Brun, Shachar Mirkin, and Claude Roux

Xerox Research Centre Europe
{first name.last name}@xrce.xerox.com

Abstract

This technical notebook describes the methodology used, and the results achieved, by the team from Xerox Research Centre Europe (XRCE) for the PAN 2015 Author Profiling Challenge. This year, personality traits are introduced alongside age and gender in a corpus of tweets in four languages: English, Spanish, Italian and Dutch. We describe a largely language-agnostic methodology for classification which uses language-specific linguistic processing to generate features. We also report on experiments in which we use machine translation (MT) to accommodate languages for which there is less training data. Native-language results are successful, but socio-demographic signals in language seem to be lost under MT conditions.

1 Introduction

Personal Language Analytics is a branch of text mining in which the object of analysis is the author of a document rather than the document itself. Language use in text (or indeed, speech) can reveal a great deal about a person: it can reveal one's gender, age or nationality, among other demographic traits. It can also provide clues as to one of the most important individual differences: personality. For example, when writing personal emails, outgoing, social Extraverts are more likely to start by saying 'hi' while Introverts opt for 'hello' [7].

Personality traits (and indeed the other human attributes mentioned) are a valuable source of information for applications such as user modeling or social media engagement. Work in this area, particularly in the computational recognition of personality, is garnering increasing interest, with a number of workshops being organized in recent years (e.g. [3], [21]).
The addition of personality as a target trait in the PAN Author Profiling challenge in 2015 [18] serves as further evidence.

This paper presents the contribution of Xerox Research Centre Europe to the Author Profiling challenge 2015. We leverage our experience in multilingual processing by using language-specific tools for each of the four languages of the data set (see section 2 for more details). However, our methodology beyond this processing is broadly language-agnostic: as far as possible we use a comparable feature set across the languages, and we use the same parameters in all our experiments.

One notable aspect of the dataset is the varying size of the corpora for the different languages. Therefore, in addition to exploring classification within each language in