Continuous Authentication using Micro-Messages

Marcelo Luiz Brocardo, Issa Traore
Department of Electrical and Computer Engineering
University of Victoria - UVIC
Victoria, British Columbia, Canada
{marcelo.brocardo, itraore}@ece.uvic.ca

Abstract—Authorship verification consists of checking whether or not a target document was written by a specific individual. In this paper, we study the problem of authorship verification for Continuous Authentication (CA) purposes. Unlike traditional authorship verification, which focuses on long texts, we tackle the use of micro-messages. A shorter authentication delay (i.e. a smaller data sample) is essential to reduce the window size of the re-authentication period in CA. We explore lexical, syntactic, and application-specific features. We investigate two classification schemes: Logistic Regression (LR) on the one hand, and a hybrid classifier combining Support Vector Machine (SVM) and LR on the other. Experimental evaluation based on the Enron email dataset involving 76 authors and a Twitter dataset involving 100 authors yields very promising results, with Equal Error Rates (EER) of 9.18% and 11.83%, respectively.

Keywords—Continuous authentication; biometrics systems; authorship verification; stylometry; short message verification.

I. INTRODUCTION

Static authentication, where user identity is checked once at login time, can be circumvented no matter how strong the authentication mechanism is. Through attacks such as man-in-the-middle and its variants, an authenticated session can be hijacked after the initial login process has been completed. In the last decade, continuous authentication (CA) using biometrics has emerged as a possible remedy against session hijacking. CA consists of testing the authenticity of the user repeatedly throughout the authenticated session as data becomes available.

CA is expected to be carried out unobtrusively, due to its repetitive nature, which means that the authentication information must be collectible without any active involvement of the user and without using any special-purpose hardware devices (e.g. biometric readers). Emerging behavioural biometric technologies such as mouse dynamics and keystroke dynamics are good candidates for CA because data can be collected passively using standard computing devices (e.g. mouse and keyboard) throughout a session without any knowledge of the user [1]. We believe that a continuous authentication approach based on authorship analysis can achieve the same purpose and will contribute to improving verification accuracy while maintaining acceptable authentication delays.

The linguistic characteristics used to identify the author of a text are referred to as stylometry [2], [3]. Stylometric analysis consists of inferring the authorship of a document by extracting and analyzing the writing style, or stylometric features, from the document content.

A number of studies on stylometry have been published [4]–[7]. These papers typically target three different problems: authorship attribution or identification, authorship verification, and authorship profiling or characterization. Authorship attribution consists of determining the most likely author of a target document among a list of known individuals. Authorship verification consists of checking whether or not a target document was written by a specific individual. Authorship profiling or characterization consists of determining the characteristics (e.g. gender, age, and race) of the author of an anonymous document.

While existing work has shown promising results when extracting writing styles from long documents, its accuracy was considerably lower when analyzing short documents.
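
To make the notion of stylometric features concrete, the following is a minimal sketch of how a few simple lexical and character-level measurements could be extracted from a single short message. The function name `stylometric_features` and the particular features chosen (word count, average word length, vocabulary richness, uppercase and punctuation ratios) are illustrative assumptions; they are not the feature set used in this paper.

```python
import re
import string
from collections import Counter


def stylometric_features(message: str) -> dict:
    """Return a few simple style measurements for one short message.

    Hypothetical helper for illustration only; the features below are
    assumptions and not the paper's actual feature set.
    """
    # Treat runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", message)
    n_chars = len(message)
    n_words = len(words)
    # Case-insensitive vocabulary of the message.
    vocab = Counter(w.lower() for w in words)

    def ratio(count: int, total: int) -> float:
        # Guard against empty messages.
        return count / total if total else 0.0

    return {
        "n_chars": n_chars,
        "n_words": n_words,
        "avg_word_len": ratio(sum(len(w) for w in words), n_words),
        "vocab_richness": ratio(len(vocab), n_words),
        "upper_ratio": ratio(sum(c.isupper() for c in message), n_chars),
        "punct_ratio": ratio(sum(c in string.punctuation for c in message), n_chars),
    }


if __name__ == "__main__":
    print(stylometric_features("Hello world, hello again!"))
```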
Short-document analysis is challenging because such documents are poorly structured and use informal language, as opposed to literary documents.

In this paper, we compare a writing sample of an individual against the model or profile associated with the identity claimed by that individual at login time (i.e. 1-to-1 identity matching), which is very similar to authorship verification. The use of micro-messages (i.e. smaller data samples) allows a shorter authentication delay, which is essential to reduce the window of vulnerability of the system. We identify a set of new features including lexical, syntactic, and application-specific features. For classification, we investigate a hybrid classifier that combines two traditional classifiers, namely Support Vector Machine (SVM) and Logistic Regression (LR). We evaluate our model using the Enron email dataset and a micro-message dataset based on Twitter feeds, with 76 and 100 authors respectively. To our knowledge, this is the first time that a Twitter dataset has been used for authorship verification.

We evaluate our approach experimentally by computing the following performance metrics:

● False Acceptance Rate (FAR): measures the likelihood that the system may falsely recognize someone as the genuine person;
● False Rejection Rate (FRR): measures the likelihood that the system will fail to recognize the genuine person;
● Equal Error Rate (EER): corresponds to the operating point where FAR and FRR have the same value.

Our evaluation yields an EER of 9.18% on the Enron dataset and 11.83% on the micro-message corpus from Twitter.

The rest of the paper is structured as follows. Section II