Abstract—Despite the rapid growth of Internet technologies and applications, text is still the most common Internet medium; social networking applications and web applications, for example, are mostly text based. We developed a framework to determine an anonymous author’s native language from short, multi-genre texts such as those found in many Internet applications. In this framework, four types of features (lexical, syntactic, structural, and content-specific) are extracted, and three machine learning algorithms (C4.5 decision tree, support vector machine, and Naïve Bayes) are applied to identify the author’s native language based on the proposed features. To evaluate this framework, we used English, Persian, Turkish, and German online news texts. The experimental results showed that the proposed approach was able to identify the author’s native language in web-based texts with a satisfactory accuracy of 70% to 80%. Support vector machines outperformed the other two classification techniques in our experiments.

Index Terms—Native language identification, web-based texts, stylometry, classification techniques

I. INTRODUCTION

The rapid development and proliferation of Internet technologies and applications have created a new way to share information across time and space. Online social networking (such as Twitter and Facebook), e-commerce applications (such as eBay and Amazon), newsgroups, and similar services are gaining prominence. The Internet tends to favor shorter forms of communication over traditionally longer forms such as handwritten letters and essays.

Authorship profiling may be concerned with determining an author’s gender, native language, age, or some other attribute. In this paper, we are particularly interested in the native language of an author when it is not the language in which the text is written. We propose a framework for author’s native language identification on Web-based texts.
In this framework, four types of features identified in authorship-analysis research are extracted, and three major inductive learning techniques are used to build feature-based classification models that perform automated authorship profiling. Our framework addresses author’s native language identification from short Internet texts, guided by the following questions:

Can author’s native language identification be applied to Web-based texts?
Which types of writing-style features are effective for determining the author’s native language?
Which classification techniques are effective for detecting the author’s native language in online texts?

Most previous contributions on authorship attribution argued that a particular author can be determined from a set of candidate authors by examining the documents each author has written and matching a new document of unknown origin against a profile built for each author [1], [3], [4]. New research directions have since emerged, such as author profiling [5] and author gender identification [2], [6]. Related work on author’s native language identification has relied only on writers’ spelling and grammatical mistakes in traditional texts (e.g., essays), which are often influenced by patterns in their native language [7].

Manuscript received February 25, 2012; revised April 28, 2012. The authors are with the Department of Computer Engineering, Faculty of Engineering, Karadeniz Technical University, 61080 Trabzon, Turkey (e-mail: parham.tofighi@gmail.com).

II. AUTHOR’S NATIVE LANGUAGE IDENTIFICATION

In essence, the author’s native language identification problem is a classification problem that can be formulated as follows: given a set of English texts from authors with different native languages, assign a new anonymous text to one of the native-language classes. To implement this, a set of features is needed that remains relatively constant across a large number of texts written by authors of the same native language.
Once the feature set has been chosen, a given text can be represented by an N-dimensional vector, where N is the total number of features. Given a set of pre-classified texts, we can apply many techniques to determine the category of a new vector created from a new text.

Characteristics of Web-based texts. Compared with conventional objects of native language identification such as published articles [7], one challenge of native language identification for web-based texts and messages is the limited length of online texts. This short length may render ineffective some features that are identifying in normal texts. Analysis of a corpus of tens of thousands of English webpages indicates significant differences in writing style and content between native English writers and non-native writers [9]. Such differences can be exploited to determine an unknown author’s native language from the text of a webpage. Cyber users distribute messages mainly in English over cyberspace. Non-native English authors write in a systematic manner, and the characteristics of their writing often depend on the first language of the writer [8], [9].

Author’s Native Language Identification from Web-Based Texts
Parham Tofıghı, Cemal Köse, and Leila Rouka
International Journal of Computer and Communication Engineering, Vol. 1, No. 1, May 2012
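As an illustration of the N-dimensional vector representation described in Section II, the following is a minimal Python sketch and not the authors' implementation. The four features, the function-word list, and the names `extract_features`, `train_centroids`, and `classify` are assumptions chosen for illustration. A simple nearest-centroid rule stands in for the C4.5, support vector machine, and Naïve Bayes classifiers named in the abstract.

```python
import math
import re

# Hypothetical function-word list; the paper does not enumerate its features here.
FUNCTION_WORDS = ("the", "a", "an", "of", "in", "on", "to", "and")


def extract_features(text):
    """Map a text to a fixed-length feature vector (here N = 4)."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # guard against empty input
    return [
        sum(len(w) for w in words) / n_words,               # lexical: mean word length
        len(set(words)) / n_words,                          # lexical: type-token ratio
        sum(w in FUNCTION_WORDS for w in words) / n_words,  # syntactic: function-word rate
        len(words) / max(len(sentences), 1),                # structural: words per sentence
    ]


def train_centroids(labelled_texts):
    """Average the feature vectors of each class (nearest-centroid training)."""
    sums, counts = {}, {}
    for text, label in labelled_texts:
        vec = extract_features(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(text, centroids):
    """Return the class whose centroid is closest in Euclidean distance."""
    vec = extract_features(text)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))
```

In use, `train_centroids` would be given pairs of (text, native-language label) from the pre-classified texts, and `classify` would then assign a label to a new anonymous text. In practice the features would likely need scaling before distances are compared, since words per sentence has a much larger range than the ratio features.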