Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams

Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas

Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
{katak,greg,vlahavas}@csd.auth.gr

Abstract. Real-world text classification applications are of special interest to the machine learning and data mining community, mainly because they introduce and combine a number of special difficulties. They deal with high-dimensional, streaming, unstructured, and, on many occasions, concept-drifting data. Another important peculiarity of streaming text, not adequately discussed in the relevant literature, is the fact that the feature space is initially unavailable. In this paper, we discuss this aspect of textual data streams. We underline the necessity for a dynamic feature space and the utility of incremental feature selection in streaming text classification tasks. In addition, we describe a computationally undemanding incremental learning framework that could serve as a baseline in the field. Finally, we introduce a new concept-drifting dataset which could assist other researchers in the evaluation of new methodologies.

1 Introduction

The World Wide Web is a dynamic environment that offers many sources of continuous textual data, such as web pages, news feeds, emails, chat rooms, forums, Usenet groups, instant messages and blogs. There are many interesting applications involving classification of such textual streams. The most prevalent one is spam filtering. Other applications include filtering of pornographic web pages for safer child surfing and delivering personalized news feeds. All these applications present a great challenge for the data mining community, mainly because they introduce and/or combine a number of special difficulties. First of all, the data is high dimensional. We usually consider as feature space a vocabulary of hundreds of thousands of words.
Secondly, data in such applications always comes in a stream, meaning that we cannot store documents and are able to process them only upon their arrival. Thirdly, the phenomenon of concept drift [8] might appear. This means that the concept or the distribution of the target class in the classification problem may change over time. In this paper we tackle another issue that, to the best of our knowledge, has not been given enough attention by the research community. This is, the