Dynamic Feature Space and Incremental Feature Selection for the Classification of Textual Data Streams

Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas

Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
{katak,greg,vlahavas}@csd.auth.gr

Abstract. Real-world text classification applications are of special interest to the machine learning and data mining community, mainly because they introduce and combine a number of special difficulties. They deal with high-dimensional, streaming, unstructured, and, on many occasions, concept-drifting data. Another important peculiarity of streaming text, not adequately discussed in the relevant literature, is the fact that the feature space is initially unavailable. In this paper, we discuss this aspect of textual data streams. We underline the necessity for a dynamic feature space and the utility of incremental feature selection in streaming text classification tasks. In addition, we describe a computationally undemanding incremental learning framework that could serve as a baseline in the field. Finally, we introduce a new concept-drifting dataset which could assist other researchers in the evaluation of new methodologies.

1 Introduction

The World Wide Web is a dynamic environment that offers many sources of continuous textual data, such as web pages, news feeds, emails, chat rooms, forums, Usenet groups, instant messages and blogs. There are many interesting applications involving classification of such textual streams. The most prevalent one is spam filtering. Other applications include filtering of pornographic web pages for safer child surfing and delivering personalized news feeds. All these applications present a great challenge for the data mining community, mainly because they introduce and/or combine a number of special difficulties. First of all, the data is high dimensional. We usually consider as feature space a vocabulary of hundreds of thousands of words.
Secondly, data in such applications always come in a stream, meaning that we cannot store documents and can process them only upon their arrival. Thirdly, the phenomenon of concept drift [8] might appear. This means that the concept or the distribution of the target class in the classification problem may change over time. In this paper we tackle another issue that, to the best of our knowledge, has not been given enough attention by the research community. This is, the