Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space

Mohammad M. Masud¹, Qing Chen¹, Jing Gao², Latifur Khan¹, Jiawei Han², and Bhavani Thuraisingham¹

¹ University of Texas at Dallas
² University of Illinois at Urbana-Champaign

{mehedy,qingch}@utdallas.edu, jinggao3@uiuc.edu, lkhan@utdallas.edu, hanj@cs.uiuc.edu, bhavani.thuraisingham@utdallas.edu

Abstract. Data stream classification poses many challenges, most of which are not addressed by state-of-the-art techniques. We present DXMiner, which addresses four major challenges to data stream classification: infinite length, concept-drift, concept-evolution, and feature-evolution. Data streams are assumed to be infinite in length, which necessitates single-pass incremental learning techniques. Concept-drift occurs in a data stream when the underlying concept changes over time. Most existing data stream classification techniques address only the infinite length and concept-drift problems. However, concept-evolution and feature-evolution are also major challenges, and most existing approaches ignore them. Concept-evolution occurs in the stream when novel classes arrive, and feature-evolution occurs when new features emerge in the stream. Our previous work addresses the concept-evolution problem in addition to the infinite length and concept-drift problems. Most existing data stream classification techniques, including our previous work, assume that the feature space of the data points in the stream is static. This assumption may be impractical for some types of data, such as text data. DXMiner considers the dynamic nature of the feature space and provides an elegant solution for classification and novel class detection when the feature space is dynamic. We show that our approach outperforms state-of-the-art stream classification techniques in classifying and detecting novel classes in real data streams.
J.L. Balcázar et al. (Eds.): ECML PKDD 2010, Part II, LNAI 6322, pp. 337–352, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

The goal of data stream classification is to learn a model from past labeled data and to classify future instances using that model. Data stream classification poses many challenges. First, data streams have infinite length, so it is impossible to store all the historical data for training. Therefore, traditional learning algorithms that require multiple passes over the whole training data are not directly applicable to data streams. Second, data streams exhibit concept-drift, which occurs when the underlying concept of the data changes over time. A classification model must adapt itself to the most recent concept in order to