Binning: Converting Numerical Classification into Text Classification

Aynur A. Dayanik AYNUR@CS.RUTGERS.EDU
Sofus A. Macskassy SOFMAC@CS.RUTGERS.EDU
Haym Hirsh HIRSH@CS.RUTGERS.EDU
Department of Computer Science, Rutgers University, 110 Frelinghuysen Rd, Piscataway, NJ 08854-8019 USA

Abstract

Consider a supervised learning problem in which examples contain both numerical- and text-valued features. One common approach to this problem would be to treat the presence or absence of a word as a Boolean feature, which when combined with the other numerical features enables the application of a range of traditional feature-vector-based learning methods. This paper presents an alternative approach, in which numerical features are converted into "bag of words" features, enabling instead the use of a range of existing text-classification methods. Our approach creates a set of bins for each feature into which its observed values can fall. Two tokens are defined for each bin endpoint, representing on which side of a bin's endpoint a feature value lies. A numerical feature is then assigned the bag of tokens appropriate for its value. Not only does this approach make it possible to apply text-classification methods to problems involving both numerical and text-valued features; even problems that contain solely numerical features can be converted using this representation so that text-classification methods can be applied. We therefore evaluate our approach both on a range of real-world datasets taken from the UCI Repository that involve solely numerical features, and on additional datasets that contain both numerical- and text-valued features.
Our results show that the performance of the text-classification methods using the binning representation often meets or exceeds that of traditional supervised learning methods (C4.5, k-NN, NBC, and Ripper), even on existing numerical-feature-only datasets from the UCI Repository, suggesting that text-classification methods, coupled with binning, can serve as a credible learning approach for traditional supervised learning problems.

1. Introduction

Much research in machine learning has focused on supervised learning tasks in which class-labeled examples are given to a learning algorithm that then predicts the class label of new unlabeled examples. One popular class of supervised learning problems involves learning from labeled text, such as classifying news stories into categories based on their content. A variety of machine learning and information retrieval techniques have been applied to such problems (Billsus & Pazzani, 1999; Joachims, 1997). To apply a traditional feature-vector-based supervised learning method, each word that may appear in the text is viewed as a binary feature that is true if it is present and false otherwise. In contrast, information retrieval methods base classification on an analysis of the frequencies of the occurrences of words (or phrases) in each document (Salton, 1991).

In many problems, examples involve both text- and numerical-valued features. For example, the problem of classifying email messages into categories may depend not only on the words in each message, but also on other properties such as the length of the message or the time of day at which it was received (Macskassy, Dayanik & Hirsh, 1999; 2000). One approach for dealing with such data is to still treat each word as a feature, adding these word features to the other numerical-valued features to enable the application of feature-vector-based learning methods.
This paper instead presents an approach by which numbers are converted into a "bag of words" representation, so that traditional text-classification methods can be applied. Our approach creates bins into which each value may fall, and replaces a numerical feature with the bag of tokens representing the bins into which it falls. Thus, for example, consider an email classification task that includes a numerical feature representing the message's length. We can artificially invent a set of bins, such as "less than 500", "between 500 and 1000", "between 1000 and 2000", "between 2000 and 4000", and "greater than 4000". Each such bin has a lower and an upper endpoint, and for each such endpoint we define two tokens, one for each side of the endpoint on which a value may lie. Thus, for example, we might generate for the length of a message the tokens "lengthunder500", "lengthover500", "lengthunder1000", "lengthover1000", etc. A new message of length 3000 would thereby have its length feature converted into the set of tokens "lengthunder4000", "lengthover500", "lengthover1000", and "lengthover2000". These would be added as "words" to the other words in the text, yielding a bag-of-words representation for the entire example suitable for text-classification methods.

As will be discussed later in the paper, this approach bears some similarity to methods for discretizing continuous features for decision-tree learning methods (e.g., Catlett, 1991; Kerber, 1992; Fayyad & Irani, 1993; Dougherty, Kohavi & Sahami, 1995; Kohavi & Sahami, 1995; Frank & Witten, 1999). Rather than considering all possible splits between consecutive values for a given nu-
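The tokenization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper function is hypothetical, the endpoints and the "lengthunder"/"lengthover" token naming follow the message-length example in the text, and how a value lying exactly on an endpoint is handled is an assumption (the paper does not specify tie-breaking).

```python
# Sketch of bin-endpoint tokenization for a numerical feature.
# Hypothetical helper; endpoints and token names follow the paper's
# message-length example. Values equal to an endpoint are treated as
# "over" here, which is an assumption.

def bin_tokens(feature_name, value, endpoints):
    """Convert a numeric feature value into a bag of bin-endpoint tokens.

    For each endpoint, emit one token recording which side of that
    endpoint the value lies on.
    """
    tokens = []
    for e in endpoints:
        if value < e:
            tokens.append(f"{feature_name}under{e}")
        else:
            tokens.append(f"{feature_name}over{e}")
    return tokens

# A message of length 3000, with the bins from the running example:
print(bin_tokens("length", 3000, [500, 1000, 2000, 4000]))
# -> ['lengthover500', 'lengthover1000', 'lengthover2000', 'lengthunder4000']
```

The resulting tokens can simply be appended to the document's actual words, after which any bag-of-words text classifier can be applied unchanged.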