Using LSI for Text Classification in the Presence of Background Text

Sarah Zelikovitz
Rutgers University
110 Frelinghuysen Road
Piscataway, NJ 08855
zelikovi@cs.rutgers.edu

Haym Hirsh
Rutgers University
110 Frelinghuysen Road
Piscataway, NJ 08855
hirsh@cs.rutgers.edu

ABSTRACT

This paper presents work that uses Latent Semantic Indexing (LSI) for text classification. However, in addition to relying on labeled training data, we improve classification accuracy by also using unlabeled data and other forms of available “background” text in the classification process. Rather than performing LSI’s singular value decomposition (SVD) process solely on the training data, we instead use an expanded term-by-document matrix that includes both the labeled data as well as any available and relevant background text. We report the performance of this approach on data sets both with and without the inclusion of the background text, and compare our work to other efforts that can incorporate unlabeled data and other background text in the classification process.

1. INTRODUCTION

The task of classifying textual data is both difficult and intensively studied [11, 16, 14]. Traditional machine learning programs use a training corpus of often hand-labeled data to classify new test examples. Training sets are sometimes extremely small, due to the difficult and tedious nature of labeling, and decisions can therefore be difficult to make with high confidence. Given the huge number of unlabeled examples, articles, Web sites, and other sources of information that often exist, it would be useful to take advantage of this additional information in some automatic fashion. These sources can be looked at as background knowledge that can aid in the classification task.
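To give a rough sense of the approach in the abstract, the sketch below computes the SVD over a term-by-document matrix that contains both the training and the background documents, then folds training and test documents into the resulting k-dimensional space. It is an illustrative simplification (raw term counts, dense SVD, no weighting), not the implementation evaluated in this paper; the function name and parameters are assumptions for illustration.

```python
import numpy as np

def lsi_embed(train_docs, background_docs, test_docs, k=2):
    """Project documents into a k-dimensional LSI space whose SVD is
    computed over training AND background documents together."""
    # Shared vocabulary over all documents (toy setup: whitespace tokens).
    vocab = sorted({w for d in train_docs + background_docs + test_docs
                    for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}

    def tf(doc):
        v = np.zeros(len(vocab))
        for w in doc.split():
            v[index[w]] += 1.0
        return v

    # Term-by-document matrix over labeled AND background text.
    A = np.column_stack([tf(d) for d in train_docs + background_docs])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    # Fold-in projection: d_hat = S_k^{-1} U_k^T d  (assumes s[:k] > 0).
    Uk = U[:, :k] / s[:k]

    train_vecs = np.array([Uk.T @ tf(d) for d in train_docs])
    test_vecs = np.array([Uk.T @ tf(d) for d in test_docs])
    return train_vecs, test_vecs
```

A test document can then be classified, for example, by cosine similarity to the projected training documents; because the SVD has seen the background text, co-occurrence patterns from that text shape the reduced space.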
For example, a number of researchers have explored the use of large corpora of unlabeled data to augment smaller amounts of labeled data for classification (such as augmenting a collection of labeled Web-page titles with large amounts of unlabeled Web-page titles obtained directly from the Web). Nigam et al. [14] use such background examples by first labeling them using a classifier formed on the original labeled data, adding them to the training data to learn a new classifier on the resulting expanded data, and then repeating anew the labeling of the originally unlabeled data. This approach yielded classification results that exceed those obtained without the extra unlabeled data.

Blum and Mitchell’s [3] co-training algorithm also applies to cases where there is a source of unlabeled data [13], but only in cases where the target concept can be described in two redundant ways (such as through two different subsets of attributes describing each example). Each view of the data is used to create a predictor, and each predictor is used to classify unlabeled data. The data labeled by one classifier can then be used to train the other classifier. Blum and Mitchell prove that under certain conditions, the use of unlabeled examples in this way is sufficient to PAC-learn a concept given only an initial weak learner.

A second example of background knowledge concerns cases where the data to which the learned classifier will be applied is available at the start of learning. For such learning problems, called transductive learning [12], these unlabeled examples may also prove helpful in improving the results of learning. For example, in transductive support vector machines [12, 1] the hyperplane chosen by the learner is based on both the labeled training data and the unlabeled test data. Joachims shows that this is a method for incorporating priors in the text classification process and that it performs well on such tasks.
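The label-then-retrain loop attributed to Nigam et al. above can be sketched in a much-simplified form. For illustration this substitutes a nearest-centroid classifier for their naive Bayes/EM procedure, so the function and its details are assumptions about the general pattern, not their algorithm:

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, rounds=3):
    """Simplified self-training: label the unlabeled pool with the current
    classifier, retrain on the expanded set, and repeat."""
    classes = np.unique(y_lab)
    X, y = X_lab, y_lab
    for _ in range(rounds):
        # "Train": one centroid per class (stand-in for a real classifier).
        centroids = np.array([X[y == c].mean(axis=0) for c in classes])
        # Pseudo-label every unlabeled example with the nearest centroid ...
        dist = ((X_unlab[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        pseudo = classes[dist.argmin(axis=1)]
        # ... and retrain on the original labels plus the guessed ones.
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, pseudo])
    # Final retrain on the last expanded set, returned as a predictor.
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    def predict(Q):
        d = ((Q[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return classes[d.argmin(axis=1)]
    return predict
```

The key property, as in the text, is that the final classifier is fit to far more examples than were originally labeled, at the cost of trusting its own guesses on the unlabeled pool.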
Zelikovitz and Hirsh [18] consider an even broader range of background text for use in classification, where the background text is hopefully relevant to the text classification domain but does not necessarily take the same general form as the training data. For example, a classification task over labeled Web-page titles might have access to large amounts of Web-page contents. Rather than viewing these as items to be classified or otherwise manipulated as if they were unlabeled examples, the pieces of background knowledge are used as indices into the set of labeled training examples. If a piece of background knowledge is close to both a training example and a test example, then the training example is considered close to the test example, even if the two share no words. The background text thus provides a mechanism by which a similarity metric is defined, and nearest-neighbor classification methods can then be used. Zelikovitz and Hirsh [18] show that their approach is especially useful when there is little training data and when each item in the data contains few words.

This paper describes yet another way of using this broader range of background knowledge to aid classification. It neither classifies the background knowledge nor directly compares it to any training or test examples. Instead, it exploits the fact that knowing that certain words often co-occur may be helpful in learning, and that such co-occurrence patterns can be discovered from large collections of text in the domain.
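To make the bridging idea of Zelikovitz and Hirsh described above concrete, the sketch below scores a (training, test) pair by the background document that is closest to both of them. The min/max scoring rule is an illustrative choice, not the WHIRL-based measure of [18]; the function names are assumptions.

```python
import numpy as np

def cos_matrix(A, B):
    """Cosine similarity between the rows of A and the rows of B."""
    An = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    Bn = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return An @ Bn.T

def bridged_similarity(train_vecs, test_vecs, background_vecs):
    """Score each (train, test) pair by the best background document that
    is close to BOTH, so two texts can match without sharing any word."""
    tr_bg = cos_matrix(train_vecs, background_vecs)   # train x background
    te_bg = cos_matrix(test_vecs, background_vecs)    # test  x background
    # For every (train, test) pair: max over background docs b of
    # min(sim(train, b), sim(test, b)).
    return np.max(np.minimum(tr_bg[:, None, :], te_bg[None, :, :]), axis=2)
```

A training and a test document that share no terms still receive a high score whenever some background document overlaps heavily with both, which is exactly the role the background text plays in the nearest-neighbor scheme described above.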