Location and Scatter Matching for Dataset Shift in Text Mining

Bo Chen*, Wai Lam*, Ivor Tsang†, Tak-Lam Wong*
* The Chinese University of Hong Kong
Email: {bchen,wlam}@se.cuhk.edu.hk, wongtl@cse.cuhk.edu.hk
† Nanyang Technological University
Email: IvorTsang@ntu.edu.sg

Abstract—Dataset shift from the training data in a source domain to the data in a target domain poses a great challenge for many statistical learning methods. Most existing algorithms exploit only first-order statistics, namely the empirical mean discrepancy, to evaluate the distribution gap. Intuitively, considering only the empirical mean may not be statistically efficient. In this paper, we propose a non-parametric distance metric with a desirable property: it jointly considers the empirical mean (Location) and sample covariance (Scatter) differences. More specifically, we propose an improved symmetric Stein's loss function which combines the mean and covariance discrepancy into a unified Bregman matrix divergence, of which the Jensen-Shannon divergence between normal distributions is a particular case. Our aim is to find a good feature representation that reduces the distribution gap between different domains while ensuring that the derived representation encodes the most discriminative components with respect to the label information. We have conducted extensive experiments on several document classification datasets to demonstrate the effectiveness of our proposed method.

Keywords—Domain Adaptation, Feature Extraction

I. INTRODUCTION

Traditional statistical learning algorithms are constructed under the basic assumption that the training data is generated by exactly the same distribution as the testing data. In many real-world text mining problems, we wish to deploy a method in different domains. To cope with the varying data distributions across domains, we would need to collect sufficient labeled data for each domain to learn the model.
However, collecting such labeled data is often impractical or costly. In order to reduce the annotation effort across domains, we may instead adapt the model learned from one specific domain with labeled data, known as the source domain, to other domains, known as target domains, where only unlabeled data is available.

The distinction between training and testing distributions in a learning problem has been referred to as sample selection bias [1] or covariate shift [2], [3]. Sample selection bias refers to the fact that the training instances are originally drawn from the testing distribution, but are selected into the training set with some probability. Covariate shift is a particular kind of sample selection bias which allows the instance distributions of the training and testing sets to differ, but assumes that the conditional probabilities of the label variables given an instance remain unchanged.

There are two main approaches to removing the bias in the sample selection procedure. The first can be referred to as the instance-level approach [1], [2], [3], which infers re-sampling weights for the training samples by matching the distributions between the training and testing sets in the original feature space. The other can be referred to as the feature-level approach [4], [5], which tries to learn an optimal feature representation in which the marginal distributions of the data in different domains are closely matched. Both approaches reduce the distribution gap between the training and testing sets so as to propagate the label information, and both have proved effective in various applications. Currently, most existing instance-level and feature-level approaches are restricted to first-order statistics matching, which enforces that the empirical means of the training and testing instances are close in a Reproducing Kernel Hilbert Space (RKHS).
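The shortcoming of purely first-order matching can be illustrated with a minimal numpy sketch (not the authors' implementation; the synthetic data and function names are ours): two samples share the same population mean, so the squared distance between their empirical means, i.e. the linear-kernel variant of the mean discrepancy, reports almost no gap, even though their sample covariances differ substantially.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Source: isotropic 2-D Gaussian; target: same mean, strongly correlated features.
Xs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=n)
Xt = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)

def mean_discrepancy(X, Y):
    """Squared distance between the empirical means (first-order statistic)."""
    diff = X.mean(axis=0) - Y.mean(axis=0)
    return float(diff @ diff)

def covariance_gap(X, Y):
    """Frobenius norm of the sample-covariance difference (second-order statistic)."""
    return float(np.linalg.norm(np.cov(X.T) - np.cov(Y.T), "fro"))

print(mean_discrepancy(Xs, Xt))  # near zero: the means match
print(covariance_gap(Xs, Xt))    # clearly positive: the scatter differs
```

A criterion that also penalizes the covariance gap, as pursued in this paper, would flag these two samples as mismatched, whereas a mean-only criterion would not.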
Intuitively, such methods face a considerable limitation when matching two probability distributions that are similar only in their first-order statistics. Moreover, for many text mining applications, it is not appropriate to ignore the feature dependency, which can be explored by considering the document/instance covariance. This motivates us to utilize the covariance information to evaluate the distribution discrepancy. First, it yields a stronger distribution matching criterion than considering the mean alone. Second, we can exploit term dependency to distinguish domain-specific features from common features, and then, by investigating the sample covariance matrices, filter out those features whose similarity with other features varies greatly from the training data to the testing data.

In this paper, in order to overcome the limitations mentioned above, we develop a new method called LSM, built on a non-parametric distance metric with a desirable property: it jointly considers the empirical mean (Location) and sample covariance (Scatter) differences. More specifically, we propose an improved symmetric Stein's loss function which combines the mean and covariance discrepancy into a unified Bregman matrix divergence, of which the Jensen-Shannon divergence between normal distributions is a particular case. Our aim is to find a good feature representation which can reduce the embedded distribution