UCLA Computer Science Department Technical Report CSD-TR No. 030056

Learning Naive Bayes Classifier from Noisy Data

Yirong Yang, Yi Xia, Yun Chi, and Richard R. Muntz
University of California, Los Angeles, CA 90095, USA
{yyr,xiayi,ychi,muntz}@cs.ucla.edu

Abstract. Classification is one of the major tasks in knowledge discovery and data mining. The naive Bayes classifier, in spite of its simplicity, has proven surprisingly effective in many practical applications. In real datasets, noise is inevitable, because of imprecise measurement or privacy-preserving mechanisms. In this paper, we develop a new approach, the LinEar-Equation-based noise-aWare bAYes classifier (LEEWAY), for learning the underlying naive Bayes classifier from noisy observations. Using a linear system of equations and optimization methods, LEEWAY reconstructs the underlying probability distributions of the noise-free dataset from the given noisy observations. By incorporating the noise model into the learning process, we improve the classification accuracy. Furthermore, as an estimate of the underlying naive Bayes classifier for the noise-free dataset, the reconstructed model can easily be combined with new observations corrupted at different noise levels to obtain good predictive accuracy. Several experiments are presented to evaluate the performance of LEEWAY. The experimental results show that LEEWAY is an effective technique for handling noisy data and that it provides higher classification accuracy than other traditional approaches.

Keywords: naive Bayes classifier, noisy data, classification, Bayesian network.

1 Introduction

Classification is one of the major tasks in knowledge discovery and data mining. The naive Bayes classifier, in spite of its simplicity, has proven surprisingly effective in many practical applications, including natural language processing, pattern classification, medical diagnosis, and information retrieval [12].
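As background for the discussion that follows, the standard naive Bayes prediction rule can be sketched in a few lines of Python: estimate P(C = c) and P(X_i = x | C = c) from counts over the training tuples, then pick the class maximizing their product. The toy dataset, function names, and plain count-based estimates below are illustrative assumptions, not part of this paper's method.

```python
from collections import defaultdict

def train_naive_bayes(data):
    """Tabulate counts for P(C = c) and P(X_i = x | C = c) from
    <feature vector, class value> pairs."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)   # keyed by (class, feature index, value)
    for features, c in data:
        class_counts[c] += 1
        for i, x in enumerate(features):
            feat_counts[(c, i, x)] += 1
    return class_counts, feat_counts

def predict(class_counts, feat_counts, features):
    """Return the class c maximizing P(c) * prod_i P(x_i | c),
    using the conditional-independence assumption."""
    total = sum(class_counts.values())
    best_c, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total                       # prior P(c)
        for i, x in enumerate(features):
            score *= feat_counts[(c, i, x)] / n_c # likelihood P(x_i | c)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Toy dataset: two binary features, binary class (an assumption for illustration).
data = [((1, 1), "yes"), ((1, 0), "yes"), ((0, 1), "no"), ((0, 0), "no")]
cc, fc = train_naive_bayes(data)
print(predict(cc, fc, (1, 1)))  # → yes
```

A production implementation would additionally smooth the count-based estimates (e.g., Laplace smoothing) so that an unseen feature value does not zero out an entire class's score.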
The input dataset for a naive Bayes classifier is a set of structured tuples comprised of <feature vector, class value> pairs. The fundamental assumption of the naive Bayes classifier is that the feature variables are conditionally independent given the class value. The classifier learns from the training dataset the conditional probability distribution of each feature variable Xi given the class value c. Given a new instance <x1, x2, ..., xn> of the feature vector <X1, X2, ..., Xn>, the goal of classification is then to predict the class value c with the highest posterior probability P(C = c | x1, x2, ..., xn). The classification accuracy depends not only on the learning algorithm, but also on the quality of the input dataset. In a real dataset, noise is inevitable,