Exact model averaging with naive Bayesian classifiers

Denver Dash (ddash@sis.pitt.edu)
Decision Systems Laboratory, Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15213 USA

Gregory F. Cooper (gfc@cbmi.upmc.edu)
Center for Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15213 USA

Abstract

The naive classifier is a well-established mathematical model whose simplicity, speed and accuracy have made it a popular choice for classification in AI and engineering. In this paper we show that, given N features of interest, it is possible to perform tractable exact model averaging (MA) over all 2^N possible feature-set models. In fact, we show that it is possible to calculate parameters for a single naive classifier C* such that C* produces predictions equivalent to those obtained by the full model averaging, and we show that C* can be constructed with the same time and space complexity required to construct a single naive classifier with MAP parameters. We present experimental results which show that the MA classifier typically outperforms the MAP classifier on simulated data, and we characterize how the relative performance varies with the number of variables, the number of training records, and the complexity of the generating distribution. Finally, we examine the performance of the MA naive model on the real-world ALARM and HEPAR networks and show that MA improves classification there as well.

1. Introduction

The general supervised classification problem seeks to create a model, based on labelled data, which can be used to classify future vectors of features F = {F_1, F_2, ..., F_N} into one of various classes of interest. The naive classifier is a probabilistic model that accomplishes this goal by making the assumption that any feature F_i ∈ F is conditionally independent of any other feature F_j ∈ F given the value of the class variable C. The naive model can be represented by the Bayes net shown in Figure 1.
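Under this conditional-independence assumption, the posterior over the class factorizes as P(C | F) ∝ P(C) · ∏_i P(F_i | C). The following minimal sketch (not from the paper; the data structures and names are hypothetical) shows that computation in Python:

```python
def classify(prior, likelihood, evidence):
    """Posterior over classes under the naive factorization
    P(c | evidence) ∝ P(c) * prod_i P(F_i = v_i | c).

    prior:      {class: P(class)}
    likelihood: {class: {feature: {value: P(value | class)}}}
    evidence:   {feature: value}; may be an incomplete instantiation,
                in which case the absent features are marginalized out.
    """
    scores = {}
    for c, pc in prior.items():
        score = pc  # start from the class prior
        for f, v in evidence.items():
            score *= likelihood[c][f][v]  # one factor per observed feature
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # normalize
```

Note that the loop touches each instantiated feature exactly once per class, which is the source of the linear classification cost discussed below.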
Figure 1. A naive network: C is the class node, which can take on one value for each possible class, and the F_i denote features of interest.

Naive classifiers have several desirable features. First, they are simple to construct, requiring very little domain background knowledge, as opposed to general Bayesian networks, which can require numerous intensive sessions with experts to produce the true dependence structure between features. Second, naive networks have very constrained space and time complexity: if N_f is the largest number of states any feature can take on, N_c is the number of class states, and N is the total number of features, then constructing a network requires the estimation of a set of O(N · N_f · N_c) parameters, each of which can be estimated from data in time O(N_r), where N_r is the number of records in the database. Given a completely specified naive network, classification of a new feature vector F′ can be performed in time O(|F′|), even if F′ is an incomplete instantiation of features. Finally, despite their naivety, these classifiers have been shown to perform surprisingly well in practice, for example in (Domingos & Pazzani, 1997; Friedman et al., 1997), comparing favorably to state-of-the-art neural network classifiers and even to the much more complex learning algorithms for general Bayes nets.

The construction of a single maximum a posteriori (MAP) naive classifier given a set F of attributes requires only two general steps: (1) Select the subset of features F′ ⊆ F judged to be relevant to classification,