ACCURATE ESTIMATE OF THE CROSS-VALIDATED PREDICTION ERROR VARIANCE IN BAYES CLASSIFIERS

Dimitrios Ververidis and Constantine Kotropoulos
Dept. of Informatics, Aristotle Univ. of Thessaloniki, Box 451, Thessaloniki 54124, Greece.
E-mails: {jimver,costas}@aiia.csd.auth.gr

ABSTRACT

A relationship between the variance of the prediction error committed by the Bayes classifier and the mean prediction error was established experimentally in previous work on emotional speech classification within a cross-validation framework. This paper theoretically justifies the validity of that relationship. Furthermore, it proves that the new estimate of the variance of the prediction error, treated as a random variable itself, exhibits a much smaller variance than the usual estimate obtained by cross-validation, even for a small number of repetitions. Accordingly, we claim that the proposed estimate is more accurate than the usual, straightforward estimate of the variance of the prediction error obtained by applying cross-validation.

1. INTRODUCTION

Two popular methods for estimating the prediction error of a classifier are the bootstrap and cross-validation. In these methods, the available dataset is divided repeatedly into a set used for designing the classifier (i.e., the training set) and a set used for testing the classifier (i.e., the test set). By averaging the prediction error over all repetitions, a more accurate estimate of the prediction error is hopefully obtained than the prediction error measured in a single repetition of the experiment. Both cross-validation [1] and the bootstrap [2] stem from the jackknife method. The jackknife was introduced by M. Quenouille for finding unbiased estimates of statistics, such as the sample mean and the sample variance [3, 4].
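To make Quenouille's idea concrete, the following is a minimal sketch of the leave-one-out jackknife bias correction applied to the plug-in variance estimator. The function and variable names are illustrative, not from the paper; the example only demonstrates the general bias-removal recipe $n\hat{\theta} - (n-1)\,\overline{\hat{\theta}_{(i)}}$.

```python
import numpy as np

def jackknife_estimate(data, statistic):
    """Leave-one-out jackknife: bias-corrected estimate of a statistic.

    `statistic` maps a 1-D array to a scalar. The correction removes
    any bias of the form a/n exactly (Quenouille's construction).
    """
    n = len(data)
    theta_full = statistic(data)
    # the statistic recomputed with each sample left out in turn
    theta_loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return n * theta_full - (n - 1) * theta_loo.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=50)
# jackknifing the biased variance (ddof=0) recovers the unbiased estimator
print(jackknife_estimate(x, lambda d: d.var()))
```

For the plug-in variance in particular, the jackknife correction is exact: the result coincides with the familiar unbiased sample variance with denominator $n-1$.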
Originally, the jackknife meant dividing the dataset into two equal sets, deriving the target statistic over the two sets independently, and then averaging the statistic estimates in order to obtain an unbiased statistic. Later, the jackknife was extended to split the dataset into many sets of equal cardinality. In another version, the statistic is estimated on the whole dataset except one sample, this procedure is repeated in a cyclic fashion, and finally the estimates are averaged. The latter “leave-one-out” version dominates in practice. A review of jackknife variants can be found in [5].

Ordinary cross-validation is the extension of the jackknife method that derives an unbiased estimate of the prediction error in the “leave-one-out” sense [1]. The computational demands of ordinary cross-validation rise proportionally to the number of samples. A variant of cross-validation with a smaller number of repetitions is s-fold cross-validation. During s-fold cross-validation the dataset is divided into s roughly equal subsets; the samples in s − 1 subsets are used for training the classifier, and the samples in the remaining subset are used for testing. The procedure is repeated for each one of the s subsets in a cyclic fashion, and the prediction error is estimated by averaging the prediction errors measured in the test phase of the s repetitions. Burman proposed repeated s-fold cross-validation for model selection, which is simply s-fold cross-validation repeated many times [6]. The prediction error measured during cross-validation repetitions is a random variable that follows the Gaussian distribution. Therefore, according to the central limit theorem (CLT), the more repetitions are averaged, the smaller the variance of the average prediction error becomes.

This work has been supported by the FP6 European Union Network of Excellence MUSCLE “Multimedia Understanding through Semantics, Computation and LEarning” (FP6-507752).
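The repeated s-fold procedure above can be sketched as follows. This is a minimal illustration on synthetic two-class data, with a nearest-class-mean rule standing in for the Bayes classifier; the data, the classifier, and all names are assumptions for the example, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic two-class data (illustrative, not the emotional-speech dataset)
n_per_class, dim = 60, 2
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, dim)),
               rng.normal(1.5, 1.0, (n_per_class, dim))])
y = np.repeat([0, 1], n_per_class)

def nearest_mean_error(X_tr, y_tr, X_te, y_te):
    # minimum-distance-to-class-mean rule, a simple stand-in classifier
    means = np.stack([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(((X_te[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
    return np.mean(pred != y_te)

def repeated_sfold_cv(X, y, s=5, B=20, rng=rng):
    """Repeated s-fold CV: returns the B*s per-fold error estimates."""
    errors = []
    for _ in range(B):                       # B independent repetitions
        idx = rng.permutation(len(y))        # fresh random partition
        for te in np.array_split(idx, s):    # each fold serves once as test set
            tr = np.setdiff1d(idx, te)
            errors.append(nearest_mean_error(X[tr], y[tr], X[te], y[te]))
    return np.array(errors)

e = repeated_sfold_cv(X, y)
# by the CLT, the standard error of the mean shrinks like 1/sqrt(B*s)
print(f"mean error {e.mean():.3f}, "
      f"std of the mean {e.std(ddof=1) / np.sqrt(len(e)):.4f}")
```

The printed standard error of the mean decreases with the number of accumulated fold estimates, which is the CLT behaviour the text appeals to.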
Throughout this paper, repeated s-fold cross-validation is simply denoted as cross-validation for short.

The outline of the paper is as follows. Section 2 deals with the prediction error committed by the Bayes classifier. A theoretical analysis of the factors that affect the variance of the prediction error is made in Section 3. A comparison of the proposed method, which predicts the variance of the cross-validated prediction error from a small number of repetitions, against the usual estimate of the variance of the prediction error is presented in Section 4. Finally, Section 5 concludes the paper by indicating future research directions.

2. CLASSIFIER DESIGN

Let $u^W = \{u_i^W\}_{i=1}^N$ be a set of $N$ samples, where $W = \{w_k\}_{k=1}^K$ is the feature set comprising $K$ features $w_k$. The samples can be considered as independent and identically distributed (i.i.d.) random variables (r.vs) distributed according to the multivariate distribution $F$ of the feature set $W$. Each sample $u_i^W = (\mathbf{y}_i^W, l_i)$ is treated as a pattern consisting of a measurement vector $\mathbf{y}_i^W$ and a label $l_i \in \{1, 2, \ldots, C\}$, where $C$ is the total number of classes.

Let us predict the label of a sample by processing the feature vectors using a classifier. Cross-validation (CV) calculates the mean over $b \in \{1, 2, \ldots, B\}$ prediction error estimates as follows. Let $s \in \{2, 3, \ldots, N/2\}$ be the number of folds the data should be divided into. To find the $b$th prediction error estimate, $N_D = \frac{s-1}{s} N$ samples are randomly selected without re-substitution from $u^W$ to build the design set $u_{D_b}^W$, while the remaining $u_{T_b}^W$ of $N/s$ samples forms the test set. The prediction error in CV repetition $b$ is the error committed by the Bayes classifier. For sample $u_i^W = (\mathbf{y}_i^W, l_i) \in u_{T_b}^W$, the