Active Learning of Hyperparameters: An Expected Cross Entropy Criterion for Active Model Selection

Johannes Kulick, Robert Lieck, Marc Toussaint

August 1, 2021

Abstract

In standard active learning, the learner's goal is to reduce the predictive uncertainty with as little data as possible. We consider a slightly different problem: the learner's goal is to uncover latent properties of the model (e.g., which features are relevant, as in "active feature selection", or the choice of hyperparameters) with as little data as possible. While the two goals are clearly related, we give examples where following the predictive uncertainty objective is suboptimal for uncovering latent parameters. We propose novel measures of information gain about the latent parameter, based on the divergence between the prior and the expected posterior distribution over the latent parameter in question. Notably, this is different from applying Bayesian experimental design to latent variables: we give explicit examples showing that the latter objective is prone to get stuck in local minima, unlike its application to the standard predictive uncertainty. Extensive evaluations show that active learning using our measures significantly accelerates the uncovering of latent model parameters, compared to standard version space approaches (query-by-committee) as well as predictive uncertainty measures.

1 Introduction

It is often the case that multiple statistical models are candidates for describing the data. Model selection tries to find the correct model class by applying various criteria or statistical methods such as cross-validation (Akaike, 1974; Schwarz, 1978; Kohavi, 1995). However, existing methods to select the best model from a set of candidates rely on a given batch of data.
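The core quantity the abstract describes, a divergence between the prior and the expected posterior over the latent parameter, can be illustrated schematically. The sketch below is not the authors' exact criterion; it computes, for a single candidate query with a discrete outcome, the expected KL divergence between the posterior over model classes and the current prior, averaged under the current predictive distribution. All distributions and numbers here are illustrative assumptions.

```python
import numpy as np

def expected_information_gain(prior, likelihoods):
    """Expected KL(posterior || prior) over model classes for one candidate query.

    prior:       shape (M,)   -- current belief over M candidate model classes
    likelihoods: shape (M, K) -- p(y_k | model_m, query) for K discrete outcomes
    """
    # Marginal predictive p(y_k | query) under the current belief over models.
    marginal = prior @ likelihoods                          # shape (K,)
    # Posterior over models for each possible outcome y_k (Bayes' rule).
    posterior = (prior[:, None] * likelihoods) / marginal   # shape (M, K)
    # KL(posterior || prior) per outcome, then averaged under the predictive.
    kl = np.sum(posterior * np.log(posterior / prior[:, None]), axis=0)
    return float(marginal @ kl)

# Two model classes, binary outcome. A query on which both models agree tells
# us nothing about which model is correct; a query on which they disagree does.
prior    = np.array([0.5, 0.5])
agree    = np.array([[0.7, 0.3],
                     [0.7, 0.3]])
disagree = np.array([[0.9, 0.1],
                     [0.1, 0.9]])

print(expected_information_gain(prior, agree))     # zero: models agree
print(expected_information_gain(prior, disagree))  # positive: models disagree
```

Ranking candidate queries by this score and asking the highest-scoring one is the greedy strategy that the paper's local-minima examples caution about; the measures proposed later in the paper modify this basic objective.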
While this is reasonable for a wide range of applications where gathering labeled data is comparatively easy, traditional model selection methods do not directly transfer to an active learning scenario, where labeling data is very expensive and time consuming. In active learning (Settles, 2012) the learning algorithm chooses its own data queries to optimize learning success. Active learning is very successful in reducing the amount of training data. However, while the data acquired by active learning methods can also be used for model selection, this is not the most efficient way to actively choose data for model selection.

In this paper we transfer active learning methods to the model selection context. This means that we want to actively choose samples that best discriminate between competing model classes or hypotheses. Generally, our problem setting is relevant whenever an early decision about latent parameters is more important than the reduction of predictive uncertainty itself, e.g., when the primary task is to infer the model class, the relevant features, or hyperparameters. However, we will provide evidence that active model selection may also improve the learning performance w.r.t. predictive uncertainty.

This work is motivated by our observation that a straightforward transfer of minimizing-predictive-uncertainty to minimizing-model-uncertainty, which is also utilized by Bayesian experimental design (discussed in detail below), may fail in the iterative case: it is prone to get stuck in local minima and exhibits the somewhat human behavior of always choosing queries that

arXiv:1409.7552v1 [stat.ML] 26 Sep 2014