Competitive Online Generalised Linear Regression with Multidimensional Outputs

Raisa Dzhamtyrova
Department of Computer Science, Royal Holloway, University of London, Egham, United Kingdom
Raisa.Dzhamtyrova@rhul.ac.uk

Yuri Kalnishkan
Department of Computer Science, Royal Holloway, University of London, Egham, United Kingdom
Laboratory of Advanced Combinatorics and Network Applications, Moscow Institute of Physics and Technology
Yuri.Kalnishkan@rhul.ac.uk

Abstract—We apply online prediction with expert advice to construct a universal algorithm for the multi-class classification problem. Our experts are generalised linear regression models with multidimensional outputs, i.e. neural networks with multiple output nodes but no hidden nodes. We allow the final-layer transfer function to be a softmax function with linear activations to all output neurons. We build an online algorithm competitive with all experts of this type and derive an upper bound on the cumulative loss of the algorithm. We carry out experiments on three data sets and compare the cumulative losses of our algorithm and of a single neuron with multiple output nodes.

I. INTRODUCTION

We consider the online setting, in which predictions and outcomes are given step by step. Contrary to batch mode, where the algorithm is trained on a training set and gives predictions on a test set, we learn as soon as new observations become available. For example, suppose that our aim is to predict the outcomes of football matches of a new season based on the data available from a previous season. A batch algorithm builds a model on the previous season's data, and this model is used to make predictions for all matches of the current season. In the online setting we add data sequentially and adjust the parameters of the model after each match. More formally, we consider an online protocol where at each trial t = 1, 2, ... a learner observes x_t and attempts to predict an outcome y_t, which is shown to the learner later.
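The protocol above can be sketched as a simple loop. This is an illustrative skeleton only, not the paper's algorithm: the names (`online_protocol`, `ConstantLearner`) and the trivial placeholder learner are assumptions made for the example, and the logarithmic loss is used as the measure of performance.

```python
import math


def log_loss(p, y):
    # Negative log-likelihood of the true class y under the
    # predicted probability vector p.
    return -math.log(p[y])


class ConstantLearner:
    # Trivial placeholder learner: always predicts uniform
    # probabilities over k classes (illustrative only).
    def __init__(self, k):
        self.k = k

    def predict(self, x):
        return [1.0 / self.k] * self.k

    def update(self, x, y):
        pass  # a real learner would adjust its parameters here


def online_protocol(stream, learner):
    """Run the online protocol and return the cumulative loss."""
    cumulative_loss = 0.0
    for x_t, y_t in stream:                   # trial t = 1, 2, ...
        p_t = learner.predict(x_t)            # learner announces a prediction
        cumulative_loss += log_loss(p_t, y_t)  # outcome y_t is then revealed
        learner.update(x_t, y_t)              # learner learns from the outcome
    return cumulative_loss
```

Note that the learner commits to its prediction before the outcome is revealed; the cumulative loss accumulated over all trials is the quantity the paper's bounds are stated in terms of.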
The performance of the learner is measured by means of the cumulative loss.

The main goal of this paper is to develop a universal algorithm which is competitive with all generalised linear regression models with multidimensional outputs. For this purpose we use the method of online prediction with expert advice. At each trial we have access to the predictions of experts and need to make a prediction based on the experts' past performance.

In statistical learning, some assumptions are usually made about the mechanism that generates the data, and guarantees are given for the method working under these assumptions. For example, one may assume a linear dependence between electricity consumption and temperature and try to fit the best parameters for linear regression. We consider the adversarial setting, where no assumptions are made about the data-generating process. In this paper we consider competitive prediction, where one provides guarantees relative to other predictive models, called experts. Experts could be real human experts, complex machine learning algorithms, or even classes of functions. Our goal is to develop a merging strategy that performs not much worse than the retrospectively best expert. As a result, we do not try to build a model that works under certain assumptions, but combine the predictions that are given to us by experts. One may wonder why we do not simply use the predictions of the single best expert from the beginning and ignore the predictions of the others. First, we may not have enough data to identify the best expert from the start. Second, good performance in the past does not necessarily lead to good performance in the future.

In this paper our experts are generalised linear regression models with multidimensional outputs, which can be seen as neural networks with multiple output nodes but no hidden nodes. We allow the final-layer transfer function to be a softmax function with linear activations to all output neurons.
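The prediction of a single such expert can be sketched as follows: each output node computes a linear activation of the input, and the softmax of these activations gives a probability vector over the classes. This is a minimal illustration of the model class, not the paper's implementation; the function name and parametrisation (`theta` as one weight vector per class) are assumptions of the example.

```python
import math


def softmax_expert(theta, x):
    # One expert: a generalised linear model with multidimensional
    # output, i.e. a network with multiple output nodes and no hidden
    # nodes. theta is a list of weight vectors, one per output node.
    activations = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in theta]
    m = max(activations)                     # subtract max for numerical stability
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)
    return [e / z for e in exps]             # probability vector over classes
```

Each choice of the parameters `theta` defines one expert in the pool; the merging strategy developed in the paper competes with every such choice simultaneously.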
Each expert follows a particular strategy: it chooses some particular parameters of the softmax function. Our goal is to develop a merging strategy that performs not much worse than the best expert.

In this paper we consider the method of online prediction with expert advice. Online convex optimization is a similar area, where a decision-maker makes a sequence of decisions from a fixed feasible set. After each point is chosen, it encounters a convex cost function. In [4] a logarithmic regret bound is obtained for α-convex cost functions. However, the second derivative of our cost function is not bounded below, and therefore this analysis is not applicable here. A similar problem was considered in [6], where the authors proposed a general additive algorithm based on gradient descent and derived loss bounds comparing the loss of the resulting online algorithm to the loss of the best offline predictor from the relevant model class. They considered a softmax transfer function (Example 4 in [6]) and achieved a theoretical bound with a multiplicative coefficient of two in front of the loss of the best expert, whereas we achieve a multiplicative coefficient of one, which indicates that our theoretical bound is better for large losses.

We consider an approach based on the Aggregating Algorithm (AA), which was first introduced in [8]. AA works