Lecture Notes on Linear Regression

Mário A. T. Figueiredo
Departamento de Engenharia Electrotécnica e de Computadores,
Instituto Superior Técnico, Lisboa, Portugal

April 2007

1 Least Squares Linear Regression

We are given a set of input-output pairs, $T = \{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$, where each $x^{(i)} \in \mathbb{R}^p$ and $y^{(i)} \in \mathbb{R}$. The goal is to estimate a linear "regression function" of the form
$$g(x, \beta, \beta_0) = \beta_0 + \beta^T x = \beta_0 + \sum_{j=1}^{p} \beta_j x_j,$$
where $\beta = [\beta_1, \dots, \beta_p]^T$ is the vector of "regression coefficients", $\beta_0$ is the "intercept coefficient", and $x_j$ denotes the $j$-th component of the vector $x$. The components of the $x^{(i)}$ are usually called the "explanatory variables" or "independent variables".

The classical estimation criterion is the minimization of the sum of squared errors on $T$, that is,
$$(\widehat{\beta}, \widehat{\beta}_0) = \arg\min_{\beta, \beta_0} \sum_{i=1}^{n} \left( y^{(i)} - g(x^{(i)}, \beta, \beta_0) \right)^2
= \arg\min_{\beta, \beta_0} \sum_{i=1}^{n} \left( y^{(i)} - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2
= \arg\min_{\beta, \beta_0} \sum_{i=1}^{n} \left( y^{(i)} - \beta_0 - \beta^T x^{(i)} \right)^2,$$
where $x_{i,j}$ is the $j$-th component of $x^{(i)}$.

We can start by getting rid of $\beta_0$: without loss of generality, we can assume that each regressor $x_j$ was "centered" by having its mean removed from the training set, that is,
$$\sum_{i=1}^{n} x_{i,j} = 0, \qquad (1)$$
for $j = 1, \dots, p$, and that the "response" variables also have zero sample mean,
$$\sum_{i=1}^{n} y^{(i)} = 0.$$
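The least-squares criterion and the centering trick above can be illustrated numerically. The following is a minimal sketch (not part of the notes) using NumPy on synthetic data with illustrative variable names; it centers the regressors and the response as in Eq. (1), solves the resulting normal equations without an intercept, and then recovers the intercept from the sample means:

```python
import numpy as np

# Synthetic data: n samples, p regressors, with known coefficients
# (true_beta, true_beta0 are illustrative, not from the notes).
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.5, -2.0, 0.5])
true_beta0 = 4.0
y = true_beta0 + X @ true_beta + 0.01 * rng.normal(size=n)

# Center each regressor and the response, as in Eq. (1):
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# With centered data the intercept drops out of the criterion, and the
# least-squares coefficients solve the normal equations
# (Xc^T Xc) beta = Xc^T yc:
beta_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

# The intercept is then recovered from the sample means:
beta0_hat = y.mean() - X.mean(axis=0) @ beta_hat

print(beta_hat, beta0_hat)
```

Because the noise level is small, the estimates land close to the coefficients used to generate the data; with noiseless data they would match exactly.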