Kernel Methods for fMRI Pattern Prediction – applications of Relevance Vector Regression and Kernel Ridge Regression
Chia-Yueh Carlton Chu 1, Yizhao Ni 2, Geoffrey Tan 1, John Ashburner 1
1 Functional Imaging Laboratory, ION, UCL, London, UK. 2 ISIS Group, School of Electronics and Computer Science, University of Southampton.

General Kernel Regression
We denote each scan as a vector x_i. The kernel is a similarity measure between scans; for a linear kernel, it is the dot product between two scans. The predicted value is a kernel-weighted sum over the training scans:

y_i = Σ_{n=1}^{N} w_n k(x_n, x_i) + b + ε_i

In RVR, during the optimization many elements of α approach infinity, which means the corresponding elements of w approach zero. The final result is a sparse w.

Results
We are in the top 3 groups (scores > 0.75). The results of KRR and RVR are very similar. For "Dog" and "Interior", we used masks for the auditory and visual cortices. Emotional ratings seemed to be predicted better by a non-linear kernel, while most ratings can be predicted well by a linear kernel. The advantage of using a linear kernel is that a weight map, indicating the contribution of each voxel, can be generated as the weighted sum of the training images:

Weight map = Σ_{n=1}^{N} w_n x_n

This map may show interesting multivariate patterns which may not be present in a mass-univariate analysis.

References
[1] John Shawe-Taylor, Nello Cristianini. Kernel Methods for Pattern Analysis, Cambridge University Press, 2004, pp. 80-82, 290-293.
[2] M. E. Tipping. Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning Research (2001), 1, 211-244.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning, Springer, 2006, pp. 293, 345-356.

Abstract
The procedure involved rigid alignment and detrending with a high-pass filter, then applying multivariate linear pattern recognition approaches (kernel ridge regression (KRR) and relevance vector regression (RVR)). The whole fMRI sequences were used, and training was done with the HRF-convolved scores.
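The general kernel regression prediction and the weight map can be sketched as follows; this is a minimal illustration with toy data, not the competition pipeline, and all variable names are illustrative.

```python
import numpy as np

# Sketch: predict a rating as y_i = sum_n w_n * k(x_n, x_i) + b with a
# linear kernel (dot product between scans), and build the weight map
# as the weighted sum of training images. Toy data, illustrative only.

def linear_kernel(a, b):
    """Linear kernel: the dot product between two scans."""
    return a @ b

def predict(x, train_scans, w, b=0.0):
    """Kernel-weighted prediction for a new scan x."""
    k = np.array([linear_kernel(x_n, x) for x_n in train_scans])
    return w @ k + b

# Three toy "scans" with four voxels each
train_scans = np.array([[1., 0., 0., 0.],
                        [0., 1., 0., 0.],
                        [0., 0., 1., 0.]])
w = np.array([0.5, -0.2, 0.1])

y = predict(train_scans[0], train_scans, w, b=0.1)  # k = [1, 0, 0] -> 0.6
weight_map = w @ train_scans                        # sum_n w_n * x_n per voxel
```

With a linear kernel the prediction can equivalently be computed as a dot product between the scan and the weight map, which is why the map shows each voxel's contribution.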
Various pre-processing procedures were used (ROI masking, smoothing). A quadratic programming procedure was then carried out to perform a constrained deconvolution and re-convolution, followed by smoothing with a Gaussian kernel. This year, we predicted 4 ratings nearly perfectly (correlation ~0.99).

Introduction
The non-preprocessed images provided by the competition were pre-processed with SPM5. The low-frequency drift was removed by modelling it with 8 discrete cosine transform (DCT) basis functions. A couple of approaches were attempted in order to reduce the dimensionality of the data. The first involved masking out all voxels that were not grey matter. Spatial smoothing of the data was also tried (Gaussian, 6 mm FWHM). Cross-validation was used to select the regularization parameter for KRR: training with the first video and testing with the second, then vice versa. Training and testing were done with both KRR [1] and RVR [2,3]. We employed KRR for the first submission and RVR for the second. Neither method is superior to the other. (We suppose this is because an over-regularized KRR will sometimes reduce the variance of the prediction, and hence increase the correlation. In practice, the results of KRR and RVR are very close.) For our 3rd submission, we submitted the best results of the first 2 submissions.

Method
[Figure: a Gram matrix (linear kernel) of subject13.]

k(x_i, x_j) = x_i^T x_j

The predicted value y_i is a linear combination of the kernel entries between the input image x_i and the training images, plus a bias. Note: there is no bias for KRR.

Kernel Ridge Regression
For kernel ridge regression, w is obtained by:

w = (K_training + λI)^{-1} y_training

y_training is the ratings of the training scans, and λ is the regularization parameter, which is normally determined by cross-validation.
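The closed-form KRR solution, together with the two-fold cross-validation scheme of training on one video and testing on the other, can be sketched as below. The data here are random and illustrative; the split sizes, λ grid, and correlation scoring are assumptions for the sketch, not the competition settings.

```python
import numpy as np

# Sketch of kernel ridge regression with a linear kernel:
# dual weights w = (K + lambda*I)^{-1} y, selected by two-fold
# cross-validation (train on "video 1", test on "video 2", and vice
# versa), scored by correlation. Synthetic data, illustrative only.

def fit_krr(K, y, lam):
    """Solve (K + lam*I) w = y for the dual weights."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

rng = np.random.default_rng(0)
X1 = rng.standard_normal((30, 50))   # "video 1": 30 scans x 50 voxels
X2 = rng.standard_normal((30, 50))   # "video 2"
beta = rng.standard_normal(50)       # hidden linear rating model
y1, y2 = X1 @ beta, X2 @ beta

def cv_corr(lam):
    """Average correlation over the two train/test directions."""
    w12 = fit_krr(X1 @ X1.T, y1, lam)
    w21 = fit_krr(X2 @ X2.T, y2, lam)
    c1 = np.corrcoef(X2 @ X1.T @ w12, y2)[0, 1]
    c2 = np.corrcoef(X1 @ X2.T @ w21, y1)[0, 1]
    return (c1 + c2) / 2

lams = [0.1, 1.0, 10.0, 100.0]
best_lam = max(lams, key=cv_corr)
```

Because KRR works entirely through the N×N Gram matrix, the per-voxel dimensionality of the scans never enters the linear solve, which is what makes kernel methods practical for whole-brain fMRI data.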
Non-linear kernels (in contrast to the linear kernel above):

k_rbf(x_i, x_j) = exp(-γ ||x_i - x_j||²)
k_poly(x_i, x_j) = (x_i^T x_j + θ)^d

Relevance Vector Regression
RVR is formulated in a Bayesian framework, and involves estimating a marginal likelihood (ML-II) solution for a vector of hyper-parameters α and the noise parameter σ. These hyper-parameters are then used to estimate the best weights w. Given a data set of input–target pairs {x_n, y_n}, n = 1…N, we consider scalar-valued target functions only. x_n contains the voxels of the image volume, and y_n is the rating convolved with the haemodynamic response function [2].

Model specification:
p(w_i) = N(0, α_i^{-1}),   p(ε) = N(0, σ²)

Objective function:
p(y | α, σ²) = ∫ p(y | w, σ²) p(w | α) dw

[Figure: predicted vs. actual ratings for Subject14, "Instructions" and "Velocity".]

[Table: estimated correlation for each rating, based on cross-validation and our final prediction. Ratings: Velocity, Interior, Weapons, Tools, Fruit, Vegetables, Faces, Dog, Instructions, Search Fruit, Search Weapons, Search People, Hits, Valence, Arousal. Values range from ~0–0.3 for the worst-predicted ratings up to 0.98–0.99 for the best four.]
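The RBF and polynomial kernels quoted above can be sketched directly; the γ, θ, and d values here are illustrative choices, not the ones used in the submissions.

```python
import numpy as np

# Sketch of the two non-linear kernels:
#   k_rbf(x_i, x_j)  = exp(-gamma * ||x_i - x_j||^2)
#   k_poly(x_i, x_j) = (x_i^T x_j + theta)^d
# Parameter values are illustrative.

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel on two scan vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def poly_kernel(a, b, theta=1.0, d=2):
    """Inhomogeneous polynomial kernel of degree d."""
    return (a @ b + theta) ** d

x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 1.0])

r = rbf_kernel(x1, x2, gamma=0.5)        # exp(-0.5 * 2) = exp(-1)
p = poly_kernel(x1, x2, theta=1.0, d=2)  # (0 + 1)^2 = 1
```

Either function can be substituted for the linear kernel when building the Gram matrix; the downstream KRR or RVR machinery is unchanged, which is the main appeal of the kernel formulation.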