Research Statement Ricardo Silva Gatsby Computational Neuroscience Unit rbas@gatsby.ucl.ac.uk November 11, 2006 1 Philosophy My work lies on the intersection of computer science and statistics. The questions I want to answer are of the following nature: how can machines learn from experience? This raises questions about statistical modeling, since the nature of a phenomenon is only observable through a limited set of measurements: the data. Rather than explicitly programming a computer to perform a particular task, machine learning uses data and statistical models to achieve intelligent behavior. The outcome can be observed in tasks as diverse as: predicting user preferences (movie ratings are fashionable these days 1 ); filtering spam; adapting models of computer vision and speech recognition to new environments; improving retrieval of important documents; improving machine translation; and many others. We can also turn the question around and ask instead how machines can be used in new methods of data analysis, and improve scientific progress. Standard statistical practice focuses on studies with a small number of variables and data points, but the increase in the amount of data that has been collected is evident. The need for analysing high dimensional measurements, and combining different sources of data, is pressing. Now the issue turns to finding proper computational approaches for building models from data, and providing novel techniques for exploration and analysis within more thorough studies. In particular, my research addresses fundamental questions on learning with graphical models. More precisely, models with hidden (latent) variables. Such models are appropriate when the observed associations in our data are due to hidden common causes of our measured variables. This happens, for instance, if the observations are sensor data measuring atmospherical phenomena, medical instruments measuring biological processes, econometrical indicators measuring economical processes and so on. Reyment and Joreskog (1996) and Bollen (1989) provide an extensive list of examples of this class. Graphical models are a powerful language for expressing conditional independence constraints, a necessity if one aims to model large dimensional domains (Jordan, 1998). Graphical models also provide a language for causal modeling, as required if one needs to compute the effects of interventions. Examples of interventions are medical treatments, genetic engineering, public policy issues such as tax cuts, and marketing strategies, among others (Spirtes et al., 2000; Pearl, 2000). I believe that the best approach in solving a real problem lies in a careful statistical formulation of the question, identifying how to best use parametric and nonparametric statistical principles, which dependencies are necessary, which hidden variables could or should be used to model the observable phenomena, and finally, which computational methods should be applied. Although a crucial component of any machine learning solution, I do not believe algorithms should be the starting point of any learning framework: my philosophy is to write down which family of models should be the most appropriate for that domain, and only then concentrate on how to compute the desired predictions or model selection criteria. When computational limits are reached, one should approximate what is, to the best of our knowledge, the correct model. 1 http://www.netflixprize.com 1