Knowledge Mining with Genetic Programming Methods for Variable Selection in Flavor Design

Katya Vladislavleva, University of Antwerp, Belgium, katya@vanillamodeling.com
Kalyan Veeramachaneni, Massachusetts Institute of Technology, Cambridge, MA, kalyan@csail.mit.edu
Matt Burland, Givaudan Flavors Corp., Cincinnati, OH, matt.burland@givaudan.com
Jason Parcon, Givaudan Flavors Corp., Cincinnati, OH, jason.parcon@givaudan.com
Una-May O'Reilly, Massachusetts Institute of Technology, Cambridge, MA, unamay@csail.mit.edu

ABSTRACT
This paper presents a novel approach for knowledge mining from a sparse, repeated-measures dataset. Genetic programming based symbolic regression is employed to generate multiple models that provide alternate explanations of the data. This set of models, called an ensemble, is generated for each of the repeated measures separately. These multiple ensembles are then used to (a) identify which variables are important in each ensemble, (b) cluster the ensembles into groups whose response variables are driven by similar variables, and (c) measure the sensitivity of the response with respect to the important variables. We apply our methodology to a sensory science dataset. The data contain hedonic evaluations (liking scores), assigned by a diverse set of human testers, for a small set of flavors composed from seven ingredients. Our approach (1) identifies the important ingredients that drive each panelist's liking score, (2) segments the panelists into groups that are driven by the same ingredient, and (3) enables flavor scientists to perform sensitivity analysis of liking scores relative to changes in the levels of the important ingredients.
Categories and Subject Descriptors
I.1.2 [Computing Methodologies]: Symbolic and Algebraic Manipulation—algorithms

General Terms
Algorithms, Design, Experimentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GECCO'10, July 7–11, 2010, Portland, Oregon, USA. Copyright 2010 ACM 978-1-4503-0072-8/10/07 ...$10.00.

Keywords
variable selection, ensemble modeling, sensory science, genetic programming, symbolic regression

1. INTRODUCTION
Variable selection is the process of identifying influential variables (attributes) that are discriminative and necessary to describe a real or simulated system and its performance characteristics. Understanding the relative importance of variables makes a design problem tractable by reducing the dimensionality of the original problem. It shortens the design time by facilitating insight, and it improves the generalization power of models. These factors usually drive product costs down. In this paper, we consider variable selection in datasets that are sparse and contain repeated measures, because such datasets present unique challenges for variable selection. Consider a set of explanatory variables x = {x1, ..., xn}, a response variable y, and an unknown function F that relates x to y. Sparsity implies that very few data samples explaining F are available relative to the number of explanatory variables, n. A dataset contains repeated measures if the same samples are passed to different measuring functions (or responses), denoted Fs for s = 1, ..., l.
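This sparse, repeated-measures structure can be sketched with synthetic data. Everything below is an illustrative assumption rather than the paper's actual dataset: the sample and panelist counts, the per-panelist "driver" ingredient, and the linear liking-score formula are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_ingredients = 7      # explanatory variables x1..x7
n_samples = 40         # sparse: few flavor samples relative to the design space
n_panelists = 60       # l repeated measures, one response Fs per panelist

# Ingredient levels for each flavor sample (rows = samples, cols = variables).
X = rng.uniform(0.0, 1.0, size=(n_samples, n_ingredients))

# Assume each panelist's liking is driven by one (unknown) ingredient,
# so different panelists disagree on the very same flavor samples.
driver = rng.integers(0, n_ingredients, size=n_panelists)
Y = np.stack([9.0 * X[:, d] + rng.normal(0.0, 0.5, n_samples)
              for d in driver], axis=1)   # shape (n_samples, n_panelists)

# Large variance across panelists for the same sample signals that no
# single model fits everyone: one model per measuring function Fs is needed.
per_sample_var = Y.var(axis=1)
print(per_sample_var.mean())
```

Under this construction the mean per-sample variance is large, which is exactly the situation the next paragraph describes: the repeated measures disagree, so each Fs must be modeled separately.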
If there is large variance among a single sample's responses, there is no single model for the entire dataset, and one must build a model for each measuring function Fs(x). In this paper, we adopt an ensemble-based symbolic regression approach to provide multiple unbiased explanations of the input-output relationships in the data. There are several known advantages of symbolic regression over parametric regression. For example, symbolic regression can handle dependent and correlated variables and automatically discover various appropriate and diverse models. However, the multiple-model generating capability of genetic programming (GP) is the strongest argument for using symbolic regression on sparse data sets. To our surprise it is often ignored (or taken for granted), and a GP with single-objective fitness