KAFE: Automated Feature Enhancement for Predictive Modeling using External Knowledge

Sainyam Galhotra¹*, Udayan Khurana², Oktie Hassanzadeh², Kavitha Srinivas², Horst Samulowitz²
¹ University of Massachusetts Amherst, ² IBM Research
sainyam@cs.umass.edu, {ukhurana, hassanzadeh, kavitha.srinivas, samulowitz}@us.ibm.com

Abstract

The efficacy of a supervised learning model depends on the predictive ability of the underlying features. A crucial means of improving model quality is to add new features with additional predictive power. This is often performed by domain specialists or data scientists. It is a complex and time-consuming task because the domain expert needs to identify data sources for new features, join them, and finally select the features that are relevant to the prediction. We present a new system called KAFE (Knowledge Aided Feature Engineering) that helps build strong predictive models by automatically performing feature enhancement. It utilizes structured knowledge present on the web, such as knowledge graphs and web tables, and identifies additional information that can improve the accuracy of predictive models. In this paper, we describe the key aspects of the system, such as feature inference and selection, along with relevant data indexing for numerical and categorical features.

1 Introduction

In recent years, we have witnessed a proliferation of predictive modeling applications in various domains. The cornerstone of building effective models is successful feature engineering. It is a lengthy and time-consuming process that relies on the domain knowledge of the data scientist, and often accounts for up to 80% of the time involved in building predictive models². For instance, when building a predictive model of sales at a store, seasonal holidays such as Christmas are important to encode as features, because they account for surges in sales.
Similarly, weather can dampen sales at a brick-and-mortar store, and so could represent a key feature for predicting sales. Often, the domain expertise needed to add such features is not available or is prohibitively expensive to obtain. Moreover, even when it is available, the addition of features is mostly a human-driven process. It is tedious because it involves guesswork, coding, and a lengthy trial-and-error process. However, a lack of investment in feature engineering impacts the quality of models and, consequently, results in a lack of confidence in the use of such models in the real world.

Automation of these time-consuming aspects of a machine learning cycle is therefore an area of significant importance and recent interest. Different approaches to automated feature engineering have been proposed recently and are being used in industry³. Recent works on feature engineering or enhancement [7, 9, 6, 11] have proposed different methods for transforming the feature space using mathematical functions in order to morph the data into a more suitable feature set. Similarly, Deep Learning performs feature engineering implicitly by embedding features based on learned

* Work done during first author’s internship at IBM Research.
² https://tinyurl.com/yyb83ujh
³ https://www.featuretools.com/, https://www.h2o.ai, https://www.ibm.com/cloud/watson-studio/autoai, https://cloud.google.com/automl-tables/

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.