Interactive Martingale Boosting

Ashish Kulkarni and Pushpak Burange and Ganesh Ramakrishnan
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
Mumbai-400076, India
{kulashish, pushpakburange}@gmail.com, ganesh@cse.iitb.ac.in

Abstract

We present an approach and a system that explore the application of interactive machine learning to a branching program-based boosting algorithm, Martingale Boosting. Typically, its performance rests on the ability of a learner to meet a fixed objective and does not account for preferences (e.g., low false positives) arising from the underlying classification problem. We use user preferences gathered on holdout data to guide the two-sided advantages of the individual weak learners and tune them to meet these preferences. Extensive experiments show that while arbitrary preferences might be difficult to meet for a single classifier, a non-linear ensemble of classifiers, such as the one constructed by martingale boosting, performs better.

1 Introduction

Boosting algorithms are efficient procedures that, given access to a weak learning algorithm, use weak learners to construct a strong learner with arbitrarily low error on any given probability distribution. Kearns and Valiant [Kearns and Valiant, 1994] defined a weak learning algorithm as one that produces a hypothesis performing slightly better than random on that distribution. There is a large body of work [Schapire, 1990; Freund, 1995; Kearns and Mansour, 1996] that has proposed and studied several boosting algorithms for their theoretical soundness. However, Martingale Boosting (MB) [Long and Servedio, 2005], with its proven tolerance to noise and well-defined structure, might be a promising approach to practical classification problems. The algorithm assumes access to a weak learning algorithm with a two-sided (on the positive and the negative class) advantage γ, such that the accuracy of each weak learner on each class is at least 1/2 + γ.
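As an illustration of the two-sided advantage condition, the following sketch (the function name and data layout are ours, not from the paper) estimates a hypothesis's accuracy separately on the positive and the negative class and reports the smaller one minus 1/2; martingale boosting assumes this quantity is at least some γ > 0 for every weak learner.

```python
def two_sided_advantage(h, examples, labels):
    """Estimate the two-sided advantage of hypothesis h on labeled data:
    the smaller of its class-conditional accuracies minus 1/2.
    h maps an example to a predicted label in {0, 1}."""
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    acc_pos = sum(h(x) == 1 for x in pos) / len(pos)  # accuracy on positives
    acc_neg = sum(h(x) == 0 for x in neg) / len(neg)  # accuracy on negatives
    return min(acc_pos, acc_neg) - 0.5
```

A hypothesis with high overall accuracy can still have a small two-sided advantage if its errors are concentrated on one class, which is exactly what a single fixed objective tends to hide.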
However, a practical application might prefer differential two-sided advantages (as we discuss next), and these are not explicitly modeled. Depending on the choice of the learning algorithm, a learner usually seeks a hypothesis that meets a fixed objective, such as "minimize logistic loss" or "maximize classification accuracy". Such a predefined objective might fail to capture the preferences that are inherent in the underlying classification problem. For instance, in the case of spam classification [Yih et al., 2006], classifying a non-spam document as spam might incur a higher utility cost than leaving a spam document undetected. On the other hand, in the medical domain, false negatives indicate missed diagnoses, and having them might be catastrophic. In general, rather than seeking a hypothesis that meets a predefined preference, a learner might benefit from a human-in-the-loop approach where such preferences are specified by users in an interactive "dialog" with the model. These preferences in turn guide the two-sided advantages of the individual weak learners.

2 Related Work

Classification with asymmetric costs: There is a body of research that explores the thesis that different learners might make errors on different training examples and therefore proposes a multistage cascade [Gavrilut et al., 2009; Kaynak and Alpaydin, 2000] of classifiers. The classification of an example is either a collective decision of all the classifiers in the cascade [Gavrilut et al., 2009], or a classifier might only receive examples that are rejected by the previous classifiers [Kaynak and Alpaydin, 2000; Viola and Jones, 2001; Alpaydin and Kaynak, 1998]. Training using utility has also been explored by [Wu et al., 2008], who propose an Asymmetric Support Vector Machine (ASVM) that accounts for tolerance to false positives in its objective.
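The asymmetric-cost setting above can be made concrete with a small sketch (names are illustrative, not from any cited system): it computes a hypothesis's false-positive and false-negative rates on a holdout set, the kind of error profile a user could inspect before stating a preference such as "lower the false-positive rate".

```python
def holdout_error_profile(h, holdout, labels):
    """Compute (false-positive rate, false-negative rate) of hypothesis h
    on a holdout set. h maps an example to a predicted label in {0, 1}."""
    fp = sum(1 for x, y in zip(holdout, labels) if y == 0 and h(x) == 1)
    fn = sum(1 for x, y in zip(holdout, labels) if y == 1 and h(x) == 0)
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    return fp / n_neg, fn / n_pos
```

In a spam setting a user might ask for a lower false-positive rate at the cost of more false negatives; in a diagnosis setting, the reverse.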
Another approach, known as stratification [Olshen and Stone, 1984], works by re-weighting instances and is explored by [Yih et al., 2006] for the problem of spam classification. Cost-sensitive boosting-based approaches [Fan et al., 1999; Masnadi-Shirazi and Vasconcelos, 2011] also incorporate cost in instance re-weighting; however, they (1) assume prior knowledge of the misclassification costs and (2) weight training instances, whereas we weight holdout data.

Boosting: Boosting methods work by constructing several weak classifiers that collectively give rise to a strong classifier. While AdaBoost [Freund and Schapire, 1997] takes a linear combination of these classifiers, Freund [Freund, 1995] uses a majority vote to label instances. One drawback of these algorithms is their intolerance to noisy data, and this has led to a growing interest in non-linear, branching program-based boosting approaches. Kearns et al. explored the boosting ability of top-down decision tree algorithms but identified the exponential growth of tree size as a problem. Branching programs are a generalization of decision trees and can represent functions that are significantly more powerful [Mansour and McAllester, 2002]. Recently, [Long and Servedio, 2005] proposed an approach called martingale boosting that constructs branching programs with a well-defined structure and has an elegant graph-walk-based analysis. Adaptive martingale boosting [Long and Servedio,