MLVis International Workshop on Machine Learning in Visualisation for Big Data (2021)
D. Archambault, J. Peltonen, and I. Nabney (Editors)

Controllably Sparse Perturbations of Robust Classifiers for Explaining Predictions and Probing Learned Concepts

Jay Roberts 1 and Theodoros Tsiligkaridis 2
MIT Lincoln Laboratory
1 Homeland Sensors and Analytics Group
2 Artificial Intelligence Technology Group

Abstract
Explaining the predictions of a deep neural network (DNN) in image classification is an active area of research. Many methods focus on localizing the pixels, or groups of pixels, that maximize a relevance metric for the prediction. Others create local "proxy" explainers that aim to account for an individual prediction of a model. We aim to explore "why" a model made a prediction by perturbing inputs to robust classifiers and interpreting the semantically meaningful results. For such an explanation to be useful to humans it is desirable for it to be sparse; however, generating sparse perturbations can be computationally expensive and infeasible on high-resolution data. Here we introduce controllably sparse explanations that can be efficiently generated on higher-resolution data to provide improved counterfactual explanations. Further, we use these controllably sparse explanations to probe what the robust classifier has learned. These explanations could provide insight for model developers as well as assist in detecting dataset bias.

CCS Concepts
• Computing methodologies → Machine learning; Artificial intelligence;

1. Introduction
Deep Convolutional Neural Networks (CNNs) have revolutionized the field of computer vision [LBH15] and are increasingly being deployed in high-stakes domains such as autonomous driving, medical diagnosis, and many others. Despite such proliferation, the high-capacity, complex nature of CNNs has made an encompassing theory of how they make their decisions elusive, with many end users treating CNNs as a "black box".
This has led to thrusts in both academia and industry to establish frameworks of reliability and transparency for artificial intelligence as a whole [Pic18, Mic19, Lop20]. An additional concern for such high-capacity models is that their decisions may be unstable: small perturbations of inputs can dramatically change a model's predictions [GSS15]. This problem has been studied extensively, with many defenses proposed to make models robust to such attacks [KGB17, RDV18, MMS*18, MDUFF19].

Adversarial robustness conveys benefits beyond its original intent and may improve a wide class of explainability techniques. The improvement robustness provides to saliency maps has been studied before [ELMS19]. Figure 1 shows examples of common pixel-attribution-based methods [STY17, STK*17] and how robustness leads to improved visualizations over their standard counterparts. Local linear proxy models have been used as explanation techniques [RSG16b, AMJ18, PASC*20].

Figure 1: A comparison of standard (top) and robust (bottom) models for various saliency explanation methods.

There is ample work suggesting that the mechanism underlying adversarial robustness is the regularity (or local linearity) of the loss landscape of the model [LHL15, RDV18, MDUFF19, QMG*19], which can aid the search for these proxies. Finally, it has also been observed that robust models exhibit generative features that align with those a human would use to classify an image [STT*19, IST*19, EIS*19]. For these reasons we leverage adversarially robust models for xAI techniques that capture meaningful semantics.

1.1. Related Works
Many visual explanation techniques for image classification focus on pixel importance via gradient methods for prediction im-

© 2021 The Author(s)
Eurographics Proceedings © 2021 The Eurographics Association.
DOI: 10.2312/mlvis.20211072
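The instability cited above [GSS15] comes from taking a small input step in the direction of the sign of the loss gradient. A minimal sketch of this gradient-sign perturbation is below; the logistic "classifier," its weights, and the step size eps are illustrative assumptions standing in for a trained CNN, not the method of this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear logistic classifier standing in for a CNN (illustrative only).
w = np.array([2.0, -1.0])
b = 0.0

def predict(x):
    return int(sigmoid(w @ x + b) > 0.5)

x = np.array([0.1, 0.05])  # clean input; the model predicts class 1
y = 1                      # true label

# Gradient of the cross-entropy loss w.r.t. the INPUT (not the weights):
# for a logistic model, dL/dx = (p - y) * w.
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# Gradient-sign step in the style of [GSS15]: perturb each input
# coordinate by +/- eps so as to increase the loss.
eps = 0.2
x_adv = x + eps * np.sign(grad_x)

print(predict(x), predict(x_adv))  # the small perturbation flips the prediction
```

Even in this toy setting the perturbation has magnitude eps in every coordinate yet flips the predicted class, which is the instability that robust training methods are designed to suppress.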