Black-box error diagnosis in deep neural networks: a survey of tools

Piero Fraternali*, Federico Milani, Rocio Nahime Torres, Niccolò Zangrando§
Politecnico di Milano, Piazza Leonardo 32, Milan, Italy
{*piero.fraternali, federico.milani, rocionahime.torres, §niccolo.zangrando}@polimi.it

Abstract—The application of Deep Neural Networks (DNNs) to a broad variety of tasks demands methods for coping with the complex and opaque nature of these architectures. The analysis of performance can be pursued in two ways. On one side, model interpretation techniques aim at “opening the box” to assess the relationship between the input, the inner layers, and the output. For example, saliency and attention models exploit knowledge of the architecture to capture the essential regions of the input that have the most impact on the inference process and output. On the other hand, models can be analysed as “black boxes”, e.g., by associating the input samples with extra annotations that do not contribute to model training but can be exploited for characterizing the model response. Such performance-driven meta-annotations enable the detailed characterization of performance metrics and errors, and help scientists identify the features of the input responsible for prediction failures and focus their model improvement efforts. This paper presents a structured survey of the tools that support the “black box” analysis of DNNs and discusses the gaps in the current proposals and the relevant future directions in this research field.

Index Terms—black-box, error diagnosis, machine learning, evaluation, metrics

I. INTRODUCTION

The application of Deep Neural Networks (DNNs) to a broad variety of tasks demands methods for coping with the complex and opaque nature of these architectures. Such systems are normally used and evaluated as black boxes: the quality of their output is validated either qualitatively, via manual inspection, or quantitatively, by comparison with ground truth test data.
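As a concrete illustration of this quantitative, ground-truth-based validation, the sketch below computes three of the standard metrics discussed in this survey from their textbook definitions in plain Python; the labels and predictions are invented for the example and do not come from any of the surveyed tools.

```python
def precision_recall_f1(y_true, y_pred):
    """Standard binary classification metrics, written out from their definitions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical ground truth and model predictions for eight test samples
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # each metric is 0.75 on this toy data
```

In practice these metrics come off-the-shelf from the evaluation APIs of DNN frameworks; the point of the sketch is only that such scores summarize the model end-to-end, without saying anything about *why* it fails.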
Quantitative performance analysis exploits standard metrics, such as Accuracy, Precision, Recall, F1-Score, or Average Precision, which are implemented off-the-shelf in most DNN frameworks. Selecting the most appropriate metrics for quantitative performance analysis is by itself a concern that requires attention. Several works [1]–[4] discuss the challenges and the best practices relevant to the use of performance evaluation metrics, both in general and for specific tasks.

Standard metrics enable an end-to-end assessment of a model without providing any information on the potential sources of failures and on the components of the model that may cause them. This makes it difficult to identify flaws in the architecture and to implement appropriate countermeasures. Two approaches can be pursued to study model behavior. One line of research aims at improving model interpretability, by characterizing the relation of the internal representations of deep models to the input and output [5]–[8]. Techniques such as Class Activation Maps (CAMs) [9]–[13] highlight the most influential regions of the feature maps at different network levels and enable better insight into the model behavior.

An alternative option is to consider the model as a black box and analyze the impact that the properties of the input have on performance. The methods of this category enrich the description of the input samples with additional attributes not used for training and study how the performance metrics depend on the value of such attributes and how errors (e.g., wrong classifications or detections) correlate to specific features of the input. Such diagnosis-oriented input attributes can be either obtained automatically (e.g., image color space and aspect ratio, text language, etc.) or provided manually (e.g., domain-specific meta-data).

(Authors are listed in alphabetical order.)
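A minimal sketch of this black-box strategy follows. The attribute values, sample records, and helper name are all invented for illustration: each test sample carries a diagnosis-oriented meta-attribute (here, an aspect-ratio class that could be derived automatically from the image) that played no role in training, and breaking accuracy down by that attribute exposes the input subsets on which the model fails.

```python
from collections import defaultdict

# Hypothetical evaluation records: (meta-attribute value, prediction correct?) per
# test sample. The attribute is not used for training, only for error diagnosis.
results = [
    ("wide", True), ("wide", True), ("wide", False), ("wide", True),
    ("square", True), ("square", True),
    ("tall", False), ("tall", False), ("tall", True),
]

def accuracy_by_attribute(records):
    """Break a global accuracy score down by a diagnosis-oriented attribute."""
    hits, totals = defaultdict(int), defaultdict(int)
    for attr, correct in records:
        totals[attr] += 1
        hits[attr] += correct
    return {attr: hits[attr] / totals[attr] for attr in totals}

# A markedly lower score on one slice (here, "tall" images) points the model
# designer at the input characteristic that correlates with failures.
print(accuracy_by_attribute(results))
```

The same grouping generalizes to any metric and any attribute, whether computed automatically or annotated manually; the black-box tools surveyed here essentially systematize and visualize this kind of per-attribute breakdown.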
A black-box performance analysis and error diagnosis tool can be used to address questions such as: “Does the model fail consistently when the inputs exhibit a specific characteristic?” “How much would a given performance metric improve if a particular type of error were removed?” The insight resulting from black-box diagnosis can help model designers improve the training data set and/or focus the DNN design on the improvements with the highest expected gain.

A. Focus of the Survey and Methodology

The focus of this paper is a survey of the tools that support the black-box diagnosis of DNNs. The target of the research comprises those methods that exploit only knowledge about the input and output. Among such works, we highlight the proposals that provide a tool for DNN design, training and evaluation. This perimeter excludes contributions that also address DNN behavior and performance but pursue different targets, such as special-purpose and domain-dependent evaluation metrics, the visualization of DNN internal representations, model design for interpretability, and human-in-the-loop interpretation.

The corpus of the relevant research has been identified by means of the following procedure:

1) A keyword search has been conducted in the major bibliographic sources (Google, Google Scholar, DBLP, ACM Digital Library) using key phrases composed as follows: <search> :- <task> + <goal> + <system>

arXiv:2201.06444v1 [cs.LG] 17 Jan 2022