ROBY: a Tool for Robustness Analysis of Neural Network Classifiers

Paolo Arcaini, National Institute of Informatics, Tokyo, Japan, arcaini@nii.ac.jp
Andrea Bombarda, University of Bergamo, Bergamo, Italy, andrea.bombarda@unibg.it
Silvia Bonfanti, University of Bergamo, Bergamo, Italy, silvia.bonfanti@unibg.it
Angelo Gargantini, University of Bergamo, Bergamo, Italy, angelo.gargantini@unibg.it

Abstract—Classification using Artificial Neural Networks (ANNs) is widely applied in critical domains, such as autonomous driving and medical practice; therefore, the validation of such classifiers is extremely important. A common approach consists in assessing the network's robustness, i.e., its ability to correctly classify input data that is particularly challenging for classification. We recently proposed a robustness definition that considers input data degraded by alterations that may occur in reality; the approach was originally devised for image classification in the medical domain. In this paper, we extend the definition of robustness to any type of input for which alterations can be defined. Then, we present ROBY, a tool for ROBustness analYsis of ANNs. The tool accepts different types of data (images, sounds, text, etc.) stored either locally or on Google Drive. The user can apply alterations provided by the tool, or define their own. The robustness computation can be performed either locally or remotely on Google Colab. The tool has been used to compute the robustness of image and sound classifiers from the medical and automotive domains.

Index Terms—NN classifier, robustness, ML testing

I. INTRODUCTION

Artificial Neural Networks (ANNs) are increasingly used to perform different activities [4], among which classification is one of the most popular. ANN-based classification is used in critical domains [13], e.g., in autonomous driving [7] or in medical practice [6]; therefore, the validation of such networks is of paramount importance.
A desired property of an ANN under test is its robustness, i.e., the ability of the network to correctly evaluate unknown inputs (not seen during training). Typical approaches define robustness using adversarial examples, i.e., inputs that are particularly challenging for the network under test. However, it has been noted that, since adversarial examples are often created by exploiting the internal structure of the network [17], they may not reflect real inputs that could occur during network usage [10], [11], [20]. Therefore, we have recently proposed to define robustness by considering real alterations that may occur to input data. In [5], we defined the typical alterations (e.g., blur) that could occur during image acquisition in the medical practice of cancer detection using Convolutional Neural Networks (CNNs); moreover, we also provided a formal definition of robustness that assesses to what extent the network is robust against image alterations. Similarly, Secci and Ceccarelli [15] defined alterations that could occur during image acquisition with an RGB camera (e.g., condensation), and assessed the performance decrease of an autonomous driving agent that takes such altered images as input.

Given the widespread use of ANNs in critical contexts, there is an increasing need for ANN testing approaches that are actually implemented in usable tools, so that they can be adopted in development practice. However, while new techniques for ANN testing are constantly being proposed [14], [21], their implementations are often not available: a recent survey on ML testing [14] found that 71% of the surveyed papers do not provide any artifact.

P. Arcaini is supported by ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603), JST, and by Engineerable AI Techniques for Practical Applications of High-Quality Machine Learning-based Systems Project (Grant Number JPMJMI18BB), JST-Mirai.
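The alteration-based view of robustness described above can be illustrated with a minimal sketch: accuracy is measured while an alteration is applied at increasing intensity levels, and the resulting accuracy-vs-alteration curve is summarized into a single score. All names below (the noise alteration, the thresholded summary) are illustrative assumptions, not the actual definition from [5] or the ROBY API.

```python
# Illustrative sketch of robustness under input alterations.
# The alteration (Gaussian noise) and the summary metric (fraction of
# alteration levels at which accuracy stays above a threshold) are
# assumptions for illustration, not the paper's formal definition.
import random

def altered(x, level):
    # Toy "alteration": perturb each feature with noise of the given intensity.
    return [v + random.gauss(0, level) for v in x]

def accuracy(model, data, labels, level):
    # Accuracy of the classifier when every input is altered at `level`.
    preds = [model(altered(x, level)) for x in data]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def robustness(model, data, labels, levels, threshold=0.8):
    # One simple summary of the accuracy-vs-alteration curve:
    # the fraction of levels at which accuracy stays above `threshold`.
    accs = [accuracy(model, data, labels, lvl) for lvl in levels]
    return sum(a >= threshold for a in accs) / len(levels)
```

Plotting the per-level accuracies against the alteration levels gives the kind of accuracy-degradation curve that a robustness tool can report alongside the scalar score.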
Even when the implementation is available, the techniques are usually applicable only to the considered domain, and adopting them in other contexts requires considerable effort. Our definition of robustness [5] was originally proposed and implemented only for image classification, and it was evaluated only in the medical domain. However, the definition is quite general, and it is applicable to the classification of different types of data (e.g., images, sounds, etc.), once the alterations typical of the domain (e.g., the medical domain, autonomous driving, speech recognition for domotics, etc.) are defined. Therefore, in this paper, we generalize our approach and present ROBY, a tool for ROBustness analYsis. The tool has been engineered so that it can be used, with minimal effort, by different users in different domains for different types of data. A user must only specify:
• the location of their test data set;
• how to retrieve the correct classification of the test data;
• which alterations to apply to the input data; the user can use standard ones provided by the tool, or specify their own;
• where to run the robustness computation (either locally or on Google Colab).
As a result, the tool computes the robustness values for the different alterations, and produces plots that show how the accuracy changes when alterations are applied, thereby visualizing the robustness. The tool is available at: https://github.com/fmselab/roby and it can also be installed using the package manager pip:
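The four inputs a user must specify can be gathered in a small configuration object. The sketch below is a hypothetical illustration of such a setup; the class, field names, and the folder-based labeling convention are assumptions made here, not the actual ROBY interface.

```python
# Hypothetical configuration mirroring the four user-provided inputs
# listed above. Names and conventions are illustrative assumptions,
# not the real ROBY API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RobustnessSetup:
    data_dir: str                              # location of the test data set
    label_of: Callable[[str], str]             # how to retrieve the correct class
    alterations: List[Callable] = field(default_factory=list)  # alterations to apply
    backend: str = "local"                     # "local" or "colab"

# Example: class labels encoded in the parent folder of each file,
# with a single placeholder (identity) alteration.
setup = RobustnessSetup(
    data_dir="./test-set",
    label_of=lambda path: path.split("/")[-2],
    alterations=[lambda sample, level: sample],
)
```

A tool consuming such a setup would iterate over the data set, apply each alteration at increasing levels, and emit the robustness scores and accuracy plots described above.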