LAPID ET AL.

An Evolutionary, Gradient-Free, Query-Efficient, Black-Box Algorithm for Generating Adversarial Instances in Deep Networks

Raz Lapid, Zvika Haramaty, Moshe Sipper

Abstract—Deep neural networks (DNNs) are sensitive to adversarial data in a variety of scenarios, including the black-box scenario, where the attacker is only allowed to query the trained model and receive an output. Existing black-box methods for creating adversarial instances are costly, often relying on gradient estimation or on training a replacement network. This paper introduces Query-Efficient Evolutionary Attack, QuEry Attack, an untargeted, score-based, black-box attack. QuEry Attack is based on a novel objective function that can be used in gradient-free optimization problems. The attack requires access only to the output logits of the classifier and is thus unaffected by gradient masking. No additional information is needed, rendering our method more suitable to real-life situations. We test its performance against three state-of-the-art models—Inception-v3, ResNet-50, and VGG-16-BN—on three benchmark datasets: MNIST, CIFAR-10, and ImageNet. Furthermore, we evaluate QuEry Attack's performance against non-differentiable transformation defenses and state-of-the-art robust models. Our results demonstrate the superior performance of QuEry Attack, both in accuracy and in query efficiency.

Index Terms—Deep learning, computer vision, adversarial attack, evolutionary algorithm.

I. INTRODUCTION

Deep neural networks (DNNs) have become the central approach in modern-day artificial intelligence (AI) research. They have attained superb performance on multifarious complex tasks and are behind fundamental breakthroughs in a variety of machine-learning tasks that were previously thought to be too difficult. Image classification, object detection, machine translation, and sentiment analysis are just a few examples of domains revolutionized by DNNs.
Despite their success, recent studies have shown that DNNs are vulnerable to adversarial attacks. A barely detectable change in an image, for example, can cause a misclassification in a well-trained DNN. Targeted adversarial examples can even evoke a misclassification as a specific class (e.g., misclassify a car as a cat). Researchers have demonstrated that adversarial attacks succeed in the real world and can be produced for data modalities beyond imaging, e.g., natural language and voice recognition [1], [2], [3], [4]. DNNs' vulnerability to adversarial attacks has raised concerns about applying these techniques in safety-critical applications.

To discover effective adversarial instances, most past work on adversarial attacks has employed gradient-based optimization [5], [6], [7], [8], [9]. Gradient computation can be executed only if the attacker is fully aware of the model architecture and weights. These approaches are thus useful only in a white-box scenario, where an attacker has complete access to and control over the targeted DNN. Attacking real-world AI systems, however, can be far more arduous. The attacker must consider the difficulty of crafting adversarial instances in a black-box setting, in which no information about the network design, parameters, or training data is provided. In this situation, the attacker is exposed only to the classifier's input-output pairs. In this context, a typical strategy has been to attack a trained replacement network and hope that the generated examples transfer to the target model [10]. The substantial mismatch between the replacement model and the target model, as well as the significant computational cost of training the replacement network, often renders this technique ineffective.

(The authors are with the Department of Computer Science, Ben-Gurion University, Beer Sheva 84105, Israel. Corresponding author: R. Lapid, razla@post.bgu.ac.il.)
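The score-based black-box setting described above can be made concrete with a small sketch. The snippet below is illustrative only, not the paper's QuEry Attack: a toy linear classifier stands in for the target DNN, its weights are hidden behind a `query()` oracle that returns logits, and a simple (1+1)-style evolutionary loop searches the L-infinity ε-ball around the input using queries alone, with no gradient access. All names (`query`, `evolve_attack`) and parameter values are hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box target: the attacker never sees W, only query().
W = rng.normal(size=(10, 32))  # toy classifier: 10 classes, 32-dim "image"

def query(x):
    """Score-based black-box oracle: input -> output logits."""
    return W @ x

def evolve_attack(x, eps=0.5, sigma=0.1, budget=2000):
    """Illustrative (1+1)-style evolutionary search (not QuEry Attack itself):
    look for a misclassification inside the L-infinity eps-ball around x,
    using only oracle queries -- no gradients, no model internals."""
    label = int(np.argmax(query(x)))

    def loss(x_adv):
        logits = query(x_adv)
        others = np.delete(logits, label)
        # Margin loss: negative once the true class loses the argmax.
        return logits[label] - others.max()

    best, best_loss = x.copy(), loss(x)
    for q in range(budget):
        # Mutate the current best, then project back into the eps-ball
        # so the perturbation stays a valid constrained attack.
        cand = best + sigma * rng.normal(size=x.shape)
        cand = np.clip(cand, x - eps, x + eps)
        f = loss(cand)
        if f < best_loss:            # keep the fitter offspring
            best, best_loss = cand, f
        if best_loss < 0:            # adversarial: label has flipped
            return best, q + 1
    return best, budget
```

The key design point this sketch shares with score-based attacks in general is that fitness is computed purely from the returned logits, so gradient masking in the defended model has no effect on the search.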
In our work we assume a real-world, black-box attack scenario, wherein a DNN's input and output may be accessed but not its internal configuration. We focus on a setting in which the DNN is an image classifier, specifically a convolutional neural network (CNN), which accepts an image as input and outputs a probability score for each class.

Herein, we present an evolutionary, gradient-free optimization approach for generating adversarial instances. Our proposed attack can deal with either constrained (ε value that constrains the norm of the

arXiv:2208.08297v2 [cs.CV] 13 Sep 2022