Unveiling MIMETIC: Interpreting Deep Learning Traffic Classifiers via XAI Techniques

Alfredo Nascita, Antonio Montieri, Giuseppe Aceto, Domenico Ciuonzo, Valerio Persico, Antonio Pescapè
University of Napoli “Federico II” (Italy)
a.nascita@studenti.unina.it, {antonio.montieri, giuseppe.aceto, domenico.ciuonzo, valerio.persico, pescape}@unina.it

Abstract—The widespread use of powerful mobile devices has deeply affected the mix of traffic traversing both the Internet and enterprise networks (with bring-your-own-device policies). Traffic encryption has become extremely common, and the rapid proliferation of mobile apps, together with their simple distribution and update, has created a particularly challenging scenario for traffic classification and its uses, especially network-security-related ones. The recent rise of Deep Learning (DL) has responded to this challenge, by providing a solution to the time-consuming and human-limited handcrafted feature design, and better classification performance. The counterpart of these advantages is the lack of interpretability of such black-box approaches, limiting or preventing their adoption in contexts where the reliability of results, or the interpretability of policies, is necessary. To cope with these limitations, eXplainable Artificial Intelligence (XAI) techniques have recently seen intensive research. Along these lines, our work applies XAI-based techniques (namely, Deep SHAP) to interpret the behavior of a state-of-the-art multimodal DL traffic classifier. As opposed to common results seen in XAI, we aim at a global interpretation, rather than sample-based ones. The results quantify the importance of each modality (payload- or header-based), and of specific subsets of inputs (e.g., TLS SNI and TCP Window Size), in determining the classification outcome, down to the per-class (viz. per-application) level. The analysis is based on a recent publicly-released dataset focused on mobile-app traffic.
Index Terms—traffic classification; encrypted traffic; explainable artificial intelligence; deep learning; multimodal learning.

I. INTRODUCTION

The knowledge of the mix of traffic traversing a network is instrumental to several management activities: Traffic Classification (TC) has a key role in defining a “normal” traffic profile for the purpose of anomaly detection, or in extracting (or inferring) fingerprints for intrusion detection and attack identification. Moreover, TC can also be exploited for defining technical boundaries for censorship enforceability, and for assessing the effectiveness of surveillance and blocking countermeasures. For these reasons TC has seen consistent research and field adoption over the years, and is now seeing a renewed blossoming of interest due to the recent evolution of network usage. Indeed, the widespread availability of well-equipped smartphones has impacted both the Internet and enterprise networks (due to bring-your-own-device policies), presenting a highly dynamic and extensively encrypted mix of traffic. On the other hand, new powerful Artificial Intelligence techniques (namely Deep Learning, “DL” in the following) have become available to face the new classification challenges. DL approaches are characterized by a fully-automated feature-extraction phase (with reduced need for human experts in the loop) and a greater ability to learn from huge volumes of data, which provides better performance than traditional Machine Learning (ML) approaches. These highly desirable characteristics of DL come at the cost of a lack of interpretability of the results, as the black-box nature of DL techniques hides the reasons behind specific classification outcomes. This impacts the understanding of classification errors and the evaluation of resilience against adversarial manipulation of traffic aimed at impairing identification.
Moreover, by understanding the behavior of the learned model, performance enhancements can be pursued through much more focused and efficient research, compared with a less-informed exploration of the (typically huge) hyper-parameter space. In fact, DL approaches naturally keep hidden the answers to basic questions such as “which parts of a complex architecture contribute most to the final decision?”, “which specific fields, packets, or protocols are the most important in the classification process?”, or “which ones are responsible for classification errors or circumvention?”.

The field of eXplainable Artificial Intelligence (XAI) constitutes the answer to these needs, as it provides approaches and techniques able to relate the structure of the model and the input to the respective classification outcome, partially revealing the (formerly) completely black box. The adoption of DL and (consequently) of XAI is relatively new, especially in the field of network traffic classification: with this work we contribute to this step forward in the understanding of DL-based network traffic classifiers.

To this aim, we perform the behavior interpretation of a state-of-the-art DL architecture for TC we recently proposed [1], analyzing the relative importance of inputs at fine grain (i.e., per class) in the challenging task of classifying mobile apps. More specifically, we apply state-of-the-art XAI tools (namely, Deep SHAP [2]) to quantify and understand the importance of payload-derived and header-based inputs, further deepening the analysis to specific subsets of the inputs (i.e., TLS SNI for the payload; TCP Window Size, Payload Length, packet Inter-Arrival Time, and Direction for the header-based modality). For our experimental evaluation, we leverage the public traffic dataset MIRAGE-2019, which focuses on mobile-app traffic and is human-generated [3].

The paper is organized as follows. Section II surveys first

978-1-7281-5684-2/20/$31.00 ©2021 IEEE
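To make the notion of a global (rather than per-sample) interpretation concrete, the following is a minimal sketch of how per-sample attributions can be aggregated into per-class and overall input-importance scores. It assumes Deep SHAP attributions have already been computed (e.g., via the `shap` library's `DeepExplainer` on the trained classifier); here random placeholders stand in for them, and all dimensions and names are illustrative, not the paper's actual setup.

```python
import numpy as np

# Placeholder for precomputed Deep SHAP attributions: one
# (n_samples, n_inputs) array per class. In practice these would be
# produced by running shap.DeepExplainer on the trained model;
# random values are used here purely for illustration.
rng = np.random.default_rng(42)
n_samples, n_inputs, n_classes = 200, 6, 4  # toy dimensions
shap_values = [rng.normal(size=(n_samples, n_inputs))
               for _ in range(n_classes)]

def global_importance(per_class_shap):
    """Mean absolute SHAP value per input field, per class.

    Returns an array of shape (n_classes, n_inputs): averaging |SHAP|
    over samples turns local attributions into a global score.
    """
    return np.stack([np.abs(sv).mean(axis=0) for sv in per_class_shap])

imp = global_importance(shap_values)   # per-class importance matrix
overall = imp.mean(axis=0)             # class-agnostic importance
top_inputs = np.argsort(overall)[::-1] # input indices, most important first
```

Averaging the absolute attributions (rather than signed ones) is one common convention for global SHAP summaries; it ranks inputs by how strongly they move the output, regardless of direction.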