Unveiling MIMETIC: Interpreting Deep Learning Traffic Classifiers via XAI Techniques

Alfredo Nascita, Antonio Montieri, Giuseppe Aceto, Domenico Ciuonzo, Valerio Persico, Antonio Pescapè
University of Napoli “Federico II” (Italy)
a.nascita@studenti.unina.it, {antonio.montieri, giuseppe.aceto, domenico.ciuonzo, valerio.persico, pescape}@unina.it

Abstract—The widespread use of powerful mobile devices has deeply affected the mix of traffic traversing both the Internet and enterprise networks (with bring-your-own-device policies). Traffic encryption has become extremely common, and the rapid proliferation of mobile apps, together with their simple distribution and update, has created a particularly challenging scenario for traffic classification and its uses, especially network-security-related ones. The recent rise of Deep Learning (DL) has responded to this challenge, by providing a solution to the time-consuming and human-limited handcrafted feature design, and better classification performance. The counterpart of these advantages is the lack of interpretability of such black-box approaches, limiting or preventing their adoption in contexts where the reliability of results, or the interpretability of policies, is necessary. To cope with these limitations, eXplainable Artificial Intelligence (XAI) techniques have recently seen intensive research. Along these lines, our work applies XAI-based techniques (namely, Deep SHAP) to interpret the behavior of a state-of-the-art multimodal DL traffic classifier. As opposed to common results seen in XAI, we aim at a global interpretation, rather than sample-based ones. The results quantify the importance of each modality (payload- or header-based), and of specific subsets of inputs (e.g., TLS SNI and TCP Window Size), in determining the classification outcome, down to the per-class (viz. per-application) level. The analysis is based on a recent publicly-released dataset focused on mobile-app traffic.
Index Terms—traffic classification; encrypted traffic; explainable artificial intelligence; deep learning; multimodal learning.

I. INTRODUCTION

The knowledge of the mix of traffic traversing a network is instrumental to several management activities: Traffic Classification (TC) has a key role in defining a “normal” traffic profile for the purpose of anomaly detection, or in extracting (or inferring) fingerprints for intrusion detection and attack identification. Moreover, TC can also be exploited for defining technical boundaries for censorship enforceability, and for assessing the effectiveness of surveillance and blocking countermeasures. For these reasons TC has seen consistent research and field adoption over the years, and is now seeing a renewed blossoming of interest due to the recent evolution of network usage. Indeed, the widespread availability of well-equipped smartphones has impacted both the Internet and enterprise networks (due to bring-your-own-device policies), presenting a highly dynamic and extensively encrypted mix of traffic. On the other hand, new powerful Artificial Intelligence techniques (namely Deep Learning, “DL” in the following) have become available to face the new classification challenges. DL approaches are characterized by a fully-automated feature-extraction phase (with reduced need for human experts in the loop) and a greater ability to learn from huge volumes of data, which provides better performance than traditional Machine Learning (ML) approaches. These highly desirable characteristics of DL come at the cost of a lack of interpretability of the results, as the black-box nature of DL techniques hides the reasons behind specific classification outcomes. This impacts the understanding of classification errors and the evaluation of resilience against adversarial manipulation of traffic aimed at impairing identification.
Moreover, by understanding the behavior of the learned model, performance enhancements can be pursued through much more focused and efficient research, compared with a less-informed exploration of the (typically huge) hyper-parameter space. In fact, DL approaches naturally keep hidden the answers to basic questions such as “which parts of a complex architecture contribute most to the final decision?”, “which specific fields, packets, or protocols are the most important in the classification process?”, or “which ones are responsible for classification errors or circumvention?”.

The field of eXplainable Artificial Intelligence (XAI) constitutes the answer to these needs, as it provides approaches and techniques able to relate the structure of the model and the input to the respective classification outcome, partially revealing the (formerly) completely black box. The adoption of DL and (consequently) of XAI is relatively new, especially in the field of network traffic classification: with this work we contribute to this step forward in the understanding of DL-based network traffic classifiers.

To this aim, we perform the behavior interpretation of a state-of-the-art DL architecture for TC we recently proposed [1], analyzing the relative importance of inputs at fine grain (i.e., per class) in the challenging task of classifying mobile apps. More specifically, we apply state-of-the-art XAI tools (namely, Deep SHAP [2]) to quantify and understand the importance of payload-derived and header-based inputs, further deepening the analysis to specific subsets of the inputs (i.e., TLS SNI for the payload; TCP Window Size, Payload Length, packet Inter-Arrival Time, and Direction for the header-based modality). For our experimental evaluation, we leverage the public traffic dataset MIRAGE-2019, which focuses on mobile-app traffic and is human-generated [3].

The paper is organized as follows. Section II surveys first

978-1-7281-5684-2/20/$31.00 ©2021 IEEE
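To make the notion of a global (rather than per-sample) interpretation concrete, the following is a minimal sketch of how per-sample attributions can be aggregated into per-class and overall input-importance scores. It assumes Deep SHAP attributions have already been computed (e.g., via the `shap` library's `DeepExplainer` on the trained classifier); here random placeholders stand in for them, and all dimensions and names are illustrative, not the paper's actual setup.

```python
import numpy as np

# Placeholder for precomputed Deep SHAP attributions: one
# (n_samples, n_inputs) array per class. In practice these would be
# produced by running shap.DeepExplainer on the trained model;
# random values are used here purely for illustration.
rng = np.random.default_rng(42)
n_samples, n_inputs, n_classes = 200, 6, 4  # toy dimensions
shap_values = [rng.normal(size=(n_samples, n_inputs))
               for _ in range(n_classes)]

def global_importance(per_class_shap):
    """Mean absolute SHAP value per input field, per class.

    Returns an array of shape (n_classes, n_inputs): averaging |SHAP|
    over samples turns local attributions into a global score.
    """
    return np.stack([np.abs(sv).mean(axis=0) for sv in per_class_shap])

imp = global_importance(shap_values)   # per-class importance matrix
overall = imp.mean(axis=0)             # class-agnostic importance
top_inputs = np.argsort(overall)[::-1] # input indices, most important first
```

Averaging the absolute attributions (rather than signed ones) is one common convention for global SHAP summaries; it ranks inputs by how strongly they move the output, regardless of direction.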