Supporting Law-Enforcement to Cope with Blacklisted Websites: Framework and Case Study

Mir Mehedi Ahsan Pritom * and Shouhuai Xu *
Department of Computer Science, University of Texas at San Antonio
Department of Computer Science, University of Colorado Colorado Springs

Abstract—Cyber attackers have long abused web domains and URLs to carry out attacks such as phishing, web scamming, and malware distribution. URL blacklisting has been widely used to defend against these attacks. However, this approach has significant weaknesses, especially from a law-enforcement point of view. In particular, law-enforcement often does not know what action to take against a blacklisted website (e.g., shutting down a host vs. a domain) because of the subtleties associated with the problem. To help law-enforcement deal with blacklisted URLs, we propose a novel Machine Learning (ML)-based framework that provides probabilistic classification together with interpretations of the predictions made by an interpretable model. These probabilistic classification and interpretability measures provide a basis for trustworthy decision-making by law-enforcement and remove the black-box nature of traditional ML-based approaches. Experimental results show that the framework is practical and has further potential to tackle website maliciousness.

I. INTRODUCTION

Websites have been widely abused as a medium for propagating cyberattacks [1]–[3]. One simple defense against these threats is to use URL and domain blacklists, which are client-side interventions and often provided by third-party vendors (e.g., Phishtank, Google Safe Browsing, URLhaus). However, this does not completely eliminate the threat because some users may not use such services, and the malicious or compromised domains or hosts are still on the loose.
Moreover, these blacklists are far from perfect [4]: they are neither complete, meaning that they do not contain all malicious websites [5]–[7], nor accurate, meaning that they contain many false positives (including compromised websites that were malicious in the past but have since been cleaned up) [2]. Another defense is to use Machine Learning (ML) models to proactively detect malicious websites (see, e.g., [1], [2], [8]). There are also third-party vendors (e.g., Netcraft [9]) that provide takedown services on user request, for example against domains that imitate a user's brand, to protect users against cybercrimes (e.g., cybersquatting).

However, one important perspective has not been investigated in the literature, namely that of law-enforcement. We envision that law-enforcement will be, if not already, authorized to take actions against malicious websites, much like it does in the case of botnets [10]. This introduces a new dimension to the problem because law-enforcement must treat detected malicious websites carefully. For example, law-enforcement may be authorized to shut down a malicious website owned or operated by a malicious party, but may only be authorized to notify the owner or operator of a website that is itself compromised and then abused by an attacker to wage further attacks. Moreover, we often observe that attackers reuse the same domains and hosts for new URL-based attacks, causing the same domain or hostname to appear repeatedly in a URL blacklist [11]. This phenomenon further motivates the law-enforcement perspective, as higher-level interventions (e.g., at the domain or host level) are more effective than client-side interventions. This calls for studies that help law-enforcement distinguish between malicious (i.e., attacker-owned) and compromised (i.e., legitimate-party-owned) websites before taking action.

Our Contributions.
In this paper, we make three contributions. First, we initiate the study of the law-enforcement perspective in coping with malicious websites. This turns out to be challenging because of the dynamic nature of web domains and the complexity of web hosting infrastructures. This prompts us to introduce a novel framework to help law-enforcement cope with malicious websites. The framework highlights the importance of using interpretable (i.e., explainable) ML while accounting for the probabilistic uncertainty associated with the prediction outcomes. The framework integrates an ML interpretability system, such as InterpretML [12], to provide explanations and probabilistic predictions to law-enforcement (e.g., why is a website predicted as malicious, and what is the likelihood that it is indeed malicious?). Second, we investigate how to choose the entity against which action is taken: domain vs. hostname. To our knowledge, this is the first principled method proposed for making such decisions. Third, we conduct a case study evaluating an instance of the framework with a real-world URL blacklist. Experimental results show that we achieve an 86% accuracy and a 0.92 F1-score, while providing local explainability (i.e., an interpretation of the individual prediction outcome) for each input blacklisted website.

Paper Outline. Section II presents the problem statement. Section III describes our framework. Section IV presents our case study and results. Section V discusses limitations of this study. Section VI discusses related prior studies. Section VII concludes the paper.

2022 IEEE Conference on Communications and Network Security (CNS), 978-1-6654-6255-6/22/$31.00 ©2022 IEEE, DOI: 10.1109/CNS56114.2022.9947260
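To make the idea of interpretable, probabilistic classification concrete, the following is a minimal sketch in pure Python. It is not the paper's implementation (which uses InterpretML [12] and real blacklist features); instead it trains a small logistic regression, an inherently interpretable model, on hypothetical features and produces both a probability of maliciousness and a per-feature contribution breakdown, i.e., a local explanation of the kind the framework would hand to law-enforcement. All feature names and training values below are illustrative assumptions, not data from the paper.

```python
import math

# Hypothetical features of a blacklisted website (illustrative only):
# [domain_age_years, blacklist_reappearances, https_enabled]
# Label: 1 = attacker-owned (malicious), 0 = compromised (legitimate owner).
TRAIN_X = [
    [0.1, 5, 0], [0.2, 3, 0], [0.3, 4, 1], [0.1, 6, 0],   # attacker-owned
    [8.0, 1, 1], [5.5, 0, 1], [10.2, 1, 1], [7.3, 0, 0],  # compromised
]
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]
FEATURE_NAMES = ["domain_age_years", "blacklist_reappearances", "https_enabled"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=2000):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def explain(w, b, x, names):
    """Return P(malicious) plus each feature's additive contribution to the
    logit -- a simple local explanation of this single prediction."""
    contribs = {n: wi * xi for n, wi, xi in zip(names, w, x)}
    p = sigmoid(sum(contribs.values()) + b)
    return p, contribs

if __name__ == "__main__":
    w, b = train(TRAIN_X, TRAIN_Y)
    # A young domain that keeps reappearing on the blacklist:
    p, contribs = explain(w, b, [0.2, 4, 0], FEATURE_NAMES)
    print(f"P(attacker-owned) = {p:.2f}")
    for name, c in sorted(contribs.items(), key=lambda kv: -abs(kv[1])):
        print(f"  {name}: {c:+.2f}")
```

Because the model is additive in its features, the contribution list directly answers the two questions the framework poses ("why is this website predicted as malicious, and how likely is it?"); InterpretML's Explainable Boosting Machines expose the same kind of per-feature local explanation for more expressive models.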