Malicious Website Detection: Effectiveness and Efficiency Issues

Birhanu Eshete, Adolfo Villafiorita, Komminist Weldemariam
Center for Information Technology (FBK-IRST)
Fondazione Bruno Kessler
via Sommarive 14, 38123 Trento, Italy
Email: (eshete, adolfo, sisai)@fbk.eu

Abstract—Malicious websites, when visited by an unsuspecting victim, infect her machine to steal invaluable information, redirect her to malicious targets, or compromise her system to mount future attacks. While existing approaches have promising prospects in detecting malicious websites, there are still open issues in effectively and efficiently addressing: filtering of web pages from the wild; coverage of a wide range of malicious characteristics to capture the big picture; continuous evolution of web page features; systematic combination of features; semantic implications of feature values on characterizing web pages; and the ease and cost of flexibility and scalability of analysis and detection techniques with respect to inevitable changes in the threat landscape. In this position paper, we highlight our ongoing efforts towards effective and efficient analysis and detection of malicious websites, with particular emphasis on a broader feature space and attack payloads, flexibility of techniques as malicious characteristics and web pages change, and, above all, real-life usability of techniques in defending users against malicious websites.

Keywords—Malicious Websites, Detection, Efficiency, Effectiveness

I. INTRODUCTION

Attackers lure an unsuspecting victim into visiting malicious websites to steal important credentials from the victim or to install malware on the victim's machine and use it as a springboard for future exploits [1], [2], [3], [4]. When the victim visits a malicious website, the attack is initiated and, upon finding evidence of exploitable vulnerabilities (e.g., of browser components [5], of browser extensions [6]), the attack payload is executed.
To defend Web users against malicious websites, several automated analysis and detection techniques have been proposed. However, given the alarming prevalence of malicious websites and the ever-changing techniques for crafting attack payloads, combined with emerging threats, current approaches to the problem have common and specific limitations in effectively and efficiently: characterizing the malicious payloads using a more complete feature set; incorporating the inevitable evolution of web page features; providing systematic methods for selecting and composing web page features; and ensuring the flexibility and scalability of feature extraction, model construction, and model training. In this position paper, we highlight critical issues in this regard and propose a research roadmap for improving effectiveness and efficiency in the automated analysis and detection of malicious websites.

II. MALICIOUS WEBSITES: SCOPE OF THE PROBLEM

To combat the impact of malicious websites, the approaches proposed so far fall into two complementary categories: static and dynamic analysis. The former relies mainly on the source code and on static features, such as URL structure, host-based information, and web page content, to perform analysis and construct characterizations of the malicious payload. The latter focuses on capturing "behaviors" manifested when the page is rendered in a controlled environment. A strategy common to both approaches is to extract features of some type and analyze them to obtain patterns of malicious payloads, based on which a classification algorithm is trained using machine learning techniques.

A widely used protection technique is based on blacklisting of known malicious URLs and IP addresses collected via manual reporting, honeypots, and custom analysis techniques. While lightweight to deploy and easy to use, blacklisting is effective only if one can exhaustively identify malicious websites and update the blacklist in a timely manner.
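To make the contrast between the two protection strategies concrete, the following minimal Python sketch illustrates (a) a blacklist as a simple lookup over known-bad URLs, and (b) extraction of a few lexical URL features of the kind a classifier would be trained on. The feature names, blacklist entries, and URLs are illustrative assumptions, not taken from the cited systems.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries; real blacklists are compiled from
# manual reports, honeypots, and custom analysis, as noted in the text.
BLACKLIST = {"http://known-bad.example/payload"}

def is_blacklisted(url):
    """Blacklisting is a plain membership test: fast and lightweight,
    but blind to fresh or relocated malicious sites absent from the list."""
    return url in BLACKLIST

def lexical_features(url):
    """Extract simple lexical features (URL, domain, path, and query
    lengths) without fetching or rendering the page; a trained
    classifier would consume a vector like this."""
    parts = urlparse(url)
    return {
        "url_len": len(url),
        "domain_len": len(parts.netloc),
        "path_len": len(parts.path),
        "query_len": len(parts.query),
    }
```

A usage example: `lexical_features("http://a.example.com/login?id=1")` yields the four lengths of that URL's components, which can be computed in microseconds; this is the speed advantage (and the limited visibility) of static, lexical approaches discussed below.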
In practice, doing so is infeasible: fresh websites are too new to be blacklisted even if they are malicious; some websites may escape blacklisting due to incorrect analysis (e.g., due to "cloaking"); and attackers may frequently change where malicious websites are hosted, so that the URLs and IP addresses change accordingly [1], [3].

Lexical aspects of the URL (e.g., URL length, domain name length, query length, path length) and host-based information (e.g., WHOIS information, DNS records) have been demonstrated to economically characterize malicious web pages in [2] and [7], and partly in [1]. The major assumption in such approaches is that the URL tokens and host-based values of malicious URLs tend to differ from those of benign ones. The strength of such approaches is the speed of feature extraction, since the URL need not be executed. However, if we consider the WHOIS information of websites registered recently by registrars with low reputation, such websites are likely to be classified as malicious due to low reputation scores. In effect, there is a high risk of false positives. Conversely, false negatives may arise, as old and well-reputed registrars may host malicious