Corpus and Deep Learning Classifier for Collection of Cyber Threat Indicators in Twitter Stream

Vahid Behzadan, Carlos Aguirre, Avishek Bose and William Hsu
Department of Computer Science
Kansas State University
{behzadan, caguirre, abose, bhsu}@ksu.edu

Abstract

This paper presents a framework for detection and classification of cyber threat indicators in the Twitter stream. In contrast to the bulk of similar proposals, which rely on manually designed heuristics and keyword-based filtering of tweets, our framework provides a data-driven approach for modeling and classification of tweets that are related to cybersecurity events. We present a cascaded Convolutional Neural Network (CNN) architecture, comprising a binary classifier for detection of cyber-related tweets and a multi-class model for the classification of cyber-related tweets into multiple types of cyber threats. Furthermore, we present an open-source dataset of 21000 annotated cyber-related tweets to facilitate validation and further research in this area.

1. Introduction

To keep pace with the growing complexity and frequency of cyber attacks, defensive operations are increasingly reliant on proactive measures. Such approaches require a timely, accurate, and actionable understanding of the threats that pose potential risks to protected systems. To meet this vital need, the paradigm of Cyber Threat Intelligence (CTI) has been introduced as a framework to facilitate the exploration, collection, and analysis of various sources of information on cyber threats.

As an important source of information, Open-Source Intelligence (OSINT) has proven to be a valuable resource for CTI. In particular, Twitter is deemed a rich source of OSINT.
The popularity of this medium among the cybersecurity community provides an environment for both offensive and defensive practitioners to discuss, report, and advertise timely indicators of vulnerabilities, attacks, malware, and other types of cyber events that are of interest to CTI analysts. The value of Twitter with regard to CTI is well demonstrated by the numerous initial reports of major cyber events, recent examples of which include disclosures of multiple 0-day Microsoft Windows vulnerabilities (footnote 1), user reports on DDoS attacks [1], and exposure of ransomware campaigns [2].

In recent years, research on Twitter-based OSINT collection has led to the proposal of multiple frameworks (e.g., [3], [4], [5], [6], [7], [8], [9]) for detection and analysis of threat indicators in the Twitter stream. However, the majority of these proposals rely heavily on manual heuristics, such as keyword lists, for detecting and filtering tweets that are relevant to cybersecurity. This inevitably leads to a high rate of false positives in the detection of relevant tweets (e.g., filtering for the keyword “vulnerability” may result in storing a personal or spiritual tweet as one related to cybersecurity). Also, the flexible typography of tweets and the emergence of new terminology lead to the neglect of potentially valuable information in tweets. Furthermore, the current state of research in this area still lacks an open-source dataset of manually annotated cyber-related tweets, which curtails further efforts to validate, compare, and extend current frameworks.

Utilization of OSINT in CTI, particularly via social informatics and text analytics, incurs the challenges of document filtering and threat identification. In this work, we describe the development of a social media test bed based on information extraction and machine learning for relevance filtering and classification of new intelligence with respect to defined threat categories.
This test bed in turn is part of a data mining

1. https://www.zdnet.com/article/microsoft-windows-zero-day-disclosed-on-twitter-again/

2018 IEEE International Conference on Big Data (Big Data) 978-1-5386-5035-6/18/$31.00 ©2018 IEEE