Corpus and Deep Learning Classifier for Collection of Cyber Threat
Indicators in Twitter Stream
Vahid Behzadan, Carlos Aguirre, Avishek Bose and William Hsu
Department of Computer Science
Kansas State University
{behzadan, caguirre, abose, bhsu}@ksu.edu
Abstract
This paper presents a framework for detection and
classification of cyber threat indicators in the Twitter
stream. In contrast to the bulk of similar proposals,
which rely on manually designed heuristics and
keyword-based filtering of tweets, our framework provides a
data-driven approach for modeling and classification
of tweets that are related to cybersecurity events.
We present a cascaded Convolutional Neural Network
(CNN) architecture, composed of a binary classifier
for detection of cyber-related tweets, and a multi-class
model for the classification of cyber-related tweets
into multiple types of cyber threats. Furthermore, we
present an open-source dataset of 21,000 annotated
cyber-related tweets to facilitate validation and
further research in this area.
1. Introduction
To keep pace with the growing complexity and
frequency of cyber attacks, defensive operations are
increasingly reliant on proactive measures. Such
approaches require the timely, accurate, and actionable
understanding of the threats that pose potential risks
to protected systems. To meet this vital need, the
paradigm of Cyber Threat Intelligence (CTI) has been
introduced as a framework to facilitate the exploration,
collection, and analysis of various sources of
information on cyber threats.
As an important source of information, Open-Source
Intelligence (OSINT) has proven to be a
valuable resource for CTI. In particular, Twitter is
regarded as a rich source of OSINT. The popularity
of this medium among the cybersecurity community
provides an environment for both the offensive and
defensive practitioners to discuss, report, and advertise
timely indicators of vulnerabilities, attacks, malware,
and other types of cyber events that are of interest to
CTI analysts. The value of Twitter with regard to CTI
is well demonstrated by the numerous initial reports of
major cyber events, recent examples of which include
disclosures of multiple 0-day Microsoft Windows
vulnerabilities (footnote 1), user reports on DDoS attacks [1], and
exposure of ransomware campaigns [2].
Over the recent years, the research on Twitter-based
OSINT collection has led to the proposal of multiple
frameworks (e.g., [3], [4], [5], [6], [7], [8], [9]) for
detection and analysis of threat indicators in the Twitter
stream. However, the majority of these proposals are
heavily based on manual heuristics such as keyword
lists for detecting and filtering tweets that are relevant
to cybersecurity. This inevitably leads to a high rate of
false positives in the detection of relevant tweets (e.g.,
filtering for the keyword “vulnerability” may result in
storing a personal or spiritual tweet as one related to
cybersecurity). Also, flexible typography and the
emergence of new terminology cause keyword filters to
miss potentially valuable information in tweets.
Furthermore, the current state of the research in this area still
lacks an open-source dataset of manually annotated
cyber-related tweets, which curtails further efforts to validate,
compare, and extend current frameworks.
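The two failure modes of keyword filtering described above can be illustrated with a minimal sketch. The keyword list, the example tweets, and the function name keyword_filter are all hypothetical and chosen only for illustration; they are not taken from this work or from any of the cited frameworks.

```python
import re

# Hypothetical keyword list; an assumption for illustration only.
KEYWORDS = {"vulnerability", "exploit", "ransomware", "ddos"}


def keyword_filter(tweet: str) -> bool:
    """Return True if any keyword appears as a whole token in the tweet."""
    tokens = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    return not KEYWORDS.isdisjoint(tokens)


# Made-up tweets paired with a hand-assigned relevance label.
tweets = [
    ("Showing vulnerability is the first step toward real friendship.", False),
    ("New vulnerability in the SMB service allows remote code execution.", True),
    ("Fresh 0day for Windows task scheduler just dropped, PoC on GitHub.", True),
    ("r4nsomw4re campaign hitting hospitals again this week.", True),
]

# Irrelevant tweets that the filter keeps (keyword used in another sense).
false_positives = [t for t, relevant in tweets if keyword_filter(t) and not relevant]

# Relevant tweets that the filter drops (new terminology, altered spelling).
false_negatives = [t for t, relevant in tweets if not keyword_filter(t) and relevant]

print("false positives:", false_positives)
print("false negatives:", false_negatives)
```

In this toy setting, the personal tweet is retained because it contains the word "vulnerability", while the tweets using "0day" and the altered spelling "r4nsomw4re" are discarded. Extending the keyword list would recover these particular cases, but only after an analyst has already noticed the new terms.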
Utilization of OSINT in CTI, particularly via social
informatics and text analytics, incurs the challenges
of document filtering and threat identification. In this
work, we describe the development of a social media
test bed based on information extraction and machine
learning for relevance filtering and classification of
new intelligence with respect to defined threat
categories. This test bed in turn is part of a data mining
1. https://www.zdnet.com/article/microsoft-windows-zero-day-disclosed-on-twitter-again/
2018 IEEE International Conference on Big Data (Big Data)
978-1-5386-5035-6/18/$31.00 ©2018 IEEE 5002