Just Another Day on Twitter: A Complete 24 Hours of Twitter Data
Jürgen Pfeffer¹, Daniel Matter¹, Kokil Jaidka², Onur Varol³, Afra Mashhadi⁴, Jana Lasser⁵,¹⁵, Dennis Assenmacher⁶, Siqi Wu⁷, Diyi Yang⁸, Cornelia Brantner⁹, Daniel M. Romero⁷, Jahna Otterbacher¹⁰, Carsten Schwemmer¹¹, Kenneth Joseph¹², David Garcia¹³, Fred Morstatter¹⁴
¹ School of Social Science and Technology, Technical University of Munich, Germany
² Centre for Trusted Internet and Community, National University of Singapore, Singapore
³ Computer Science Department, Sabanci University, Turkey
⁴ School of Science, Technology, Engineering & Mathematics, University of Washington (Bothell), USA
⁵ Faculty of Computer Science and Biomedical Engineering, Graz University of Technology, Austria
⁶ GESIS – Leibniz Institute for the Social Sciences, Germany
⁷ School of Information, University of Michigan, USA
⁸ Computer Science Department, Stanford University, USA
⁹ Department of Geography, Media and Communication, Karlstad University, Sweden
¹⁰ Faculty of Pure and Applied Sciences, Open University of Cyprus & CYENS CoE, Cyprus
¹¹ Department of Sociology, Ludwig Maximilian University of Munich, Germany
¹² Department of Computer Science and Engineering, University at Buffalo, USA
¹³ Department of Politics and Public Administration, University of Konstanz, Germany
¹⁴ Information Sciences Institute, University of Southern California, USA
¹⁵ Complexity Science Hub Vienna, Austria
Abstract
At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform’s future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected all 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform’s ownership change.
Introduction
On March 21, 2006, Twitter’s first CEO Jack Dorsey sent the first message on the platform. In the subsequent 16 years, close to 3 trillion tweets have been sent.¹ Roughly two-thirds of these are no longer accessible: the senders deleted them, the accounts (and all their tweets) were banned from the platform or made private by their users, or the tweets are otherwise inaccessible via the historic search with the v2 API endpoints. By utilizing Twitter’s count/all API and the approaches described in this article, we estimate that about 900 billion public tweets were on the platform when Elon Musk acquired Twitter in October 2022 for $44B.²
¹ While we do not have an official source for this number, it represents an educated guess from a collaboration of dozens of scholars of Twitter.
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Besides its possible economic value, Twitter has been instrumental in studying human behavior with social media data, and the entire field of Computational Social Science (CSS) has heavily relied on data from Twitter. At the AAAI International Conference on Web and Social Media (ICWSM), in the past two years alone (2021-2022), over 30 scientific papers analyzed a subset of Twitter data on topics ranging from public and mental health analyses to politics and partisanship. Indeed, since its emergence, Twitter has been described by researchers in the social sciences as a digital socioscope (i.e., social telescope) (Mejova, Weber, and Macy 2015), “a massive antenna for social science that makes visible both the very large (e.g., global patterns of communications) and the very small (e.g., hourly changes in emotions)”. Beyond CSS, Twitter data is increasingly used to train large pre-trained language models in natural language processing and machine learning, such as Bernice (DeLucia et al. 2022), where 2.5 billion tweets are used to develop representations for Twitter-specific languages, and TwHIN-BERT (Zhang et al. 2022), which leverages 7 billion tweets covering over 100 distinct languages to model short, noisy, and user-generated text.
Although Twitter data has fostered interdisciplinary re-
² https://www.nytimes.com/2022/10/27/technology/elon-musk-twitter-deal-complete.html
Proceedings of the Seventeenth International AAAI Conference on Web and Social Media (ICWSM 2023)