https://doi.org/10.1177/00027642211021630
American Behavioral Scientist
1–25
© 2021 SAGE Publications
Article
Random Sampling in Corpus Design: Cross-Context Generalizability in Automated Multicountry Protest Event Collection
Erdem Yörük¹,², Ali Hürriyetoğlu¹, Fırat Duruşan¹, and Çağrı Yoltar¹
Abstract
What is the best way of creating a gold standard corpus for training a machine
learning system that is designed for automatically collecting protest information in a
cross-country context? We show that creating a gold standard corpus for training
and testing machine learning models on the basis of randomly chosen news articles
from news archives yields better performance than selecting news articles on the
basis of keyword filtering, which is the most prevalent method currently used in
automated event coding. We advance this new bottom-up approach to ensure
generalizability and reliability in cross-country comparative protest event collection
from international and local news in different countries, languages, sources and time
periods, which entails a large variety of event types, actors, and targets. We present
the results of comparing our random-sample approach with keyword filtering. We
show that machine learning algorithms, and particularly state-of-the-art deep
learning tools, perform much better when they are trained on a gold standard
corpus built from a randomly selected set of news articles from China, India, and
South Africa. Finally, we present our approach to overcoming the major ethical
issues that are intrinsic to protest event coding.
Keywords
natural language processing, machine learning, protests, contentious politics, event
data extraction, language resources
¹ Koç University, Istanbul, Turkey
² University of Oxford, Oxford, UK
Corresponding Author:
Erdem Yörük, Department of Sociology, Koç University, Istanbul 34450, Turkey.
Email: eryoruk@ku.edu.tr