A Computational Tool to Study Vocal Participation
of Women in UN-ITU Meetings
Rajat Hebbar¹, Krishna Somandepalli¹, Raghuveer Peri¹, Ruchir Travadi¹,
Tracy Tuplin², Fernando Rivera², and Shrikanth Narayanan¹
¹ Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles
² International Telecommunication Union, United Nations, Geneva
Abstract—International organizations such as the United Na-
tions drive policies that impact our everyday lives. Diverse
representation of people and ideas in the decision making process
of such bodies is critical to ensure that the policies work for
everyone. One aspect of the representation is the participants’
expressed gender. In this work, we focus on analyzing meetings
at the International Telecommunication Union (ITU). These meet-
ings include a moderator who mediates the proceedings between
delegates from across the world speaking in different languages.
For the purpose of quantifying the participation of delegates, we
propose a scalable, human-in-the-loop system to first identify the
moderator’s speech and estimate the speaking time with respect
to gender for all the speakers. Our proposed system includes
three main audio modules: speech activity detection, gender
identification and moderator verification using a human-labelled
speech probe. We then estimate the percentage of speaking time,
controlling for the moderator’s speech. We present a detailed,
multilingual performance evaluation of the component systems
using state-of-the-art technologies for these tasks. Finally, we
examine the vocal participation of female delegates in the 2018
ITU Plenipotentiary Conference, which spanned 18 days and about
108 hours of audio recordings.
Index Terms—Gender representation, United Nations, multi-
lingual meetings, audio systems
I. INTRODUCTION
Over the years, female participation and involvement in
social and political spheres have seen a gradual uptrend [1].
However, this is not necessarily reflected in women’s influence
in the decision making processes at an institutional level.
Meetings, conferences and similar forums of discussion are
critical components of this process. A recent study [2] that
examined village community level meetings over a period
of seven months found that “Women's lack of participation
in important decision making is noted as an obstacle to
sustainable development” and “Even when women are present
at meetings they are still consistently less likely than men to
substantively participate”. Furthermore, multiple studies [3]–
[5] also reported under-participation of women in Q&A ses-
sions at academic conferences, despite nearly equal attendance
of both men and women. Besides highlighting the inequities
in our communities, such studies can also lead to systemic
changes in policy with a real impact on our daily lives [6].
A crucial limitation in scaling up such studies in related
domains is typically their dependence on manual labeling,
which is expensive, time-consuming and prone to errors.
A promising direction to address this limitation is the
recent effort to develop machine learning based tools that
automate our understanding of human representation in media
content [7]. A study of nearly 600 top-grossing Hollywood
movies from 2014–2019 found that men speak significantly
more than women, even when women appear on screen [7].
Our focus in this work is to measure the participation
of women in the policy discussions at the United Nations
International Telecommunication Union (UN-ITU¹). ITU is a
specialized agency that deals with setting international stan-
dards on issues related to communication technology (internet,
TV broadcasts, etc.), managing the radio-spectrum, satellite
orbits, and bridging the digital divide in the world. The
far-reaching impact of policy decisions made at the ITU makes it an
interesting case for studying the diversity of participants. In this
context, we wish to quantify female representation beyond
simple attendance counts by analyzing the audio recordings
of the ITU meetings.
ITU meetings are multilingual (see Fig. 2) and generally
presided over by a moderator, who speaks in English. The
moderator typically opens the meeting and mediates between the
representatives (delegates) from different countries throughout
the proceedings. As a result, the moderator is likely to account
for a substantial fraction of the total speech in the meeting.
Because our goal is to study the representation of the delegates’
speech in general, we need to reliably estimate the amount of
the moderator’s speech. To address this, we use a speaker
verification (SV) module to query speech segments belonging
to the moderator, and we account for these segments when
estimating delegate speaking time.
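To make the idea of accounting for the moderator concrete, the following is a minimal Python sketch and not the system described in this paper. It assumes that each detected speech segment already carries a gender label and a speaker embedding, that the moderator probe is represented by a single embedding, and that segments are matched to the probe by cosine similarity against a fixed threshold. The names Segment, cosine and delegate_speaking_share, the threshold value of 0.7 and the two-dimensional toy embeddings are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    gender: str       # label from a gender-identification step (assumed given)
    embedding: tuple  # speaker embedding for the segment (toy values here)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def delegate_speaking_share(segments, probe_embedding, threshold=0.7):
    """Percentage of delegate speaking time per gender.

    Segments whose similarity to the moderator probe reaches the
    (hypothetical) threshold are treated as moderator speech and skipped.
    """
    totals = {}
    for seg in segments:
        if cosine(seg.embedding, probe_embedding) >= threshold:
            continue  # attributed to the moderator, so excluded
        duration = seg.end - seg.start
        totals[seg.gender] = totals.get(seg.gender, 0.0) + duration
    total = sum(totals.values())
    if total == 0:
        return {}
    return {gender: 100.0 * t / total for gender, t in totals.items()}


# Toy example: the first segment resembles the probe and is excluded.
probe = (1.0, 0.0)
segments = [
    Segment(0.0, 60.0, "male", (0.9, 0.1)),
    Segment(60.0, 90.0, "female", (0.0, 1.0)),
    Segment(90.0, 180.0, "male", (0.1, 0.9)),
]
shares = delegate_speaking_share(segments, probe)
print(shares)
```

In this toy example the first 60 seconds are attributed to the moderator, so the percentages are computed over the remaining 120 seconds of delegate speech only. Without that exclusion, the moderator's speech would be counted toward the moderator's own gender and would skew the delegate statistics.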
Acoustic variability poses major challenges to speech pro-
cessing systems. Different languages, variable or noisy record-
ing conditions and within-individual differences are a few
prominent sources of variability. Meetings, in particular, are
susceptible to cross-talk, channel variations and other noises
associated with close-talk microphones such as breath noise
[8], [9]. These variables pose challenges to canonical audio
systems such as speech activity detection, automatic speech
recognition and speaker diarization [10], [11]. In this study,
we develop a framework using state-of-the-art methodologies
for the different components used to automate this process.
¹ https://www.itu.int/en/Pages/default.aspx
2021 International Conference on Content-Based Multimedia Indexing (CBMI) | 978-1-6654-4220-6/20/$31.00 ©2021 IEEE | DOI: 10.1109/CBMI50038.2021.9461888