A Computational Tool to Study Vocal Participation of Women in UN-ITU Meetings

Rajat Hebbar¹, Krishna Somandepalli¹, Raghuveer Peri¹, Ruchir Travadi¹, Tracy Tuplin², Fernando Rivera², and Shrikanth Narayanan¹
¹ Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles
² International Telecommunication Union, United Nations, Geneva

Abstract—International organizations such as the United Nations drive policies that impact our everyday lives. Diverse representation of people and ideas in the decision-making process of such bodies is critical to ensure that the policies work for everyone. One aspect of this representation is the participants' expressed gender. In this work, we focus on analyzing meetings at the International Telecommunication Union (ITU). These meetings include a moderator who mediates the proceedings between delegates from across the world speaking in different languages. To quantify the participation of delegates, we propose a scalable, human-in-the-loop system that first identifies the moderator's speech and then estimates the speaking time with respect to gender for all the speakers. Our proposed system includes three main audio modules: speech activity detection, gender identification and moderator verification using a human-labelled speech probe. We then estimate the percentage of speaking time, controlling for the moderator's speech. We present a detailed, multilingual performance evaluation of the component systems using state-of-the-art technologies for these tasks. Finally, we examine the vocal participation of female delegates in the 2018 ITU Plenipotentiary Conference, which spanned 18 days and about 108 hours of audio recordings.

Index Terms—Gender representation, United Nations, multilingual meetings, audio systems

I. INTRODUCTION

Over the years, female participation and involvement in social and political spheres have seen a gradual uptrend [1].
However, this is not necessarily reflected in women's influence in the decision-making processes at an institutional level. Meetings, conferences and similar forums of discussion are critical components of this process. A recent study [2] that examined village community-level meetings over a period of seven months found that “Women's lack of participation in important decision making is noted as an obstacle to sustainable development” and “Even when women are present at meetings they are still consistently less likely than men to substantively participate”. Furthermore, multiple studies [3]–[5] also reported under-participation of women in Q&A sessions at academic conferences, despite nearly equal attendance of men and women. Besides highlighting the inequities in our communities, such studies can also lead to systemic changes in policy with a real impact on our daily lives [6].

A crucial limitation in scaling up such studies in related domains is their typical dependence on manual labeling, which is expensive, time-consuming and prone to errors. A promising direction to address this limitation is the recent effort to develop machine learning based tools that automate our understanding of human representation in media content [7]. A study of nearly 600 top-grossing Hollywood movies from 2014–2019 found that men speak significantly more than women, even when women appear on screen [7].

Our focus in this work is to measure the participation of women in the policy discussions at the United Nations International Telecommunication Union (UN-ITU¹). ITU is a specialized agency that deals with setting international standards on issues related to communication technology (internet, TV broadcasts, etc.), managing the radio spectrum and satellite orbits, and bridging the digital divide in the world. The far-reaching impact of policy decisions made at the ITU makes it an interesting case for studying the diversity of participants.
In this context, we wish to quantify female representation beyond simple attendance counts by analyzing the audio recordings of the ITU meetings.

ITU meetings are multilingual (see Fig. 2) and generally presided over by a moderator, who speaks in English. The moderator typically begins the meeting and mediates among the representatives (delegates) from different countries throughout the proceedings. As a result, the moderator is likely to account for a substantial fraction of the total speech in the meeting. Because our goal is to study the representation of the delegates' speech in general, we need to reliably estimate the amount of the moderator's speech. To address this, we use a speaker verification (SV) module to query speech segments belonging to the moderator, and we account for these segments when estimating delegate speaking time.

Acoustic variability poses major challenges to speech processing systems. Different languages, variable or noisy recording conditions and within-individual differences are a few prominent sources of variability. Meetings, in particular, are susceptible to cross-talk, channel variations and other noises associated with close-talk microphones, such as breath noise [8], [9]. These variables pose challenges to canonical audio systems such as speech activity detection, automatic speech recognition and speaker diarization [10], [11]. In this study, we develop a framework using state-of-the-art methodologies for the different components used to automate this process.

¹ https://www.itu.int/en/Pages/default.aspx

2021 International Conference on Content-Based Multimedia Indexing (CBMI) | 978-1-6654-4220-6/20/$31.00 ©2021 IEEE | DOI: 10.1109/CBMI50038.2021.9461888
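As a rough illustration of the speaking-time estimate outlined in the introduction (delegate speech tallied by gender, with the moderator's segments set aside), a minimal sketch follows. It is not the authors' implementation: the segment tuple layout, the `is_moderator` flag and the function name `delegate_speaking_share` are hypothetical, and the labels are simply assumed to come from upstream speech activity detection, gender identification and speaker verification modules.

```python
from collections import defaultdict


def delegate_speaking_share(segments):
    """Return the percentage of delegate speaking time per gender label.

    `segments` is an iterable of (start_sec, end_sec, gender, is_moderator)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(float)
    for start, end, gender, is_moderator in segments:
        if is_moderator:
            # Moderator speech is excluded so that it does not skew the estimate.
            continue
        # Guard against malformed segments whose end precedes their start.
        totals[gender] += max(0.0, end - start)

    total_time = sum(totals.values())
    if total_time == 0:
        return {}  # no delegate speech detected
    return {g: 100.0 * t / total_time for g, t in totals.items()}


# Hypothetical segments: the moderator opens, then three delegate turns follow.
example_segments = [
    (0.0, 10.0, "male", True),      # moderator, ignored
    (10.0, 40.0, "female", False),
    (40.0, 100.0, "male", False),
    (100.0, 110.0, "female", False),
]
shares = delegate_speaking_share(example_segments)
```

Excluding moderator segments before normalizing means the percentages describe the delegates only, which matches the stated goal of studying delegate representation rather than total floor time.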