arXiv:2201.05051v2 [cs.CL] 14 Jan 2022

Speech Resources in the Tamasheq Language

Marcely Zanon Boito¹, Fethi Bougares², Florentin Barbier³, Souhir Gahbiche³, Loïc Barrault², Mickael Rouvier¹, Yannick Estève¹
¹LIA - Avignon Université, France
²LIUM - Le Mans Université, France
³Airbus - France
contact: {marcely.zanon-boito, yannick.esteve}@univ-avignon.fr

Abstract

In this paper we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. These two datasets were made available for the IWSLT 2022 low-resource speech translation track, and they consist of collections of radio recordings from the Studio Kalangou (Niger) and Studio Tamani (Mali) daily broadcast news. We share (i) a massive amount of unlabeled audio data (671 hours) in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma, and (ii) a smaller parallel corpus of audio recordings (17 hours) in Tamasheq, with utterance-level translations in the French language. All this data is shared under the Creative Commons BY-NC-ND 3.0 license. We hope these resources will inspire the speech community to develop and benchmark models using the Tamasheq language.

Keywords: speech corpus, speech translation, tamasheq, zarma, hausa, fulfulde, french

1. Introduction

The vast majority of speech pipelines are developed for and in high-resource languages, a small percentage of languages for which there is a large amount of annotated data freely available (Joshi et al., 2020). This not only limits the investigation of language impact in current pipelines, as the applied languages are usually from the same subset, but it also fails to reflect the real-world performance these approaches will have in diverse and smaller datasets.

In recent years, the IWSLT Challenge¹ introduced a low-resource speech translation track focused on developing and benchmarking translation tools for under-resourced languages.
While for the vast majority of these languages there is a lack of speech translation parallel data at the scale needed to train large translation models, we might still have access to limited, disparate resources, such as word-level translations, small parallel text data, monolingual text, and raw audio. The challenge is then to leverage this data in order to build effective systems under these realistic settings.

¹ https://iwslt.org/2022/low-resource

In this paper we present the resources in the Tamasheq language that we share in the context of the IWSLT 2022 low-resource speech translation track. Tamasheq is a variety of Tuareg, a Berber macro-language spoken by nomadic tribes across North Africa in Algeria, Mali, Niger and Burkina Faso (Heath, 2006). It accounts for approximately 500,000 native speakers, and is mostly spoken in Mali and Niger (Ethnologue: Languages of the World, 2021). We share a large audio corpus, made of 224 hours of Tamasheq, together with 417 hours in four other languages of Niger (French, Fulfulde, Hausa and Zarma). We also share a smaller corpus of 17 hours of Tamasheq utterances aligned with French translations. We hope that these resources will represent an interesting use case for the speech community, allowing them not only to develop low-resource speech systems in Tamasheq, but also to investigate the leveraging of unannotated audio data in diverse languages that coexist in the same geographic region.

This paper is organized as follows. Section 2 presents the source content of the data shared: thanks to the Fondation Hirondelle Initiative and local partners, we are able to collect broadcast news in diverse African languages. Section 3 presents the small parallel corpus between Tamasheq and French, and Section 4 presents the collection of unannotated audio data in French from Niger, Fulfulde, Hausa, Tamasheq and Zarma.
Finally, Section 5 presents a speech translation baseline model, and Section 6 concludes this work.

2. The source content: The Fondation Hirondelle Initiative

The Fondation Hirondelle² is a Swiss non-profit organization founded in 1995 by journalists, with the goal of supporting local independent media in areas of social unrest. They produce and broadcast information and dialogue programs in different countries, providing local partners with editorial, managerial and structural support and training to function in a sustainable manner.

² https://www.hirondelle.org/en/

In this work we focus on their daily news radio episodes, produced by local partners and broadcast by them in different languages. These allow the local communities to get informed in their own dialects, in contrast to mainstream media that tends to cover only the countries’ official languages. For the Tamasheq lan-