INTERACTIONS MAY–JUNE 2023
BLOG@IX
The Interactions website (interactions.acm.org) hosts a stable of bloggers who share insights and observations on HCI, often challenging current practices. Each issue we’ll publish selected posts from some of the leading and emerging voices in the field.

OpenSpeaks Before AI
Frameworks for Creating the AI/ML Building Blocks for Low-Resource Languages
Subhashish Panigrahi

There has been a tremendous push on many levels to make artificial intelligence– and machine learning–based applications ubiquitous. Soon, the life decisions of almost every digital technology user will be affected by some form of algorithmic decision making.

However, the development of large language models (LLMs) that drive this research and development often lacks participation from people with diverse backgrounds, ignoring historically oppressed communities such as Black and other ethnolinguistic or socioeconomic minority groups, women, transgender individuals, people with disabilities, and elderly individuals globally, and the Dalit-Bahujan-Adivasi communities in South Asia and the diaspora. Data about and by these people is therefore systematically suppressed. Even more problematic is that this data is mostly suppressed in creating the LLMs driving AI/ML research and development.

Furthermore, seemingly public information might not always be collected ethically with informed consent from the people affected. Even mature regulatory frameworks such as the General Data Protection Regulation (GDPR) in the European Union do not provide enough guidance on how private data is collected, stored, and shared. Naturally, those behind LLM creation do not have a clue about the biases in their data or how it is collected. Take the case of DALL-E 2 models, which use publicly available images owned and copyrighted by different people, or ChatGPT, which uses massive datasets from multiple sources. In both instances, not only does the LLM creation lack the representation of marginalized groups and contain only biased data about them, but the outcomes that derive from the training data also make these groups even more vulnerable.

LOW-RESOURCE LANGUAGES
The creation of LLMs like GPT-3, when used in applications such as chatbots, directly affects the dominant-language users. Underpaid tech support workers subcontracted to support users in developed countries might even see these chatbots as a potential threat. But when it comes to low- and medium-resourced languages, the issues stemming from biases and low representation can aggravate things further.

The issues faced by native speakers of many Indigenous, endangered, and low- and medium-resourced languages are poorly documented or missing in HCI research and development, particularly in AI-based tech innovations. For instance, issues with script input or other technological problems are generally documented and fixed for the most well-established and dominant writing systems and languages. Speakers of many languages spoken and written in nondominant settings often do not have the know-how or the means to report these issues publicly, or discuss them privately.

OPENSPEAKS BEFORE AI
Mozilla defines trustworthy AI as “AI demonstrably worthy of trust, tech that considers accountability, agency, and individual and collective well-being” [1]. As a part of this, Mozilla started the MozFest Trustworthy AI Working Groups; as members of the 2021 working group cohort, we at the O Foundation piloted an experimental framework called OpenSpeaks Before AI [2]. Instead of treating AI as a stand-alone area, we looked at a few open-source platforms that allow users to generate multilingual big data (useful for AI/ML) and audit them openly. We tried to see whether this pilot could help us derive best practices that were inclusive in nature and relevant for low- and medium-resourced languages.

Broader feminist viewpoints [3] and two existing studies primarily inspired the process: a seminal paper titled “Datasheets for Datasets” [4], which focuses on identifying gaps and biases in datasets, and our own research on Web content monetization in two Indigenous languages from India: Ho and Santali [5].

We conducted two open audits in two languages, Odia and Santali, and of two recording platforms, Lingua Libre and Mozilla’s Common Voice, both of which help in creating multilingual speech data. Odia is a macrolanguage from India with nearly 45 million speakers; Santali is an Indigenous Indian language spoken by 9.6 million people. Lingua Libre and Common Voice are open-source platforms that allow users to record words and phrases (Lingua Libre) and sentences (Common Voice). The Lingua Libre study and its outcomes were explained in detail in the Wiki Workshop 2022, focusing on Odia and its Baleswari dialect [6]. The audit of Common Voice for Santali was presented during Mozilla Festival (MozFest) 2022 [6]. The OpenSpeaks Before AI