MT-Adapted Datasheets for Datasets: Template and Repository

Marta R. Costa-jussà, Roger Creus, Oriol Domingo, Albert Domínguez, Miquel Escobar, Cayetana López, Marina Garcia and Margarita Geleta
TALP Research Center
Universitat Politècnica de Catalunya, Barcelona
marta.ruiz@upc.edu

Abstract

In this report, we use the standardized model proposed by Gebru et al. (2018) to document two popular machine translation datasets, EuroParl (Koehn, 2005) and News-Commentary (Barrault et al., 2019). Within this documentation process, we have adapted the original datasheet to the particular case of data consumers in the Machine Translation area. We also propose a repository for collecting the adapted datasheets in this research area.

1 Introduction

Social biases currently affect widely used natural language processing systems (Costa-jussà, 2019). While many alternatives have been proposed to mitigate this problem (Sun et al., 2019), there is still a long way to go (Gonen and Goldberg, 2019). Research directions vary from debiasing algorithms (Bolukbasi et al., 2016) to working directly towards fair or balanced datasets (Costa-jussà et al., 2020). While the research community remains active along these lines, there is an urgent need for transparency in our systems. As the original work of Gebru et al. (2018) rightly argues, DataSheets for DataSets proposes creating documentation for datasets within the machine learning community, in order to gain this transparency in both research and in-production systems that serve different social purposes.

In our work, we use the existing datasheet template¹ and slightly adapt it to serve two main purposes: to target dataset usage in Machine Translation (MT) and to be oriented towards dataset consumers. Our aim is to motivate the community to work on these datasheets, regardless of whether they are dataset creators, so that the datasets we are currently using are properly documented.
In fact, this report serves as the starting initiative for an open repository that aims to collect datasheets for MT datasets and can be accessed here².

The rest of the report is organised as follows. The next section reports how we have modified the datasheet template by both excluding and adding questions. Then, Section 3 describes the repository that collects the datasheets, which is open to contributions documenting MT datasets. Finally, Section 4 offers some concluding remarks. The appendices contain the datasheets for EuroParl (Koehn, 2005) and News-Commentary (Barrault et al., 2019), two of the most popular MT datasets.

2 DataSheet for DataSets: Adaptations

This section reports the main modifications made to the datasheet proposed by Gebru et al. (2018), targeting consumers of MT datasets. While we want to make minimal changes to the original datasheet template, we have two main reasons for this adaptation. First, MT clearly exhibits biases (Prates et al., 2020). While some solutions have been proposed from the algorithmic point of view (Font and Costa-jussà, 2019), along with ways to properly evaluate the bias (Stanovsky et al., 2019), no datasheet has been completed for any MT dataset since the first datasheet from Gebru et al. (2018) appeared. Second, MT is an already well-established area of research with many existing resources that are not documented at all. For this reason, we want to adapt the datasheet more to consumers than to creators.

¹ https://www.overleaf.com/latex/templates/datasheet-for-dataset-template/ztkyvzddvxtd
² https://mtdatasheets.cs.upc.edu

arXiv:2005.13156v1 [cs.CL] 27 May 2020