Data Hazards v1.0: an open-source vocabulary of ethical hazards for data-intensive projects

Natalie Zelenka 1,†,*, Nina H Di Cara 1,†, Euan Bennet 2, Ismael Kherroubi Garcia 3, Susana Romana Garcia 4, Vanessa Aisyahsari Hanschke 1, and Emma Kuwertz 5

1 University of Bristol
2 University of Glasgow
3 Kairoi
4 University of Edinburgh
5 European Centre for Medium-Range Weather Forecasts
† Joint first author
* Corresponding author: natalie.zelenka@bristol.ac.uk

ABSTRACT

The use of data-intensive methods for tasks that impact people's lives continues to accelerate. This has resulted in several high-profile, seemingly avoidable, ethical mistakes. Despite this, those with the power to change how data science is developed can lack the incentives, training, or support to properly consider the impacts of their work. This scrutiny is also rarely provided by review boards, whose focus tends to be on the protection of human participants rather than downstream outcomes. Supporting data scientists and technical staff to consider worst-case scenarios, and to facilitate discussions with ethics experts and those impacted by their methods, could allow us to work together across disciplinary boundaries to identify and avoid possible negative outcomes. A shared vocabulary could overcome two known barriers to such interdisciplinary conversations: mismatched vocabularies and the lack of an accessible structure for discussion. This paper presents the Data Hazards project, which contains a shared, controlled vocabulary of 11 data ethics issues, presented as 'labels' that represent potential negative outcomes of data science work. Like chemical hazard labels, the Data Hazards serve as warnings of potential risk and encourage the adoption of appropriate safety measures. The labels were co-created as an open-source project and will continue to evolve over time with input from various communities. In this paper we present version 1.0.
We present the 11 current labels, each consisting of an image, a description, examples of when it applies, and, crucially, safety precautions that provide concrete steps to address concerns. The project provides documentation for using these labels that supports technical decision-makers in considering the perspectives of a diverse audience on these issues, through workshops or asynchronously. The 'alpha' set of Data Hazards was evaluated through a series of workshops with 47 participants. In these workshops, participants saw a presentation on one of five data science projects and applied the labels to it before and after a structured reflective discussion. The projects covered a range of data science applications, from data collection to natural language processing. Overall, 94% of participants found the labels useful, 92% found the concepts clear, and 72% thought they were easy to apply to a real project. Most workshop participants (89%) found the reflective discussion format useful, and we include reflections from two project owners on their experience of presenting their work.