2nd International Workshop on Data Quality Assessment for Machine Learning

Hima Patel, IBM Research, India, himapatel@in.ibm.com
Fuyuki Ishikawa, NII, Japan, f-ishikawa@nii.ac.jp
Laure Berti-Equille, IRD, France, laure.berti@ird.fr
Nitin Gupta, IBM Research, India, ngupta47@in.ibm.com
Sameep Mehta, IBM Research, India, sameepmehta@in.ibm.com
Satoshi Masuda, IBM Research, Japan, smasuda@jp.ibm.com
Shashank Mujumdar, IBM Research, India, shamujum@in.ibm.com
Shazia Afzal, IBM Research, India, shaafzal@in.ibm.com
Srikanta Bedathur, IIT Delhi, India, srikanta@cse.iitd.ac.in
Yasuharu Nishi, UEC, Japan, yasuharu.nishi@uec.ac.jp

Author names are arranged alphabetically.

ABSTRACT
The 2nd International Workshop on Data Quality Assessment for Machine Learning (DQAML’21) is organized in conjunction with the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). This workshop aims to serve as a forum for the presentation of research related to data quality assessment and remediation in AI/ML pipelines. Data quality is a critical issue in the data preparation phase and involves numerous challenging problems related to the detection, remediation, visualization and evaluation of data issues. The workshop aims to provide a platform for researchers and practitioners to discuss such challenges across different modalities of data such as structured, time series, text and graphical data. The aim is to attract perspectives from both industrial and academic circles.

CCS CONCEPTS
• Computing methodologies → Machine learning algorithms.

KEYWORDS
Data Quality, Data Assessment, Machine Learning

ACM Reference Format:
Hima Patel, Fuyuki Ishikawa, Laure Berti-Equille, Nitin Gupta, Sameep Mehta, Satoshi Masuda, Shashank Mujumdar, Shazia Afzal, Srikanta Bedathur, and Yasuharu Nishi. 2021. 2nd International Workshop on Data Quality Assessment for Machine Learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3447548.3469468

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
KDD ’21, August 14–18, 2021, Virtual Event, Singapore.
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8332-5/21/08.
https://doi.org/10.1145/3447548.3469468

1 BACKGROUND
In the past decade, AI/ML technologies have become pervasive in academia and industry, finding utility in newer and more challenging applications. While most efforts are focused on building better, smarter and automated ML models, comparatively less attention has been paid to systematically understanding the challenges in training data and assessing its quality before it is fed to an ML pipeline. Issues such as incorrect labels, synonymous categories in a categorical variable, heterogeneity in columns, class imbalance, etc. that might go undetected by standard data pre-processing modules can lead to increased model complexity or cause sub-optimal model performance. Data preparation [1–5] is therefore rightly called out as one of the most time-consuming and challenging steps in the AI lifecycle. This is because the quality of the data is unknown at the acquisition stage, which triggers an iterative debugging process that relies mostly on the experience of the data scientist. Data quality assessment for ML is therefore an important step to estimate and improve data quality by identifying anomalies and applying corresponding remediation algorithms.
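As a rough illustration of the kinds of issues named above (class imbalance, synonymous categories in a categorical variable, heterogeneity in columns), the following minimal Python sketch shows what simple automated checks might look like. The function names, the normalization heuristic and the sample values are all hypothetical; they are assumptions made for illustration and are not taken from the workshop or from the cited work.

```python
from collections import Counter, defaultdict
import re


def imbalance_ratio(labels):
    """Return the count of the most frequent class divided by the least frequent."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def candidate_synonyms(categories):
    """Group category strings that collapse to the same normalized key.

    The normalization (lowercase, drop non-alphanumerics) is an illustrative
    heuristic only; it misses true synonyms such as "USA" vs "United States".
    """
    groups = defaultdict(set)
    for value in categories:
        key = re.sub(r"[^a-z0-9]", "", value.lower())
        groups[key].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def is_heterogeneous(column):
    """Flag a column of strings that mixes numeric-looking and other values."""
    def looks_numeric(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    # A homogeneous column yields a single kind (all True or all False).
    kinds = {looks_numeric(value) for value in column}
    return len(kinds) > 1


if __name__ == "__main__":
    # Hypothetical sample data, for demonstration only.
    print(imbalance_ratio(["spam"] * 9 + ["ham"]))
    print(candidate_synonyms(["NY", "ny", "N.Y.", "Boston"]))
    print(is_heterogeneous(["12", "7.5", "unknown"]))
```

Checks of this kind only surface candidates for review. Deciding whether a flagged group really is a set of synonyms, or whether an imbalance matters for the task, still depends on the judgment of the data scientist.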
Interesting issues that arise in this thread relate to optimizing humans-in-the-loop, offering adequate explanations, supporting smart visualizations, and providing actionable insights, amongst other topics listed in Section 1.1. The goal of this workshop is to attract researchers working in the related fields of data acquisition, data labeling, data quality, data preparation and AutoML to understand how detecting and remediating data issues helps towards building better models. This workshop invites researchers from academia and industry to submit novel propositions for systematically identifying and mitigating data issues in order to make data AI-ready.