2nd International Workshop on Data Quality Assessment for
Machine Learning
Hima Patel∗
IBM Research, India
himapatel@in.ibm.com
Fuyuki Ishikawa
NII, Japan
f-ishikawa@nii.ac.jp
Laure Berti-Equille
IRD, France
laure.berti@ird.fr
Nitin Gupta
IBM Research, India
ngupta47@in.ibm.com
Sameep Mehta
IBM Research, India
sameepmehta@in.ibm.com
Satoshi Masuda
IBM Research, Japan
smasuda@jp.ibm.com
Shashank Mujumdar
IBM Research, India
shamujum@in.ibm.com
Shazia Afzal
IBM Research, India
shaafzal@in.ibm.com
Srikanta Bedathur
IIT Delhi, India
srikanta@cse.iitd.ac.in
Yasuharu Nishi
UEC, Japan
yasuharu.nishi@uec.ac.jp
ABSTRACT
The 2nd International Workshop on Data Quality Assessment for
Machine Learning (DQAML’21) is organized in conjunction with
the Special Interest Group on Knowledge Discovery and Data Min-
ing (SIGKDD). This workshop aims to serve as a forum for the
presentation of research related to data quality assessment and
remediation in AI/ML pipelines. Data quality is a critical issue in
the data preparation phase and involves numerous challenging
problems related to the detection, remediation, visualization and
evaluation of data issues. The workshop aims to provide a platform
for researchers and practitioners to discuss such challenges across
different modalities of data, such as structured, time series, text and
graphical. The aim is to attract perspectives from both industrial
and academic circles.
CCS CONCEPTS
· Computing methodologies → Machine learning algorithms.
KEYWORDS
Data Quality, Data Assessment, Machine Learning
ACM Reference Format:
Hima Patel, Fuyuki Ishikawa, Laure Berti-Equille, Nitin Gupta, Sameep
Mehta, Satoshi Masuda, Shashank Mujumdar, Shazia Afzal, Srikanta Be-
dathur, and Yasuharu Nishi. 2021. 2nd International Workshop on Data
Quality Assessment for Machine Learning. In Proceedings of the 27th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’21),
August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA,
2 pages. https://doi.org/10.1145/3447548.3469468
∗ Author names are arranged alphabetically.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
KDD ’21, August 14–18, 2021, Virtual Event, Singapore.
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8332-5/21/08.
https://doi.org/10.1145/3447548.3469468
1 BACKGROUND
In the past decade, AI/ML technologies have become pervasive in
academia and industry, finding utility in newer and more challenging
applications. While most efforts focus on building better, smarter
and more automated ML models, comparatively less attention has been
paid to systematically understanding the challenges in training data
and assessing its quality before it is fed to an ML pipeline. Issues
such as incorrect labels, synonymous categories in a categorical
variable, heterogeneity in columns, and class imbalance can go
undetected by standard data pre-processing modules and lead to
increased model complexity or sub-optimal model performance. Data
preparation [1–5] is thus rightly called out as one of the most
time-consuming and challenging steps in an AI lifecycle: the quality
of data remains unknown at the acquisition stage, triggering an
iterative debugging process that mostly leverages the experience of a
data scientist. Data quality assessment for ML is therefore an
important step to estimate and improve data quality by identifying
anomalies and applying corresponding remediation algorithms.
Interesting issues that arise in this thread relate to optimizing
humans-in-the-loop, offering adequate explanations, supporting smart
visualizations, and providing actionable insights, amongst other
topics listed in Section 1.1.
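As a rough illustration of two of the data issues named above (class imbalance and synonymous categories in a categorical variable), the following Python sketch shows what simple automated checks might look like. It is not taken from the workshop or from any tool cited here: the function names, the sample data, and the use of trimming plus case-folding as a stand-in for synonym detection are all assumptions made for illustration. True synonyms (for example, "USA" versus "United States") would need a richer matching method.

```python
from collections import Counter, defaultdict


def class_imbalance_ratio(labels):
    """Return the ratio of the largest class count to the smallest one.

    Hypothetical helper: a ratio far above 1.0 hints at class imbalance.
    Assumes `labels` is non-empty.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def synonymous_categories(values):
    """Group raw category strings that collide after normalization.

    Hypothetical helper: normalization here is only strip + casefold,
    a deliberately simple proxy for synonym detection.
    """
    groups = defaultdict(set)
    for value in values:
        groups[value.strip().casefold()].add(value)
    # Keep only normalized keys that have more than one raw spelling.
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


if __name__ == "__main__":
    # Made-up sample data, for demonstration only.
    labels = ["spam", "ham", "ham", "ham", "ham", "ham"]
    cities = ["New York", "new york ", "Boston", "NEW YORK", "boston"]
    print(class_imbalance_ratio(labels))
    print(synonymous_categories(cities))
```

Checks of this kind only flag candidate issues; deciding whether to merge categories or rebalance classes still calls for the human-in-the-loop judgment discussed above.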
The goal of this workshop is to attract researchers working in the
related fields of data acquisition, data labeling, data quality, data
preparation and AutoML to understand how detecting and remediating
data issues helps build better models. This workshop invites
researchers from academia and industry to submit novel propositions
for systematically identifying and mitigating data issues to make
data AI-ready.