Discrimination of Automatically Generated Questions
Used as Formative Practice
Benny G. Johnson
VitalSource Technologies
Pittsburgh, PA, USA
benny.johnson@vitalsource.com
Jeffrey S. Dittel
VitalSource Technologies
Milwaukee, WI, USA
jeff.dittel@vitalsource.com
Rachel Van Campenhout
VitalSource Technologies
Pittsburgh, PA, USA
rachel@acrobatiq.com
Bill Jerome
VitalSource Technologies
Pittsburgh, PA, USA
bill.jerome@vitalsource.com
ABSTRACT
Advances in artificial intelligence and automatic question
generation (AQG) have made it possible to generate the volume
of formative practice questions needed to engage students in
learning by doing. These automatically generated (AG) questions
can be integrated with textbook content in a courseware
environment so that students can practice as they read. Scaling
this learn-by-doing method is a valuable pursuit, as it is proven to
cause better learning outcomes (i.e., the doer effect). However, it
is also necessary to ensure these AG questions perform as well as
human-authored (HA) questions. In previous studies, AG and HA
questions were found to be essentially equivalent with respect to
student engagement, difficulty, and persistence.
While these question performance metrics expanded existing
AQG research, this paper extends that research further by
evaluating question discrimination using student data from a
university Neuroscience course. The AG questions are found to
perform as well as HA questions with respect to discrimination.
CCS CONCEPTS
• Applied computing ~ Education ~ Interactive learning
environments • Computing methodologies ~ Artificial
intelligence • General and reference ~ Cross-computing tools and
techniques ~ Metrics • General and reference ~ Cross-computing
tools and techniques ~ Performance
KEYWORDS
Automatic question generation; Automatically generated
questions; Formative practice; Question discrimination; Item
response theory; Courseware; Artificial intelligence; Natural
language processing; In vivo experimentation.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full
citation on the first page. Copyrights for third-party components of this work must
be honored. For all other uses, contact the Owner/Author(s).
L@S '22, June 1–3, 2022, New York City, NY, USA
© 2022 Copyright is held by the owner/author(s).
ACM ISBN 978-1-4503-9158-0/22/06.
https://doi.org/10.1145/3491140.3528323
ACM Reference format:
Benny G. Johnson, Jeffrey S. Dittel, Rachel Van Campenhout, & Bill Jerome.
2022. Discrimination of Automatically Generated Questions Used as
Formative Practice. In Proceedings of the Ninth ACM Conference on Learning
@ Scale (L@S '22), June 1–3, 2022, New York City, NY, USA. ACM, New York,
NY, USA. 5 pages. https://doi.org/10.1145/3491140.3528323
1 Introduction
Automatic question generation (AQG) is of particular interest in
educational contexts for its use in varied applications and for
diverse subjects and learners. One high-value use of AQG is to
generate formative practice aligned with expository text content,
as this learn-by-doing method has been proven to cause better
learning outcomes [9, 16]. In a recent systematic review, Kurdi et
al. [10] reviewed 93 AQG studies and concluded there was no
“gold standard” of AQG performance largely due to the
heterogeneity of quality metrics reported. It is also clear that AQG
research would benefit from evaluating performance using
student data from natural learning contexts, rather than having to
rely primarily on expert review. In previous work [14, 15], we
conducted what was then the largest comparison of automatically
generated (AG) questions and human-authored (HA) questions,
using student data from authentic learning contexts, and found
that AG and HA questions performed similarly on measures of
engagement, difficulty, and persistence; there was no practical
difference in how students interacted with them. In this study, we
extend that investigation to discrimination, an important
psychometric property and measure of question quality, using
student data for AG and HA questions in a courseware
environment from a university Neuroscience course. To our
knowledge, this is the first reported analysis of discrimination of
AG questions, which requires larger data sets than many other
metrics of interest such as difficulty.
Discrimination is the concept of how well a question can
distinguish between high-ability and low-ability students. The
greater a question's discrimination, the more likely a correct
answer indicates that the student has a higher ability and that an
incorrect answer indicates a lower ability. Conversely, low
discrimination means the correctness of the answer is less
indicative of the student’s ability. As such, questions with very
low discrimination do not give sufficient information about
student ability, and it is better to revise or replace them. We can
therefore see the benefit of discrimination as a metric to assess the
quality of AG questions.
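As a concrete illustration (a minimal sketch of our own, not the
procedure used in this study), a classical estimate of discrimination
is the corrected item-total point-biserial correlation: the correlation
between students' responses to a question and their scores on the
remaining questions. In item response theory, the analogous
quantity is the slope parameter a of the two-parameter logistic
model, P(correct | θ) = 1 / (1 + exp(−a(θ − b))), where θ is student
ability and b is question difficulty. The Python sketch below
computes the point-biserial form from a binary response matrix;
the function name and toy data are hypothetical.

    import numpy as np

    def point_biserial_discrimination(responses):
        """Corrected item-total correlation for each question.

        responses: rows = students, columns = questions; entries are
        1.0 (correct), 0.0 (incorrect), or np.nan (unanswered).
        """
        n_items = responses.shape[1]
        disc = np.full(n_items, np.nan)
        for j in range(n_items):
            item = responses[:, j]
            # Mean score on the remaining questions, excluding question j
            # so the item is not correlated with a total containing itself.
            rest_score = np.nanmean(np.delete(responses, j, axis=1), axis=1)
            mask = ~np.isnan(item) & ~np.isnan(rest_score)
            x, y = item[mask], rest_score[mask]
            if x.size > 1 and x.std() > 0 and y.std() > 0:
                disc[j] = np.corrcoef(x, y)[0, 1]
        return disc

    # Hypothetical toy data: 6 students answering 3 questions.
    data = np.array([[1, 1, 1],
                     [1, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 0]], dtype=float)
    print(point_biserial_discrimination(data))

Questions whose estimate is near zero (or negative) are the ones
flagged above as candidates for revision or replacement.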