Discrimination of Automatically Generated Questions Used as Formative Practice

Benny G. Johnson, VitalSource Technologies, Pittsburgh, PA, USA, benny.johnson@vitalsource.com
Jeffrey S. Dittel, VitalSource Technologies, Milwaukee, WI, USA, jeff.dittel@vitalsource.com
Rachel Van Campenhout, VitalSource Technologies, Pittsburgh, PA, USA, rachel@acrobatiq.com
Bill Jerome, VitalSource Technologies, Pittsburgh, PA, USA, bill.jerome@vitalsource.com

ABSTRACT
Advances in artificial intelligence and automatic question generation (AQG) have made it possible to generate the volume of formative practice questions needed to engage students in learning by doing. These automatically generated (AG) questions can be integrated with textbook content in a courseware environment so that students can practice as they read. Scaling this learn-by-doing method is a valuable pursuit, as it is proven to cause better learning outcomes (i.e., the doer effect). However, it is also necessary to ensure these AG questions perform as well as human-authored (HA) questions. In previous studies, it was found that AG and HA questions were essentially equivalent with respect to student engagement, difficulty, and persistence. While these question performance metrics expanded existing AQG research, this paper further extends this research by evaluating question discrimination using student data from a university Neuroscience course. It is found that the AG questions also perform as well as HA questions with respect to discrimination.

CCS CONCEPTS
• Applied computing → Education → Interactive learning environments • Computing methodologies → Artificial intelligence • General and reference → Cross-computing tools and techniques → Metrics • General and reference → Cross-computing tools and techniques → Performance

KEYWORDS
Automatic question generation; Automatically generated questions; Formative practice; Question discrimination; Item response theory; Courseware; Artificial intelligence; Natural language processing; In vivo experimentation.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author(s).
L@S '22, June 1–3, 2022, New York City, NY, USA
© 2022 Copyright is held by the owner/author(s).
ACM ISBN 978-1-4503-9158-0/22/06.
https://doi.org/10.1145/3491140.3528323

ACM Reference format:
Benny G. Johnson, Jeffrey S. Dittel, Rachel Van Campenhout, & Bill Jerome. 2022. Discrimination of Automatically Generated Questions Used as Formative Practice. In Proceedings of the Ninth ACM Conference on Learning @ Scale (L@S '22), June 1–3, 2022, New York City, NY, USA. ACM, New York, NY, USA. 5 pages. https://doi.org/10.1145/3491140.3528323

1 Introduction
Automatic question generation (AQG) is of particular interest in educational contexts for its use in varied applications and for diverse subjects and learners. One high-value use of AQG is to generate formative practice aligned with expository text content, as this method of learning by doing has been proven to cause better learning outcomes [9, 16]. In a recent systematic review, Kurdi et al. [10] reviewed 93 AQG studies and concluded there was no "gold standard" of AQG performance, largely due to the heterogeneity of quality metrics reported.
It is also clear that AQG research would benefit from evaluating performance using student data from natural learning contexts, rather than relying primarily on expert review. In previous work [14, 15], we conducted what was at the time the largest comparison of automatically generated (AG) and human-authored (HA) questions, using student data from authentic learning contexts, and found that AG and HA questions performed similarly on measures of engagement, difficulty, and persistence; there was no practical difference in how students interacted with them. In this study, we extend that investigation to discrimination, an important psychometric property and measure of question quality, using student data for AG and HA questions in a courseware environment for a university Neuroscience course. To our knowledge, this is the first reported analysis of the discrimination of AG questions, which requires larger data sets than many other metrics of interest, such as difficulty.

Discrimination is the concept of how well a question can distinguish between high-ability and low-ability students. The greater a question's discrimination, the more strongly a correct answer indicates that the student has higher ability and an incorrect answer indicates lower ability. Conversely, low discrimination means the correctness of the answer is less indicative of the student's ability. As such, questions with very low discrimination do not give sufficient information about student ability, and it is better to revise or replace them. We can therefore see the benefit of discrimination as a metric to assess the quality of the AG questions.
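As a point of reference for how discrimination is commonly quantified (a standard item response theory formulation, consistent with this paper's keywords, though not a claim about the authors' exact estimation procedure): in the two-parameter logistic (2PL) model, the probability that a student of ability $\theta$ answers an item correctly is

\[
P(X = 1 \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}},
\]

where $b$ is the item's difficulty and $a$ is its discrimination. A large $a$ makes the response curve rise steeply near $\theta = b$, so the correctness of a response carries substantial information about student ability; as $a \to 0$ the curve flattens and a response reveals little about ability, matching the intuition described above.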