Using Transformer Language Models to Validate Peer-Assigned Essay Scores in Massive Open Online Courses (MOOCs)

Wesley Morris, wesley.g.morris@vanderbilt.edu, Vanderbilt University, Nashville, Tennessee, USA
Scott A. Crossley, scott.crossley@vanderbilt.edu, Vanderbilt University, Nashville, Tennessee, USA
Langdon Holmes, langdon.holmes@vanderbilt.edu, Vanderbilt University, Nashville, Tennessee, USA
Anne Trumbore, TrumboreA@darden.virginia.edu, University of Virginia, USA

ABSTRACT
Massive Open Online Courses (MOOCs) such as those offered by Coursera are popular ways for adults to gain important skills, advance their careers, and pursue their interests. Within these courses, students are often required to compose, submit, and peer review written essays, providing a valuable pedagogical experience for the student and a wealth of natural language data for the educational researcher. However, the scores provided by peers do not always reflect the actual quality of the text, raising questions about the reliability and validity of the scores. This study evaluates methods to increase the reliability of MOOC peer-review ratings through a series of validation tests on peer-reviewed essays. Reviewer reliability was assessed using correlations between text length and essay quality. Raters were pruned based on score variance and the lexical diversity observed in their comments to create subsets of raters. Each subset was then used as training data to finetune distilBERT large language models to automatically score essay quality as a measure of validation. The accuracy of each language model for each subset was evaluated. We find that training language models on data subsets produced by more reliable raters, identified by a combination of score variance and lexical diversity, produces more accurate essay-scoring models. The approach developed in this study should allow for enhanced reliability of peer-review scoring in MOOCs, affording greater credibility within these systems.
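The rater-pruning step described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact implementation: the function names, thresholds, and the use of a type-token ratio as the lexical diversity measure are assumptions for demonstration purposes.

```python
from statistics import pvariance


def lexical_diversity(comment: str) -> float:
    """Type-token ratio: proportion of unique words in a comment."""
    tokens = comment.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def prune_raters(raters, min_variance=0.25, min_ttr=0.5):
    """Drop raters whose scores never vary or whose comments are repetitive.

    `raters` maps a rater id to a (scores, comments) pair. Thresholds are
    illustrative; in practice they would be tuned on held-out data.
    """
    kept = {}
    for rater_id, (scores, comments) in raters.items():
        mean_ttr = sum(lexical_diversity(c) for c in comments) / len(comments)
        if pvariance(scores) >= min_variance and mean_ttr >= min_ttr:
            kept[rater_id] = (scores, comments)
    return kept
```

A rater who assigns the same score to every essay (zero variance) or who pastes the same short comment repeatedly (low lexical diversity) would be excluded, and only essays scored by the remaining raters would enter the finetuning subset.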
CCS CONCEPTS
• General and reference → Validation; • Computing methodologies → Natural language processing; • Applied computing → E-learning; Collaborative learning.

KEYWORDS
transformers, natural language processing, rater reliability, MOOCs

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

LAK 2023, March 13–17, 2023, Arlington, TX, USA
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9865-7/23/03 . . . $15.00
https://doi.org/10.1145/3576050.3576098

ACM Reference Format:
Wesley Morris, Scott A. Crossley, Langdon Holmes, and Anne Trumbore. 2023. Using Transformer Language Models to Validate Peer-Assigned Essay Scores in Massive Open Online Courses (MOOCs). In LAK23: 13th International Learning Analytics and Knowledge Conference (LAK 2023), March 13–17, 2023, Arlington, TX, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3576050.3576098

1 INTRODUCTION
Ever since their introduction in 2008, Massive Open Online Courses (MOOCs) have provided opportunities for skill development and recognition for students who may not be able to attend traditional education courses [30]. The versatility of MOOCs has led to their quick growth, and, in 2016, the popular MOOC hosting site Coursera had over 17.5 million registered users [39].
Although concerns have been raised regarding the retention of students in the courses [34], students continue to perceive participation in MOOCs as a way to develop their cognitive interests, career goals, and interpersonal relationships [15, 20]. Assessment results generated by these courses provide a wealth of data for researchers, teachers, and administrators [8, 12, 36]. However, much of this behavioral data is based on click-stream logs, and data about learning is generally based on closed assessments such as multiple-choice items. The use of open responses such as essays is difficult to manage in MOOCs because the sheer number of students makes personalized teacher feedback difficult. One solution to incorporating open-ended assessments in MOOCs has been for students in the course to review samples written by the other students and assign scores to those samples based on holistic or analytic rubrics [1]. Unfortunately, research indicates that these peer-assigned scores may have serious problems with reliability and validity [19, 26, 43]. This paper seeks to address these problems by providing a method to increase the reliability of peer-assigned scores. We examine a corpus of 27,909 essays produced as a capstone project to a MOOC on design principles hosted by Coursera. We generate subsets of the essays by pruning raters suspected of providing unreliable scores based on the reviewer score variance and the lexical diversity of their comments. We use the correlations between score and word count as a measure of criterion validity, knowing that longer essays receive higher scores [13, 32]. We then develop large language models (LLMs) for each subset to predict the essay scores from the pruned subset of raters and test the accuracy of the models by
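The criterion validity check described above, correlating essay word count with peer-assigned score, can be sketched as follows. This is an assumed, self-contained illustration (using a hand-rolled Pearson correlation rather than whatever statistics package the study actually used), and the function names are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def length_score_correlation(essays, scores):
    """Correlate essay word counts with peer-assigned scores.

    A stronger positive correlation on a pruned subset would be one signal
    that the retained raters score more consistently with text length.
    """
    word_counts = [len(essay.split()) for essay in essays]
    return pearson(word_counts, scores)
```

Under the paper's assumption that longer essays tend to receive higher scores, this correlation can be compared across rater subsets as a rough external check on rating quality.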