Using Transformer Language Models to Validate Peer-Assigned
Essay Scores in Massive Open Online Courses (MOOCs)
Wesley Morris
wesley.g.morris@vanderbilt.edu
Vanderbilt University
Nashville, Tennessee, USA
Scott A. Crossley
scott.crossley@vanderbilt.edu
Vanderbilt University
Nashville, Tennessee, USA
Langdon Holmes
langdon.holmes@vanderbilt.edu
Vanderbilt University
Nashville, Tennessee, USA
Anne Trumbore
TrumboreA@darden.virginia.edu
University of Virginia
USA
ABSTRACT
Massive Open Online Courses (MOOCs) such as those offered by
Coursera are popular ways for adults to gain important skills, ad-
vance their careers, and pursue their interests. Within these courses,
students are often required to compose, submit, and peer review
written essays, providing a valuable pedagogical experience for the
student and a wealth of natural language data for the educational
researcher. However, the scores provided by peers do not always
reflect the actual quality of the text, generating questions about the
reliability and validity of the scores. This study evaluates methods
to increase the reliability of MOOC peer-review ratings through
a series of validation tests on peer-reviewed essays. Reliability of
reviewers was based on correlations between text length and essay
quality. Raters were pruned based on score variance and the lexical
diversity observed in their comments to create subsets of raters.
Each subset was then used as training data to finetune distilBERT
large language models to automatically score essay quality as a
measure of validation. The accuracy of each language model for
each subset was evaluated. We find that training language models
on data subsets produced by more reliable raters, selected using a
combination of score variance and lexical diversity, produces more
accurate essay scoring models. The approach developed in this study
should allow for enhanced reliability of peer-reviewed scoring in
MOOCs, affording greater credibility within these systems.
CCS CONCEPTS
• General and reference → Validation; • Computing methodologies
→ Natural language processing; • Applied computing
→ E-learning; Collaborative learning.
KEYWORDS
transformers, natural language processing, rater reliability, moocs
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
LAK 2023, March 13–17, 2023, Arlington, TX, USA
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9865-7/23/03. . . $15.00
https://doi.org/10.1145/3576050.3576098
ACM Reference Format:
Wesley Morris, Scott A. Crossley, Langdon Holmes, and Anne Trumbore.
2023. Using Transformer Language Models to Validate Peer-Assigned Es-
say Scores in Massive Open Online Courses (MOOCs). In LAK23: 13th
International Learning Analytics and Knowledge Conference (LAK 2023),
March 13–17, 2023, Arlington, TX, USA. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3576050.3576098
1 INTRODUCTION
Ever since their introduction in 2008, Massive Open Online Courses
(MOOCs) have provided opportunities for skill development and
recognition for students who may not be able to attend traditional
education courses [30]. The versatility of MOOCs has led to their
quick growth and, in 2016, the popular MOOC hosting site Coursera
had over 17.5 million registered users [39]. Although concerns have
been raised regarding the retention of students in the courses [34],
students continue to perceive participation in MOOCs as a way to
develop their cognitive interests, career goals, and interpersonal
relationships [15, 20].
Assessment results generated by these courses provide a wealth
of data for researchers, teachers, and administrators [8, 12, 36].
However, much of this behavioral data is based on click-stream logs,
and data about learning is generally based on closed assessments
such as multiple-choice items. The use of open responses such as
essays is difficult to manage in MOOCs because the sheer number
of students makes personalized teacher feedback difficult.
One solution to incorporating open-ended assessments in MOOCs
has been for students in the course to review samples written by
the other students and assign scores to those samples based on
holistic or analytic rubrics [1]. Unfortunately, research indicates
that these peer-assigned scores may have serious problems with
reliability and validity [19, 26, 43].
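As an illustration of the rater-pruning idea described above, the sketch below filters out raters with low score variance or low comment lexical diversity (measured here as a simple type-token ratio). All rater names, data, and thresholds are hypothetical, and this is a minimal sketch rather than the study's actual pipeline.

```python
from statistics import pvariance

# Hypothetical per-rater records: the scores each rater assigned and the
# review comments they wrote. A rater who gives every essay the same score
# or writes repetitive comments is treated as less reliable.
raters = {
    "r1": {"scores": [3, 3, 3, 3],
           "comments": ["good good", "good good good"]},
    "r2": {"scores": [1, 4, 2, 5],
           "comments": ["clear thesis and strong evidence",
                        "organization could improve; add transitions"]},
}

def lexical_diversity(comments):
    """Type-token ratio over a rater's pooled comment text."""
    tokens = " ".join(comments).lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def prune_raters(raters, min_variance=0.5, min_ttr=0.5):
    """Keep only raters whose score variance and comment lexical
    diversity both meet the (illustrative) thresholds."""
    return {
        rid: data for rid, data in raters.items()
        if pvariance(data["scores"]) >= min_variance
        and lexical_diversity(data["comments"]) >= min_ttr
    }
```

Essays scored only by the retained raters would then form the training subset for a downstream scoring model.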
This paper seeks to address these problems by providing a method
to increase the reliability of peer-assigned scores. We examine a
corpus of 27,909 essays produced as a capstone project to a MOOC
on design principles hosted by Coursera. We generate subsets of the
essays by pruning raters suspected of providing unreliable scores
based on the reviewer score variance and the lexical diversity of
their comments. We use the correlations between score and word
count as a measure of criterion validity, knowing that longer essays
receive higher scores [13, 32]. We then develop large language
models (LLMs) for each subset to predict the essay scores from the
pruned subset of raters and test the accuracy of the models by