Semantic Approach for Increasing Test Case
Coverage in Automated Grading of Programming
Exercise
M. Rifky I. Bariansyah
School of Electrical Engineering and
Informatics, Institut Teknologi Bandung
Bandung, Indonesia
Email: 13517081@std.stei.itb.ac.id
Satrio Adi Rukmono
School of Electrical Engineering and
Informatics, Institut Teknologi Bandung
Bandung, Indonesia
Email: sar@itb.ac.id
Riza Satria Perdana
School of Electrical Engineering and
Informatics, Institut Teknologi Bandung
Bandung, Indonesia
Email: riza@informatika.org
Abstract—The widely popular approach to automatic grading
in computer science is to run black-box testing against the
student’s implementation. This kind of autograder evaluates
programs solely based on their outputs for a given set of inputs.
However, manually writing a set of test cases with high coverage
is laborious and inefficient. Hence, we explore an alternative
approach to building test cases, namely white-box testing.
In theory, by knowing the internal workings of an implementation,
we can evaluate all possible execution paths, producing better
test case coverage and, ultimately, more complete grading. In
this paper, we present research on using semantic analysis to
generate test cases that determine the correctness of a student’s
implementation. Instead of writing test cases, the evaluator
writes a reference code, a correct implementation based
on the programming problem specification. We implement a
system that records execution paths, detects path deviation, and
checks path equivalence to analyze the semantic difference between
the reference code and the student’s implementation. The system
is built on a concolic execution method for exploration
and an SMT solver for solving path formulas. Our experiments reveal
that it is possible to automatically generate test cases and grade
programming assignments by analyzing the semantic difference
between the reference and student implementations. Compared with
grading using a random test case generator, the system
provides better test case coverage for automatic
grading in many cases.
Index Terms—automatic grading, test case generation, symbolic
execution
I. INTRODUCTION
In computer science, programming exercises are used by
students as a medium to put theoretical knowledge into practice
in a program. Students rely on programming assignment
grades as a study guide and as feedback on their progress.
However, manually grading programming assignments is time-consuming
and infeasible for a large class: the more students
in a class, the higher the likelihood of grading errors. This
problem has driven research efforts on automatic grading.
With automatic grading, students receive feedback quickly,
which increases the opportunity to rework an incorrect
implementation. The majority of automatic grading systems use the
black-box testing approach [1]. In this approach, the instructor
or evaluator writes a set of test cases for the programming
problem, and the correctness of a student’s implementation is
then determined using this set of test cases. However,
writing a complete set of test cases, covering most if not all
edge cases, requires considerable effort. This risks
grading with an incomplete set of test cases, producing grades
that do not reflect a student’s abilities well.
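To make the black-box setting concrete, consider the following minimal sketch. The exercise (`is_even`), the submission, and the grading helper are hypothetical and not part of our system; they only illustrate how an evaluator-written test suite judges a program purely by its outputs.

```python
# Hypothetical black-box grading sketch: the evaluator hand-writes
# input/expected-output pairs; the submission is judged only by its
# outputs, never by its internal structure.

def student_is_even(n):
    # A student submission for an "is the number even?" exercise.
    return n % 2 == 0

# Manually written test cases: (input, expected output) pairs.
TEST_CASES = [(0, True), (1, False), (2, True), (-3, False)]

def grade(submission, test_cases):
    # Fraction of test cases on which the submission's output matches.
    passed = sum(1 for x, expected in test_cases if submission(x) == expected)
    return passed / len(test_cases)

print(grade(student_is_even, TEST_CASES))  # 1.0 for this correct submission
```

The weakness motivating this paper is visible here: the grade is only as trustworthy as the hand-written `TEST_CASES`, and edge cases the evaluator did not think of go untested.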
This problem calls for a different approach to writing test
cases for programming exercises. This paper explores the
potential of a white-box testing technique, specifically
semantic difference analysis, for generating test cases with
better coverage. We present PyAssesment, a reference
implementation of an automated grading system based on concolic
execution for Python programming assignments. PyAssesment
receives a reference code, i.e., a solution from the evaluator,
and a student implementation as inputs. The system observes
the semantic difference between the two implementations to
generate a set of test cases that determine the correctness of
the student implementation.
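The core idea can be illustrated with a simplified differential-grading sketch. All names below are hypothetical, and random input sampling stands in for the concolic path exploration and SMT solving that the actual system uses; the point is only that inputs on which the reference and student implementations disagree become failing test cases automatically, without the evaluator writing any.

```python
import random

def reference_abs(n):
    # Evaluator's reference solution: absolute value.
    return -n if n < 0 else n

def student_abs(n):
    # Buggy student submission: mishandles negative inputs.
    return n

def find_counterexamples(reference, student, trials=1000, seed=0):
    # Sample inputs and keep those that distinguish the two
    # implementations. (PyAssesment instead derives such inputs by
    # exploring execution paths concolically and querying an SMT
    # solver; random sampling here is only a stand-in.)
    rng = random.Random(seed)
    cases = set()
    for _ in range(trials):
        x = rng.randint(-100, 100)
        if reference(x) != student(x):
            cases.add(x)  # a generated test case the student fails
    return sorted(cases)

counterexamples = find_counterexamples(reference_abs, student_abs)
print(len(counterexamples) > 0)  # True: the buggy submission is caught
```

Unlike this random stand-in, a path-directed approach can also reach disagreements guarded by narrow conditions (e.g., a single magic value) that sampling would almost never hit, which is the coverage advantage the paper pursues.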
This paper is structured as follows. We first discuss the
foundational basis of our work in Section II. Then, we explain
our approach to generating test cases in Section III. Next, we
present the results of our experiments in Section IV and discuss
the key insights in Section V. Finally, we conclude and suggest
directions for further research in Section VI.
II. FOUNDATIONAL BASIS
A. Automatic Grading
An automatic grading system is used for grading programming
assignments in scientific computing [2]. It is built
to increase the speed and capacity of evaluating students’
submissions. A study shows that automatic grading in an
introductory computing course positively impacts students’
learning process as a feedback mechanism: it increases the number
of resubmissions, which indicates that students use the feedback
to correct their implementations. In general, there are two
approaches to automatic grading systems: black-box and white-box
testing.
Black-box testing, or functional testing, utilizes test cases
written based on the program’s specifications. This kind of
2021 International Conference on Data and Software Engineering (ICoDSE)
978-1-6654-9453-3/21/.00 ©2021 IEEE