Semantic Approach for Increasing Test Case Coverage in Automated Grading of Programming Exercise

M. Rifky I. Bariansyah
School of Electrical Engineering and Informatics, Institut Teknologi Bandung
Bandung, Indonesia
Email: 13517081@std.stei.itb.ac.id

Satrio Adi Rukmono
School of Electrical Engineering and Informatics, Institut Teknologi Bandung
Bandung, Indonesia
Email: sar@itb.ac.id

Riza Satria Perdana
School of Electrical Engineering and Informatics, Institut Teknologi Bandung
Bandung, Indonesia
Email: riza@informatika.org

Abstract—The widely popular approach to automatic grading in computer science is to run black-box testing against the student's implementation. This kind of autograder evaluates programs solely based on their outputs given a set of inputs. However, manually writing a set of test cases with high coverage is laborious and inefficient. Hence, we explore an alternative approach to building test cases, specifically white-box testing. In theory, by knowing the internal workings of an implementation, we can evaluate all possible execution paths, producing better test case coverage and, ultimately, a more complete grading. In this paper, we present research on using semantic analysis to generate test cases that determine the correctness of a student's implementation. Instead of writing test cases, the evaluator writes a reference code: a correct implementation based on the programming problem specification. We implement a system that records execution paths, detects path deviations, and checks path equivalence to analyze the semantic difference between the reference code and the student's implementation. The system is built on a concolic execution method for exploration and an SMT solver for solving path formulas. Our experiments reveal that it is possible to automatically generate test cases and grade programming assignments by analyzing the semantic difference between the reference and student implementations.
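To make the limitation of manually written test sets concrete, the following minimal sketch shows black-box grading with an incomplete test set. The problem (compute |x|), the test cases, and the buggy submission are hypothetical, chosen only to illustrate how a defect can go undetected when no test case exercises the faulty path.

```python
# Hypothetical black-box grading sketch. The problem (compute |x|),
# the test set, and the buggy submission are illustrative only.

def student(x):
    # buggy submission: forgets to negate negative inputs
    return x

# An incomplete, manually written test set: no negative inputs.
test_cases = [(0, 0), (5, 5), (42, 42)]

def grade(impl, cases):
    """Fraction of test cases the implementation passes."""
    passed = sum(1 for inp, expected in cases if impl(inp) == expected)
    return passed / len(cases)

print(grade(student, test_cases))  # 1.0 -- the bug goes undetected
```

Because the negative branch is never exercised, the buggy submission receives full marks; higher-coverage test generation aims to close exactly this gap.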
Compared with grading using a random test case generator, the system provides better test case coverage for automatic grading in many occurrences.

Index Terms—automatic grading, test case generation, symbolic execution

I. INTRODUCTION

In computer science, programming exercises are used by students as a medium to apply theoretical knowledge in a program. Students rely on programming assignment grades as a study guide and as feedback on their progress. However, manually grading programming assignments is time-consuming and not feasible for a large class: the more students in a class, the higher the possibility of grading errors. This problem has pushed research efforts on automatic grading. With automatic grading, students receive feedback quickly, which increases the possibility of reworking an incorrect implementation.

The majority of automatic grading systems use the black-box testing approach [1]. In this approach, the instructor or evaluator writes a set of test cases for the programming problem. The correctness of a student's implementation is then determined using this set of test cases. However, writing a complete set of test cases, covering most if not all edge cases, requires a high amount of effort. This issue risks grading with an incomplete set of test cases, producing grades that do not reflect a student's abilities well.

This problem calls for a different approach to writing test cases for programming exercises. This paper explores the potential of utilising a white-box testing technique, specifically semantic difference analysis, for generating test cases with better coverage. We present PyAssesment, a reference implementation of an automated grading system based on concolic execution for Python programming assignments. PyAssesment receives a reference code, i.e., a solution from the evaluator, and a student implementation as inputs.
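The core idea of comparing a reference code against a student implementation can be sketched as follows. PyAssesment itself uses concolic execution and an SMT solver to find inputs on which the two implementations diverge; this conceptual sketch replaces the solver with a brute-force search over a small integer domain. The problem and the buggy submission are hypothetical.

```python
# Conceptual sketch of semantic-difference test generation. The real
# system explores paths with concolic execution and an SMT solver;
# here a brute-force search over a small domain stands in for the solver.

def reference(x):
    # evaluator's correct solution (hypothetical problem: absolute value)
    return x if x >= 0 else -x

def student(x):
    # hypothetical buggy submission: misses the negative branch
    return x

def find_distinguishing_input(ref, sub, domain):
    """Return an input on which the implementations disagree, else None."""
    for value in domain:
        if ref(value) != sub(value):
            return value
    return None

counterexample = find_distinguishing_input(reference, student, range(-10, 11))
print(counterexample)  # -10: a generated test case that exposes the bug
```

A distinguishing input becomes a generated test case; if no such input exists over all feasible paths, the two implementations are judged semantically equivalent.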
The system observes the semantic difference between the two implementations to generate a set of test cases that determine the correctness of the student implementation.

This paper is structured as follows. We first discuss the foundational basis of our work in Section II. Then, we explain our approach to generating test cases in Section III. Next, we present the results of our experiments in Section IV and discuss the key insights in Section V. Finally, we conclude and suggest further research directions in Section VI.

II. FOUNDATIONAL BASIS

A. Automatic Grading

An automatic grading system is used for grading programming assignments in scientific computing [2]. It is built to increase the speed and capacity for evaluating students' submissions. A study shows that automatic grading in an introductory computing course positively impacts students' learning process as a feedback mechanism: it increases the number of resubmissions, which indicates that students use the feedback to correct their implementations. In general, there are two approaches for automatic grading systems: black-box and white-box testing.

Black-box testing, or functional testing, utilises test cases written based on the program's specifications. This kind of

2021 International Conference on Data and Software Engineering (ICoDSE) 978-1-6654-9453-3/21/.00 ©2021 IEEE