Original Research

Journal of Education in Perioperative Medicine: Vol. XXI, Issue 1

Quality Control for Residency Applicant Scores

Jed Wolpaw, MD, MEd; Gillian Isaac, MD, PhD; Tina Tran, MD; Mike Banks, MD; Steven Beaudry, DO; Priyanka Dwivedi, MA; Serkan Toy, PhD

Introduction

Most residency programs in the United States use a candidate selection process. The intended utility of this process is to ensure selection of the most qualified candidates who are the best fit and therefore most likely to succeed. This process involves a diverse group of faculty judges/interviewers making inferences based on, typically, a review of application materials, interviews, and faculty group discussions. The quality of the resulting rank list depends on measurement precision and accuracy. A precise measurement model produces linear and reproducible measures, and accurate measurement allows for targeting the actual candidate’s ability, free from confounders such as different faculty interviewers on different days.1 Indeed, interrater reliability is low in interview scoring.2,3 We were unable to identify any commonly used quality control methods for evaluating the scores from interviewers before a rank list is made. And yet, the potential for poor-quality data is real. A study found that faculty interviewing candidates for medical school differed significantly in their degree of stringency or leniency.4 Using different interviewers on different days has the potential to inflate or deflate scores for candidates on one day compared to candidates on another day. Many-facet Rasch measurement (MFRM) is a family of measurement models that allows for establishing a quality control system for rater-mediated assessment that can identify these outlier scores. This model has proven useful for quality control in undergraduate medical education admissions.
4-6 We hypothesized that, using an MFRM model, we could establish a quality control system to identify noise in our score data and address potential sources of measurement error to produce fair averages for each candidate.

Methods

This is an observational study that took place at a large academic medical center from October 2017 to January 2018. The local institutional review board deemed this to be a quality improvement project. The department in which this study took place interviews 160 candidates each year for 25 available spots. All interviews are conducted by 4 faculty members who interview 8 candidates per day. Two faculty members—the program director and associate program director—interview all 160 candidates. The third interviewer spot is filled by 1 of 3 assistant program directors. The fourth interviewer is a faculty member who signs up to interview. Because the fourth interviewer was almost always a different faculty member (17 different faculty members filled this spot over the course of 20 interview days) and thus gave only 1 set of scores, we had to exclude those scores from our analysis in order to establish connectivity in the dataset. Interviewers are given a description of the scoring scale, which ranges from 1 to 100. The scale has been used by our department for many years and defines a score in the 90s as someone who could be a chief resident, a score in the 80s as someone we would be happy to have, a score in the 70s as someone who would probably do fine, a score in the 60s as someone we might put at the bottom of our rank list, and a score below 60 as someone we would consider not ranking at all. This scale is sent to each interviewer along with the applications to read. Interviewers gave 3 sets of scores, which were entered into Qualtrics (Provo, Utah). The first score was given after reading the application but before the interview.
The second score was given after the interview but before the group discussion in which all 4 interviewers discussed the candidates. The final score was given after the discussion.

Data Analysis

Using MFRM, one can examine multiple variables (facets) that might be potential sources of variance for the outcome variable.7 For example, in addition to examinees’ ability (candidate fitness for the residency program), the testing situation (or scoring occasion, ie, review of application documents vs interview) or rater leniency/severity could also influence the scores candidates receive. This psychometric approach produces standardized indices for determining the degree to which the data fit the expectations predicted by the model. Expected fair averages for candidate qualification/fit are calculated based on the observed values adjusted for rater leniency/severity and/or task difficulty (scoring occasion). The difference between the observed and expected values, called standardized residuals, indicates the quality of the data and the accuracy of the measurement.1 There are 2 types of mean-square (MnSq) fit indices, outfit and infit, that help flag the values presenting misfit. Outfit MnSq
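As an illustration of how outfit and infit MnSq statistics flag misfitting scores, the sketch below computes them for a toy rating-scale Rasch setup. This is not the study’s actual analysis (MFRM studies typically use dedicated software such as Facets, and the function names, abilities, severities, and thresholds here are hypothetical); it only shows the arithmetic: outfit is the unweighted mean of squared standardized residuals, so it is sensitive to outliers, while infit weights each squared residual by its model variance.

```python
import numpy as np

def category_probs(theta, severity, thresholds):
    """Rating-scale Rasch model: probabilities of categories 0..K for one
    candidate-rater pair, where thresholds[k] is the step difficulty F_k."""
    steps = theta - severity - np.asarray(thresholds)
    logits = np.concatenate(([0.0], np.cumsum(steps)))  # log-numerators per category
    p = np.exp(logits - logits.max())                   # subtract max for stability
    return p / p.sum()

def fit_statistics(scores, abilities, severities, thresholds):
    """Outfit and infit MnSq per rater.
    scores[n, j] = observed category given to candidate n by rater j."""
    n_cand, n_raters = scores.shape
    cats = np.arange(len(thresholds) + 1)
    outfit = np.zeros(n_raters)
    infit = np.zeros(n_raters)
    for j in range(n_raters):
        z2, sq_res, var = [], [], []
        for n in range(n_cand):
            p = category_probs(abilities[n], severities[j], thresholds)
            expected = (cats * p).sum()                    # model-expected score
            variance = ((cats - expected) ** 2 * p).sum()  # model variance
            resid = scores[n, j] - expected
            z2.append(resid ** 2 / variance)   # squared standardized residual
            sq_res.append(resid ** 2)
            var.append(variance)
        outfit[j] = np.mean(z2)                  # unweighted: outlier-sensitive
        infit[j] = np.sum(sq_res) / np.sum(var)  # information-weighted
    return outfit, infit
```

An outfit or infit MnSq near 1 indicates scores about as noisy as the model expects; values well above 1 flag a rater (or, with the facets transposed, a candidate or scoring occasion) whose scores misfit and deserve review before the rank list is finalized.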