Simpson’s paradox: how performance measurement can fail even with perfect risk adjustment

Perla J Marang-van de Mheen, 1 Kaveh G Shojania 2

1 Department of Medical Decision Making, Leiden University Medical Centre, Leiden, The Netherlands
2 University of Toronto Centre for Quality Improvement and Patient Safety, Sunnybrook Health Sciences Centre, Toronto, Canada

Correspondence to Dr Perla J Marang-van de Mheen, Department of Medical Decision Making, Leiden University Medical Centre, PO Box 9600, Leiden 2300 RC, The Netherlands; p.j.marang@lumc.nl

Accepted 2 July 2014

▸ http://dx.doi.org/10.1136/bmjqs-2013-002608

To cite: Marang-van de Mheen PJ, Shojania KG. BMJ Qual Saf 2014;23:701–705.

Efforts to measure quality using patient outcomes, whether hospital mortality rates or major complication rates for individual surgeons, often become mired in debates over the adequacy of adjustment for case-mix. Some hospitals take care of sicker patients than other hospitals. Some surgeons operate on patients whom other surgeons feel exceed their skill levels. We do not want to penalise hospitals or doctors who accept referrals for more complex patients. Yet, we also do not want to miss opportunities for improvement. Maybe a particular hospital that cares for sicker patients achieves worse outcomes than other hospitals with similar patient populations.

This debate over the adequacy of case-mix adjustment dates back to Florence Nightingale’s publication of league tables for mortality in 19th century English hospitals. 1 We have made some progress. Some successes have involved supplementing the diagnostic codes and demographic information available in administrative data with a few key clinical variables.
2 3 Particularly notable successes rely entirely on clinical variables collected for the sole purpose of predicting risk. Examples include the prognostic scoring systems for critically ill patients, such as the Acute Physiology and Chronic Health Evaluation and the Simplified Acute Physiology Score, 4–6 and the National Surgical Quality Improvement Program. 7 (Occasionally, research shows that an outcome measure does not require adjustment for case-mix. 8)

But what if comparing mortality rates (or other key patient outcomes) were problematic even with perfect case-mix adjustment? For example, suppose a 75-year-old man undergoing cardiac surgery has diabetes, mild kidney failure and a previous stroke, and a 65-year-old woman has hypertension but no previous strokes or kidney problems. Suppose the case-mix adjustment model assigns a risk of death or major complications after surgery of 8% to the 75-year-old man and only 4% to the 65-year-old woman. And, let’s say that over time, we see that patients who share the characteristics of the 75-year-old man experience bad outcomes 8% of the time, whereas patients who resemble the 65-year-old woman experience the lower complication rate of 4%. And, let’s even add that the model works this well (ie, perfectly) for every type of patient.

Having a model like this would seem to put to rest all the debates over the fairness of outcome-based performance measures. Disturbingly, it does not, as first pointed out by Simpson and Yule over 50 years ago. 9 10

SIMPSON’S PARADOX

Simpson’s paradox (also known as the Yule–Simpson effect) 9 10 refers to an association or effect that is found within multiple subgroups but is reversed when data from these groups are aggregated. One non-technical exposition used batting averages of two prominent professional baseball players as an example (table 1).
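The reversal in this baseball example can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the hits and at-bats commonly quoted for David Justice and Derek Jeter in 1995 and 1996, which are assumed here to match table 1, and the names `records`, `batting_average` and `combined_average` are hypothetical helpers introduced for the illustration.

```python
from fractions import Fraction

# (hits, at-bats) per season; commonly quoted figures for this example,
# assumed (not confirmed here) to be the same values as in table 1.
records = {
    "Jeter": {1995: (12, 48), 1996: (183, 582)},
    "Justice": {1995: (104, 411), 1996: (45, 140)},
}


def batting_average(hits, at_bats):
    """Batting average = hits divided by at-bats, kept exact as a fraction."""
    return Fraction(hits, at_bats)


def combined_average(seasons):
    """Pool hits and at-bats over all seasons before dividing."""
    total_hits = sum(hits for hits, _ in seasons.values())
    total_at_bats = sum(at_bats for _, at_bats in seasons.values())
    return batting_average(total_hits, total_at_bats)


# Within each year, Justice has the higher average.
for year in (1995, 1996):
    jeter = batting_average(*records["Jeter"][year])
    justice = batting_average(*records["Justice"][year])
    print(f"{year}: Jeter {float(jeter):.3f}, Justice {float(justice):.3f}")

# Pooled over both years, the ranking reverses in favour of Jeter.
jeter_all = combined_average(records["Jeter"])
justice_all = combined_average(records["Justice"])
print(f"combined: Jeter {float(jeter_all):.3f}, Justice {float(justice_all):.3f}")
```

The pooled average is a weighted mean of the yearly averages, with at-bats as the weights. Under the assumed figures, most of Jeter's at-bats fall in his better year and most of Justice's fall in his worse year, which is enough to flip the ranking.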
11 The batting average represents the number of hits divided by the number of ‘at-bats’ (the number of opportunities the player had to hit the ball). In both 1995 and 1996, David Justice had a higher (better) batting average than Derek Jeter. However, aggregating the two seasons reverses their ranking, with Jeter having the higher batting average in the 2 years combined. This reversal results from the large difference in the number of at-bats between the years: Jeter’s combined average was determined mostly by his 1996 average (which was better than his 1995 average), whereas the opposite was true for Justice.

Marang-van de Mheen PJ, et al. BMJ Qual Saf 2014;23:701–705. doi:10.1136/bmjqs-2014-003358

Ross