Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.

EDITORIAL REVIEW

Methodologies for assessing agreement between two methods of clinical measurement: are we as good as we think we are?

Maurizio Cecconi, Michael Grounds and Andrew Rhodes

Department of Intensive Care, St George’s Hospital, London, UK

Correspondence to Dr Andrew Rhodes, Department of Intensive Care, St George’s Hospital, London, SW17 0QT, UK
Tel: +44 208 725 0884; fax: +44 208 725 0879; e-mail: andyr@sgul.ac.uk

Current Opinion in Critical Care 2007, 13:294–296

Abbreviation
ITD intermittent thermodilution

© 2007 Lippincott Williams & Wilkins 1070-5295

The assessment of cardiac output in haemodynamically unstable patients is considered useful by many practising intensivists. Knowledge of this variable is unlikely to influence clinical outcomes, however, unless it leads to a beneficial change in therapy. Many authors have tried to delineate exactly how cardiac output should be used in this group of patients to direct management strategies. Consensus has not been easy to achieve, and the issue remains contentious.

For any clinical protocol to have a chance of success, the technology used to measure the variable has to be robust, accurate and precise. If there is no confidence in the monitored variable, it is unlikely that the clinical protocol will be widely accepted into practice. Over the last few years, our industrial partners have marketed a number of new devices that attempt to measure and monitor cardiac output. These devices have been accompanied by a series of published studies assessing their accuracy. Unfortunately, our ability to design validation studies and analyse the accrued data is hampered by the lack of a good ‘gold standard’ reference. This limits our ability to understand the new data and subsequently assess the efficacy of the new technologies.
Although the pulmonary artery catheter is considered by many to be yesterday’s technology [1], it remains the most widely accepted method of measuring cardiac output at the bedside. Properly performed averaged triplicate or quadruplicate intermittent thermodilution (ITD) is the nearest we have to a ‘gold standard’ reference technique [2,3]. Nearly all validation studies for new devices therefore initially attempt to validate the new technology against ITD. The problem with this is that ITD is not in itself completely accurate; it has its own level of imprecision. The studies are therefore comparing two devices that measure the same variable, each with its own level of inaccuracy. This makes understanding and interpreting the results very difficult. Bland and Altman [4–6] proposed a method of graphically assessing these data, and this analysis is now widely used in these studies.

Bland and Altman [4–6] first suggested their statistical method for assessing agreement between two methods of clinical measurement of the same physiological variable in The Lancet in 1986. Their contention was that when measuring a physiological variable (such as cardiac output), we are often unable to measure that variable directly (because it is either too dangerous or too invasive), so we are forced to measure it using a system that is often not completely accurate. Thus, when we compare measurements of the same variable made by different technological methods, it is very important to know whether the newer method agrees sufficiently with the older (usually the established reference method) to be introduced into clinical practice. Bearing in mind that neither method can be absolutely guaranteed to provide unequivocally correct measurements, it becomes necessary to provide an assessment of the degree of agreement between the two methods.
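Agreement of this kind is usually summarised from the paired differences between the two methods: their mean (the bias), their standard deviation, and the range bias ± 1.96 SD (the 95% limits of agreement). The following is a minimal sketch of that arithmetic, not a prescribed implementation. The function name `bland_altman` and the paired readings are hypothetical and purely illustrative; the sketch also assumes one independent pair of readings per patient.

```python
from statistics import mean, stdev


def bland_altman(reference, test):
    """Return (bias, sd, (lower, upper)) for paired measurements.

    bias  : mean of the differences (test - reference)
    sd    : sample standard deviation of those differences
    limits: bias -/+ 1.96 * sd, the 95% limits of agreement
    """
    if len(reference) != len(test) or len(reference) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    diffs = [t - r for r, t in zip(reference, test)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, sd, (bias - 1.96 * sd, bias + 1.96 * sd)


# Hypothetical paired cardiac output readings in l/min (not real data).
itd = [4.0, 5.0, 6.0, 5.5, 4.5]          # reference method
new_device = [4.5, 4.5, 6.5, 5.0, 5.0]   # method under evaluation

bias, sd, (lower, upper) = bland_altman(itd, new_device)
print(f"bias = {bias:.2f} l/min, SD = {sd:.2f} l/min")
print(f"95% limits of agreement: {lower:.2f} to {upper:.2f} l/min")
```

To draw the plot itself, each pair would be placed at the mean of its two readings on the horizontal axis and at their difference on the vertical axis, with horizontal lines at the bias and at the two limits.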
Bland and Altman’s [4–6] seminal work suggested that the most appropriate way to assess the degree of agreement between the two methods was graphically, by the use of the ‘Bland-Altman plot’. This provides three main pieces of information: the bias (the average of all the differences); the standard deviation of the differences around the bias (SD); and the limits of agreement [the limits within which 95% of the differences are expected to fall, that is, the bias ± 1.96 (approximately 2) SD]. In many of these validation studies, it has been common to find that the bias between the two comparators was small, suggesting close agreement on average, but that the limits of agreement were wide, suggesting that the new device may not be accurate enough to replace the older technique. The problem has been in assessing how wide or narrow these limits of agreement need to be for the new technique to be considered acceptable for clinical practice [7,8]. Recently published studies [9–19] have shown marked disparity in what the authors conclude from their results, with some authors concluding that limits of 2 l/min are appropriate and others concluding the opposite. It is rare for authors of these studies to describe up front in the methods section of