Copyright © Lippincott Williams & Wilkins. Unauthorized reproduction of this article is prohibited.
EDITORIAL REVIEW
Methodologies for assessing agreement between two methods
of clinical measurement: are we as good as we think we are?
Maurizio Cecconi, Michael Grounds and Andrew Rhodes
Department of Intensive Care, St George’s Hospital, London, UK
Correspondence to Dr Andrew Rhodes, Department of Intensive Care, St George’s
Hospital, London, SW17 0QT, UK
Tel: +44 208 725 0884; fax: +44 208 725 0879; e-mail: andyr@sgul.ac.uk
Current Opinion in Critical Care 2007, 13:294–296
Abbreviation
ITD intermittent thermodilution
© 2007 Lippincott Williams & Wilkins
1070-5295
The assessment of cardiac output in haemodynamically
unstable patients is considered to be useful by many
practising intensivists. Knowledge of this variable is
unlikely to influence clinical outcomes, however, unless
it leads to a beneficial change in therapy. Many authors
have tried to delineate exactly how cardiac output should
be used in this group of patients in order to direct
management strategies. Consensus has not been easy
to achieve, and the issue remains contentious.
In order for any clinical protocol to have any chance of
success, the technology used to measure the variable has
to be robust, accurate and precise. If there is no
confidence in the monitored variable, then it is unlikely
that the clinical protocol will be widely accepted into
practice. Over the last few years, a number of new devices
that attempt to measure and monitor cardiac output have
been marketed by our industrial partners. These devices
have been accompanied by a series of published studies
assessing their accuracy. Unfortunately, our ability to design
validation studies and analyse the accrued data is hampered
by the lack of a good ‘gold standard’ reference.
This limits our ability to understand the new data and
subsequently assess the efficacy of the new technologies.
Although the pulmonary artery catheter is considered by
many to be yesterday’s technology [1], it remains the most
widely accepted method of measuring cardiac output at
the bedside. Properly performed averaged triplicate or
quadruplicate intermittent thermodilution (ITD) is the
nearest we have to a ‘gold standard’ reference technique
[2,3]. Nearly all validation studies for new devices
therefore initially attempt to validate the new technology
against ITD. The problem with this is that ITD is not
in itself completely accurate – it has its own level of
imprecision. The studies are therefore comparing two
devices that measure the same variable but each with their
own level of inaccuracy. This makes understanding and
interpreting the results very difficult. Bland and Altman
[4–6] proposed a method of graphically assessing these data,
and this analysis is now widely used in these studies.
Bland and Altman [4–6] first suggested their statistical
method for assessing agreement between two methods of
clinical measurement of the same physiological variable
in The Lancet in 1986. Their contention was that when
measuring a physiological variable (such as cardiac
output), we are often unable to measure that variable
directly (because it is either too dangerous or too
invasive) so we are forced to measure it using a system
that is often not completely accurate. Thus, when we
compare measurements of the same variable made by
different technological methods, it is
important to know whether the newer method agrees
sufficiently with the older (usually the established
reference method) to be introduced
into clinical practice. Bearing in mind that neither
method can be guaranteed to provide
unequivocally correct measurements, it becomes necessary
to provide an assessment of the degree of agreement
between the two methods. Bland and Altman’s [4–6]
seminal work suggested that the most appropriate way to
assess the degree of agreement between the two methods
was graphically by the use of the ‘Bland Altman Plot’.
This provided three main pieces of information: the bias
(the mean of all the differences between paired
measurements); the standard deviation of the differences
around the bias (SD); and the limits of agreement [the
limits within which 95% of the differences are expected to
fall, lying 1.96 (approximately 2) SD on either side of the bias].
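As an illustration only, the three quantities described above can be computed from paired readings as in the following minimal sketch. The function name bland_altman, the sign convention (new method minus reference) and the four paired cardiac output values are assumptions made for this example; they are not data from any of the cited studies.

```python
import statistics


def bland_altman(reference, new_method):
    """Return (bias, sd, lower_limit, upper_limit) for paired measurements.

    Assumption for this sketch: each difference is new_method - reference.
    """
    if len(reference) != len(new_method):
        raise ValueError("measurements must be paired")
    if len(reference) < 2:
        raise ValueError("at least two pairs are needed to estimate the SD")

    # Difference between the two methods for each paired measurement.
    differences = [n - r for r, n in zip(reference, new_method)]

    # Bias: the mean of all the differences.
    bias = statistics.mean(differences)

    # Sample standard deviation of the differences around the bias.
    sd = statistics.stdev(differences)

    # Limits of agreement: 1.96 SD on either side of the bias.
    return bias, sd, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired cardiac output readings in l/min (illustrative only).
itd_readings = [4.0, 5.0, 6.0, 5.0]
new_device_readings = [4.5, 4.5, 6.5, 5.5]

bias, sd, lower, upper = bland_altman(itd_readings, new_device_readings)
print(f"bias = {bias:.2f} l/min, SD = {sd:.2f} l/min")
print(f"limits of agreement = {lower:.2f} to {upper:.2f} l/min")
# bias = 0.25 l/min, SD = 0.50 l/min
# limits of agreement = -0.73 to 1.23 l/min
```

In this made-up example the bias is small (0.25 l/min), yet the limits of agreement span almost 2 l/min, which is the pattern discussed in the text that follows.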
In many of these validation studies, it has been common to
find that the bias between the two comparators showed
very close agreement, but that the limits of agreement
were wide, suggesting that the new device may not
be accurate enough to replace the older technique. The
problem has been in assessing how wide or narrow these
limits of agreement need to be for the new technique to
be considered acceptable for clinical practice [7,8]. Recently
published studies [9–19] show marked disparity in the
conclusions that authors draw from their results, with
some authors concluding that limits of agreement of 2 l/min are
appropriate and others concluding the opposite. It is rare for authors of
these studies to describe up front in the methods section of