J Clin Epidemiol Vol. 46, No. 9, pp. 1055-1062, 1993
Printed in Great Britain. All rights reserved.
0895-4356/93 $6.00 + 0.00
Copyright © 1993 Pergamon Press Ltd

THE ANALYSIS OF ORDINAL AGREEMENT DATA: BEYOND WEIGHTED KAPPA

PATRICK GRAHAM¹ and RODNEY JACKSON²

¹Department of Community Health and General Practice, Christchurch School of Medicine, P.O. Box 4345, Christchurch, New Zealand and ²Department of Community Health, University of Auckland, Auckland, New Zealand

(Received 3 December 1992; received for publication 14 April 1993)

Abstract-The weighted kappa statistic has been used as an agreement index for ordinal data. Using data on the comparability of primary and proxy respondent reports of alcohol drinking frequency, we show that the value of weighted kappa can be sensitive to the choice of weights. The distinction between association and agreement is clarified, and it is shown that in some respects weighted kappa behaves more like a measure of association than an index of agreement. In particular, it is demonstrated that the weighted kappa statistic is not always sensitive to differences in the observed proportion in exact agreement and that high values of weighted kappa can be observed even when the level of agreement is low. We illustrate the use of statistical models in the analysis of epidemiologic agreement data and conclude that modelling ordinal agreement data produces insights which cannot be obtained through the use of weighted kappa statistics.

Keywords: Kappa; Agreement; Epidemiologic methods

INTRODUCTION

Studies of the reliability of epidemiologic survey instruments and of observer reliability usually involve the analysis of agreement amongst paired measurements. In clinical research the question of diagnostic agreement has also received considerable attention [1]. In both of these situations the study of agreement is the main issue, and we follow Becker in describing such studies as agreement studies [2].
The natural representation of categorical agreement data is a two-way table such as Table 1, which is a cross-classification of primary respondent and proxy reports of alcohol drinking frequency. The data reported in Table 1 were drawn from the control series of the Auckland Heart Study, a community-based case-control study of coronary heart disease. In this study, a randomly selected sub-sample of the non-fatal myocardial infarction controls (primary respondents) were asked if their next of kin (proxy respondents) could also be interviewed about them (the myocardial infarction controls) [3].

In this paper we focus on methods for analysing agreement on the classification of individuals rather than on agreement between the marginal distributions. A natural measure of the degree of individual-level agreement is the probability that a random selection from a set of paired measurements yields a pair in exact agreement. In the analysis of categorical agreement data it has become customary to use the kappa statistic [4], which discounts the observed proportion of all pairs in exact agreement by the proportion expected by chance. The proportion of pairs in agreement expected by chance is the proportion expected if the two measurements or reports are, in fact, made independently of one another. The kappa statistic is therefore often referred to as a measure of chance-corrected agreement, or agreement beyond chance. When the data to be analysed are measured on an ordered categorical scale, the weighted
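The chance-correction just described, and the sensitivity of the weighted statistic to the choice of weights, can be made concrete with a short computational sketch. The function below computes kappa and weighted kappa from a k x k contingency table using the standard linear and quadratic agreement-weight schemes; the illustrative table is invented for this sketch and is not the paper's Table 1.

```python
# Sketch: kappa and weighted kappa from a k x k agreement table.
# Agreement weights w_ij = 1 - (|i - j| / (k - 1))**p, with p = 1 (linear)
# or p = 2 (quadratic); identity weights recover ordinary (unweighted) kappa.
import numpy as np

def weighted_kappa(table, weights="linear"):
    table = np.asarray(table, dtype=float)
    k = table.shape[0]
    p = table / table.sum()                    # joint proportions p_ij
    row, col = p.sum(axis=1), p.sum(axis=0)    # marginal proportions
    expected = np.outer(row, col)              # proportions under independence
    i, j = np.indices((k, k))
    if weights == "identity":
        w = (i == j).astype(float)             # credit only exact agreement
    else:
        power = 1 if weights == "linear" else 2
        w = 1.0 - (np.abs(i - j) / (k - 1)) ** power
    po = (w * p).sum()                         # weighted observed agreement
    pe = (w * expected).sum()                  # weighted chance agreement
    return (po - pe) / (1 - pe)

# Illustrative (fabricated) 3 x 3 table, categories ordered low to high.
table = [[10, 2, 0],
         [2, 10, 2],
         [0, 2, 10]]
for scheme in ("identity", "linear", "quadratic"):
    print(scheme, round(weighted_kappa(table, scheme), 3))
```

Running the loop on a single table under the three weight schemes gives three different values of kappa, which is the sensitivity to the choice of weights that the paper examines: near-diagonal disagreements receive more partial credit as the weights flatten, so the statistic can rise without any change in the proportion in exact agreement.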