Clinical measures: reliable or not?

SIR: We were interested to read the paper by Gregson and colleagues [1] in the May edition of Age and Ageing, not just because we have a common interest in improving the measurement of post-stroke impairment but also because their findings appear to contradict our own [2]. Although the findings of this study influence our interpretation of the published reliability data for the Modified Ashworth Scale (MAS), the problem of the measurement of low tone, or flaccidity, remains. Moreover, it is interesting that the 'unweighted' κ values were similar to those we found for the three-category scale, spastic/normal/flaccid, which we tested, and to the values found in other studies of the MAS [3, 4].

Gregson and colleagues [1] employed the weighted κ on the justification that "a difference of one point on each of the scales would not be considered clinically significant". This statement and their findings suggest that, although the MAS may be reliable when a standardized assessment procedure is used, it might still be a rather blunt instrument to measure change in response to an intervention.

The use of the weighted κ statistic by Gregson and colleagues [1] is statistically valid, but we are concerned that it may have given an inflated impression of the real clinical reliability of the MAS. In their seminal paper on reliability, Bland and Altman [5] argue cogently that agreement is the central issue. More recently they have pointed out that reliability is not a statistical issue but a clinical decision [6]: statistical methods should be used to inform but not direct those clinical decisions. We agree with the sentiment and are concerned that the weighted κ statistic, despite its excellent statistical pedigree, can act contrary to it, converting poor clinical agreement into high statistical agreement.
For example, the two assessors rated the muscle tone of the knee on the MAS differently for 47% of patients, but the weighted κ was good (0.73) [1]. If all or most of the disagreements were only by one scale point, clinical reliability probably exists, and this weighted κ value reflects that. However, if a reasonable proportion were by two points or more, the weighted κ value does not reflect the clinical reality.

VALERIE M. POMEROY, BRIAN FARAGHER, RAY C. TALLIS
The Stroke Association's Therapy Research Unit, Clinical Sciences Building, Hope Hospital, Eccles Old Road, Salford M6 8HD, UK
Fax: (+44) 161 787 5722
Email: vpomeroy@fs1.ho.man.ac.uk

1. Gregson JM, Leathley MJ, Moore P et al. Reliability of measurements of muscle tone and muscle power in stroke patients. Age Ageing 2000; 29: 223–8.
2. Pomeroy VM, Dean D, Sykes L et al. The unreliability of clinical measures of muscle tone: implications for stroke therapy. Age Ageing 2000; 29: 229–33.
3. Allison SC, Abraham LD, Petersen CL. Reliability of the Modified Ashworth Scale in the assessment of plantar flexor muscle spasticity in patients with traumatic brain injury. Int J Rehabil Res 1996; 19: 67–78.
4. Haas BM, Bergstrom E, Jamous A et al. The inter-rater reliability of the original and of the Modified Ashworth Scale for the assessment of spasticity in patients with spinal cord injury. Spinal Cord 1996; 34: 560–4.
5. Bland J, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; I: 307–10.
6. Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999; 8: 135–60.

SIR: We were interested to see the juxtaposition of our article with that of Pomeroy et al. in Age and Ageing, volume 29, number 3 [1]. We were further interested to read this group's letter and the editorial by Ward [2]. We were pleased that we had a positive influence on Pomeroy and colleagues' interpretation of published data.
It is clear that they and we have read the same articles published up to March 1999 on measurement of tone. However, we have drawn different conclusions and have gone down a different experimental path, leading to several publications which support the reliability of clinical measures [3, 4]. While we agree that the modified Ashworth scale is of no value in measuring low tone, we do not agree that previous empirical evidence is unequivocal in showing it to be unreliable in measuring normal and increased tone. This is most evident in the work of Bohannon and Smith [5], dismissed by Pomeroy et al. as having used the wrong statistics. The article contains the raw data, and if the appropriate statistics are used (i.e. κ [6] or weighted κ [7]), we see very good agreement, κ = 0.83 and κw = 0.98.

Pomeroy et al. question the use of κ and weighted κ statistics, quoting an example of our work in which muscle tone at the knee was rated differently by the two raters in 47% of patients but with a κ of 0.73, and arguing that these differences may be of two or more points. However, use of κ with quadratic weights, whilst giving partial credit for ratings which differ by one point on the scale, gives only minimal credit for ratings which differ by two points or more. Thus, a high κ value implicitly reflects that most differences were of only one point. We would be happy to provide our raw data to anyone who may be interested.

Furthermore, we believe that a difference of one point in a given patient would not immediately be considered as a definite and clinically relevant change, but would rather be noted as a possible change and reassessed on another occasion. Thus, we are not considering reliability as merely a statistical issue, but rather we would use it as a guide to our clinical practice.

We are puzzled why Pomeroy et al. did not train their raters in use of the tools or indeed why they suggest training should decrease reliability.
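The point about quadratic weights made above can be illustrated with a small numerical sketch. The two contingency tables below are entirely hypothetical (they are not the data of either study) and assume a six-grade ordinal scale scored 0 to 5 for two raters and 100 patients. Both tables have the same 47% raw disagreement, but in the first every disagreement is by one point, while in the second every disagreement is by two or three points. The function name `cohen_kappa` and the table names are illustrative assumptions.

```python
def cohen_kappa(table, weights=None):
    """Cohen's kappa for a square rater-by-rater contingency table.

    weights: None (unweighted), "linear" or "quadratic".
    Computed as 1 - (observed weighted disagreement / chance-expected
    weighted disagreement).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        # Disagreement penalty for a pair of ratings i (rater A), j (rater B).
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d * d if weights == "quadratic" else d

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(penalty(i, j) * table[i][j] for i, j in cells) / n
    expected = sum(penalty(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / (n * n)
    return 1.0 - observed / expected


# HYPOTHETICAL data: rows = rater A grade, columns = rater B grade.
# 53 of 100 patients on the diagonal; all 47 disagreements are by one point.
one_point = [
    [20, 6, 0, 0, 0, 0],
    [6, 10, 5, 0, 0, 0],
    [0, 5, 8, 5, 0, 0],
    [0, 0, 5, 7, 4, 0],
    [0, 0, 0, 4, 5, 4],
    [0, 0, 0, 0, 3, 3],
]

# HYPOTHETICAL data: same diagonal, but all 47 disagreements are by
# two or three points.
two_plus_points = [
    [20, 0, 6, 4, 0, 0],
    [0, 10, 0, 5, 0, 0],
    [6, 0, 8, 0, 5, 0],
    [3, 5, 0, 7, 0, 4],
    [0, 0, 5, 0, 5, 0],
    [0, 0, 0, 4, 0, 3],
]

for name, tab in [("one-point", one_point), ("two-plus", two_plus_points)]:
    print(name,
          "unweighted:", round(cohen_kappa(tab), 2),
          "quadratic:", round(cohen_kappa(tab, "quadratic"), 2))
```

On these invented tables the unweighted κ is nearly identical (a little above 0.4 in both), whereas the quadratic-weighted κ is about 0.9 when disagreements are by one point and falls to roughly 0.56 when they are by two or three points. This is only a sketch under the stated assumptions; it shows how the weighted statistic responds to the size of disagreements, not what the distribution of disagreements was in any published data set.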
It is a tenet of research that standardization of methods will reduce