Generalisability: a key to unlock professional assessment

Jim Crossley,1 Helena Davies,2 Gerry Humphris3 & Brian Jolly4

Context  Reliability is defined as the extent to which a result reflects all possible measurements of the same construct. It is an essential measurement characteristic. Unfortunately, there are few objective tests for the most important aspects of the professional role because they are complex and intangible. In addition, professional performance varies markedly from setting to setting and case to case. Both these factors threaten reliability.

Aim  This paper describes the classical approach to evaluating reliability and points out its limitations. It goes on to describe how generalisability theory overcomes many of these limitations.

Conclusions  A G-study uses variance component analysis to measure the contributions that all relevant factors make to the result (observer, situation, case, assessee and their interactions). This information can be combined to reflect the reliability of a single observation as a reflection of all possible measurements – a true reflection of reliability. It can also be used to estimate the reliability of a combined sample of several different observations, or to predict how many observations are required with different test formats to achieve a given level of reliability. Worked examples are used to illustrate the concepts.

Keywords  educational measurement *standards; education, medical, undergraduate *standards; professional competence *standards; observer variation; reproducibility of results.

Medical Education 2002;36:972–978

Introduction

This paper is the second in a series on professional assessment. The first article highlighted the importance of well-designed assessment in regulation and training, and laid out the general principles of assessment methodology. Reliability was identified as an important measurement characteristic.
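The definition above – that a reliable result "reflects all possible measurements of the same construct" – can be made concrete with a toy simulation. The following sketch is purely illustrative and uses invented numbers, not data from this paper: each trainee has a stable underlying level, every single observation adds case-specific error, and reliability then appears as the agreement (correlation) between two independent observations of the same trainees.

```python
import random

random.seed(1)

# Hypothetical example: 200 trainees, each with a stable underlying level.
n = 200
true_level = [random.gauss(0, 1.0) for _ in range(n)]

def observe(noise_sd):
    """One observation per trainee: true level plus case-specific error."""
    return [t + random.gauss(0, noise_sd) for t in true_level]

def pearson(x, y):
    """Pearson correlation between two score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Strong case-specificity (large per-observation error): two single
# observations of the same trainees agree poorly ...
r_noisy = pearson(observe(1.5), observe(1.5))

# ... whereas with little case-specificity they agree well.
r_stable = pearson(observe(0.3), observe(0.3))
```

Note what the single correlation cannot do: it quantifies overall agreement, but it cannot say how much of the disagreement came from cases, observers or occasions. That is precisely the gap generalisability theory fills.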
Regulatory decisions are increasingly challenged because the tools on which they are based are not reliable.1 The reliability of a given measurement is the extent to which it reflects all possible measurements of the same construct. Achieving reliability is a particular challenge in professional assessment for two reasons:

1  The professional role is made up of complex behaviours. Apparently 'objective' methods such as knowledge tests do not reflect the richness of these behaviours; they lack authenticity. Equally, it is difficult to reduce these behaviours to a checklist of observable processes. Attempts to measure them directly depend in part on subjective judgements about performance, and it takes substantial effort to ensure the reliability of such judgements.

2  Professional behaviour is highly dependent upon the nature and details of the problem being faced. The same attribute may be demonstrated more or less clearly in different settings or different test observations (e.g. different cases). This is known as case-specificity.2

In this paper we examine the classical approach to quantifying and controlling reliability, and show how generalisability theory offers an important extension to this approach. The basic principles of the theory are described here, but a newcomer wishing to apply the technique should refer to the cited texts or seek advice from a colleague with prior experience. The complexity of the theory is widely acknowledged, and there is limited experience amongst statisticians.3
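The G-study described in the abstract can be sketched in code. This is a minimal illustration under assumed conditions – invented variance components and a fully crossed trainee × case design – and not the authors' own analysis: variance components are estimated from the mean squares, combined into a generalisability coefficient for a single observation or for a mean of several, and then used, decision-study style, to project how many cases a target reliability would require.

```python
import math
import random

random.seed(0)

# Invented variance components for illustration (not from the paper):
# trainee ("assessee") effects, case effects and residual error.
n_p, n_c = 50, 8                      # 50 trainees each scored on 8 cases
sd_person, sd_case, sd_resid = 1.0, 0.5, 1.2

person_eff = [random.gauss(0, sd_person) for _ in range(n_p)]
case_eff = [random.gauss(0, sd_case) for _ in range(n_c)]
scores = [[person_eff[p] + case_eff[c] + random.gauss(0, sd_resid)
           for c in range(n_c)] for p in range(n_p)]

# Estimate variance components from the mean squares of a crossed
# person x case design (the core of a G-study).
grand = sum(sum(row) for row in scores) / (n_p * n_c)
p_means = [sum(row) / n_c for row in scores]
c_means = [sum(scores[p][c] for p in range(n_p)) / n_p for c in range(n_c)]

ss_p = n_c * sum((m - grand) ** 2 for m in p_means)
ss_c = n_p * sum((m - grand) ** 2 for m in c_means)
ss_tot = sum((s - grand) ** 2 for row in scores for s in row)
ss_res = ss_tot - ss_p - ss_c

ms_p = ss_p / (n_p - 1)
ms_c = ss_c / (n_c - 1)
ms_res = ss_res / ((n_p - 1) * (n_c - 1))

var_res = ms_res                           # residual (incl. trainee x case)
var_p = max((ms_p - ms_res) / n_c, 0.0)    # universe-score (trainee) variance
var_c = max((ms_c - ms_res) / n_p, 0.0)    # case variance (reported, not used
                                           # in the relative coefficient below)

def g_coefficient(n_cases):
    """Relative G coefficient for the mean of n_cases observations."""
    return var_p / (var_p + var_res / n_cases)

def cases_needed(target):
    """D-study projection: cases required to reach a target coefficient."""
    return math.ceil(target / (1 - target) * var_res / var_p)
```

Here `g_coefficient(1)` gives the reliability of a single case, `g_coefficient(8)` that of the eight-case mean, and `cases_needed(0.8)` projects the sampling a decision (D-) study would recommend. In a real G-study the components would usually be estimated with dedicated software rather than by hand.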
1 Department of Paediatric Medicine, Sheffield Children's Hospital, Sheffield, UK
2 Medical Education, Sheffield Children's Hospital, Sheffield, UK
3 University Department of Psychiatry, Manchester Royal Infirmary, Manchester, UK
4 Centre for Medical and Health Sciences Education, Faculty of Medicine, Monash University, Victoria, Australia

Correspondence: Jim Crossley, Department of Paediatric Medicine, Sheffield Children's Hospital, Western Bank, Sheffield S10 2TH, UK. E-mail: j.crossley@sheffield.ac.uk