Evaluating Effects of Treatment in Subgroups of Patients Within a Clinical Trial: The Case of Non-Q-Wave Myocardial Infarction and Beta Blockers

Salim Yusuf, MRCP, DPhil, Janet Wittes, PhD, and Jeffrey Probstfield, MD

Most medical researchers believe that randomized clinical trials are the best means of evaluating the effects of a treatment on outcomes in a particular disease.1 Randomized clinical trials are particularly important when the plausible effect is only moderate, e.g., a 15, 20 or 25% reduction in the risk of developing a major adverse outcome such as death, reinfarction or stroke.2-4 In order to detect moderate treatment effects reliably, the errors inherent in the clinical trial must be relatively small. Two sources of error occur, systematic biases and random errors, and both should be minimized. Both these "errors" affect the reliable detection of treatment effects in a trial as a whole or in subgroups within the trial.2

Systematic biases in a single trial are avoided by allocating patients to active treatment or control using strict randomization (not by alternation, odd-even dates or any other method that allows foreknowledge of treatment assignment). Further, analyses should include all randomized patients, and the results should be reported according to rules established in the protocol or before knowledge of the results. Random errors are chiefly avoided by having studies of sufficient size and by combining the results of several related trials. For most common treatments of interest in cardiovascular disease, because the plausible range of effects is only about 15 or 20%, often several hundred to about a thousand events are needed to reach reliable conclusions.2,4

Even when the above criteria are satisfied, we are only in a position to provide reliable answers regarding the average effects in the overall trial, but not about the effects in specific subgroups.
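The event counts quoted above for effects of 15 to 25% can be sanity-checked with a standard approximation. The sketch below uses Schoenfeld's formula for the total number of events in a trial with 1:1 allocation; this is an illustrative choice, not a method the authors describe. The function name `events_needed`, the two-sided 5% significance level and the 90% power default are all assumptions made for this example.

```python
import math
from statistics import NormalDist


def events_needed(risk_reduction, power=0.90, alpha=0.05):
    """Approximate total events for a 1:1 randomized trial.

    Uses Schoenfeld's approximation (an assumption of this sketch):
        events = 4 * (z_{1-alpha/2} + z_{power})^2 / (ln HR)^2
    where HR = 1 - risk_reduction, e.g. 0.80 for a 20% reduction.
    """
    if not 0.0 < risk_reduction < 1.0:
        raise ValueError("risk_reduction must be strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    log_hr = math.log(1.0 - risk_reduction)
    return math.ceil(4.0 * (z_alpha + z_power) ** 2 / log_hr ** 2)


if __name__ == "__main__":
    # Moderate effects of the size discussed in the text.
    for rr in (0.15, 0.20, 0.25):
        print(f"{rr:.0%} reduction: about {events_needed(rr)} events")
```

Under these assumptions the required number of events runs from roughly five hundred (25% reduction) to somewhat over fifteen hundred (15% reduction), the same order of magnitude as the figures in the text. Smaller effects inflate the requirement quickly because the log hazard ratio enters as a square.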
How, then, should one apply the results of a trial to specific subsets of patients, each subset being only a part of the overall data? In this commentary, we will point out the following: the treatment effect is likely to be qualitatively similar (i.e., in the same direction) in all subgroups of patients without obvious contraindications to treatment but is also likely to be quantitatively dissimilar (differences in degree of effect) even when the effects appear to be identical; and estimates of treatment effect within a subgroup chosen for special emphasis are usually "biased" and so the most appropriate estimate in a subgroup is closer to the overall result.

From the Clinical Trials Branch, National Heart, Lung, and Blood Institute, Bethesda, Maryland and the Veterans Administration Cooperative Studies Group, West Haven, Connecticut. Manuscript received and accepted April 18, 1990. Address for reprints: Salim Yusuf, MRCP, DPhil, Clinical Trials Branch, Division of Epidemiology and Clinical Applications, National Heart, Lung, and Blood Institute, Bethesda, Maryland 20892.

Low likelihood of differences in kind (qualitative interaction) but higher likelihood of differences in degree (quantitative interaction): Patients who are thought to be clearly benefited or definitely harmed by a given treatment are usually not entered into trials. Therefore, a priori, trials exclude the "extremes" expected on biologic and pharmacologic grounds. The low likelihood of qualitative interactions in a trial is supported by the experiences from cardiovascular clinical trials conducted in the last 3 decades.3,4 Claims made in individual trials for apparent qualitative interactions between various subgroups have not been replicated in further studies.
For example, Andersen et al5 claimed that long-term β blockade is beneficial among patients <65 years of age and harmful in those older.5 The international practolol study claimed that treatment was only beneficial among those with anterior infarction, but not among those with inferior infarction.6 In both cases, subsequent studies showed benefit in the elderly and in those with inferior myocardial infarction (MI). Another example concerns the evaluation of thrombolytic agents in acute MI. Many investigators were so convinced that treatment >6 hours would be of little benefit that several studies excluded such patients (i.e., a strong prior expectation of a qualitative interaction). The available data suggest, however, that even delayed treatment provides about two-thirds of the benefit of earlier treatment (a quantitative interaction).3,7

Biases and errors in detecting subgroup effects within a trial: STATISTICAL POWER: In a trial designed with power adequate to detect a given difference in the overall trial, the power to detect similarly sized differences within the various subgroups is substantially lower. The smaller the subgroup, the lower the power. For example, the β Blocker Heart Attack Trial (BHAT), which randomized about 4,000 patients, was designed to have 90% power to detect an overall reduction in mortality of 25%.8 To have 90% power of detecting a similar effect in a specific subset of patients (with similar event rates), one would need 4,000 patients in each subset of interest. Conversely, in a trial such as BHAT, which provides a clear result overall, even if the effect is identical in several large subsets, random variation may exaggerate or dilute the effects so that some subgroups may spuriously appear to have a large effect and others no effect or even a harmful effect. A practical demonstration of this point is an analysis reported by the Second
220 THE AMERICAN JOURNAL OF CARDIOLOGY VOLUME 66