STATISTICAL TESTS BASED ON TRAINING SAMPLES

I. G. Bairamov and Yu. I. Petunin

UDC 519.21

We propose consistent tests of statistical hypotheses based on training samples. Unbiased consistent estimators of the probability of no decision are obtained.

Many problems in the natural and social sciences and in engineering require the construction of tests for statistical hypotheses. The classical theory of hypothesis testing was created in the first half of the 20th century [1-3]. A major component of this theory is the Neyman-Pearson lemma [3], which produces the most powerful test for the simple hypotheses $H_0$ and $H_1$. This test can be constructed, and the probabilities of its type I and type II errors calculated, if we know the exact sample distribution functions $F_0(u_1, \ldots, u_n)$ and $F_1(u_1, \ldots, u_n)$ under the hypotheses $H_0$ and $H_1$. Knowledge of the hypothetical distribution functions $F_j(u_1, \ldots, u_n)$ $(j = 1, 2, \ldots, r)$ of the sample $x_1, \ldots, x_n$ under the competing hypotheses $H_j$ $(j = 1, 2, \ldots, r)$ is necessary for the construction of virtually all tests in the classical theory of hypothesis testing, and also for estimating the error probabilities associated with false decisions when the hypothesis $H_j$ is actually true.

Unfortunately, in practice the functions $F_j(u_1, \ldots, u_n)$ are known only in exceptional cases, and the researcher as a rule does not have complete information about the distribution functions $F_j(u_1, \ldots, u_n)$, $j = 1, \ldots, r$. This precludes direct (i.e., unmodified) application of the classical theory of hypothesis testing in statistical practice, whereas replacing the hypothetical distribution functions $F_j(u_1, \ldots, u_n)$ with empirical functions $F_j^*(u_1, \ldots, u_n)$ based on sample values leads to considerable difficulties, which so far have not been resolved. It is thus necessary to construct tests for hypotheses based on training samples, and not on the functions $F_j(u_1, \ldots, u_n)$, and these tests should be designed so as to enable us to compute the probabilities of type I and type II errors.
This formulation of the problem of test construction is typical of the statistical theory of pattern recognition. It differs substantially from the classical hypothesis testing problem and better fits practical needs. The purpose of this study is to describe a number of tests based on training samples and to compute the probabilities of type I and type II errors, as well as the probability of no decision.

Let $G_0$ and $G_1$ be populations with unknown distribution functions $F_x(u)$ and $F_y(u)$, respectively, and let $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ be training samples (templates) from these populations. Denote by $x_{(1)} \le \ldots \le x_{(n)}$ and $y_{(1)} \le \ldots \le y_{(m)}$ the ordered samples constructed from the original samples $(x_1, \ldots, x_n) = x$ and $(y_1, \ldots, y_m) = y$, respectively. Then $x_{(i)}$ and $y_{(i)}$ are the order statistics for these samples. Suppose that the sample $z_1, \ldots, z_k$, consisting of finitely many sample values, is drawn by simple random sampling from an unknown population $G_i$ $(i = 0, 1)$. It is required to construct tests that determine the index $i$ of this population (or, equivalently, identify the population $G_i$ from which the sample $z_1, \ldots, z_k$ was drawn). Denote by $H_i$ $(i = 0, 1)$ the hypothesis that $z = (z_1, \ldots, z_k) \in G_i$ $(i = 0, 1)$. We thus obtain the null hypothesis $H_0$ that the sample $z_1, \ldots, z_k$ has the hypothetical distribution $F(u) = F_x(u)$; then the alternative hypothesis is $H_1 = \{F(u) = F_y(u)\}$.

Assume that the order statistics $x_{(1)}$, $x_{(n)}$, $y_{(1)}$, $y_{(m)}$ satisfy the inequality

$$x_{(1)} \le y_{(1)} \le x_{(n)} \le y_{(m)}. \tag{1}$$

For this arrangement of the order statistics, we will construct a finite family of tests for the truth of the hypothesis $H_i$ based on training samples, for which we can exactly determine the probabilities of type I and type II errors and also compute an unbiased and consistent estimator of the probability of no decision (a different ordering of the statistics requires an obvious modification of these tests).
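As an illustration of the setup only, the following minimal sketch computes the extreme order statistics of two training samples and checks whether they fall under the arrangement (1), read here as $x_{(1)} \le y_{(1)} \le x_{(n)} \le y_{(m)}$. The function names and the numerical values are hypothetical and are not taken from the paper; the sketch does not implement the tests themselves.

```python
from typing import Sequence, Tuple


def extreme_order_statistics(sample: Sequence[float]) -> Tuple[float, float]:
    """Return the smallest and largest order statistics of a non-empty sample."""
    ordered = sorted(sample)  # ordered[i - 1] is the i-th order statistic
    return ordered[0], ordered[-1]


def satisfies_ordering_1(x: Sequence[float], y: Sequence[float]) -> bool:
    """Check the assumed arrangement (1): x_(1) <= y_(1) <= x_(n) <= y_(m)."""
    x_min, x_max = extreme_order_statistics(x)
    y_min, y_max = extreme_order_statistics(y)
    return x_min <= y_min <= x_max <= y_max


# Made-up training samples, for illustration only.
x_train = [0.3, 1.2, 2.5, 0.9]  # plays the role of the sample from G_0
y_train = [1.0, 3.1, 2.2, 4.0]  # plays the role of the sample from G_1

print(satisfies_ordering_1(x_train, y_train))
```

If the check fails, the samples are in one of the other arrangements, for which the text indicates that the tests need a corresponding modification.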
Thus, in what follows we assume that the extreme order statistics of the samples $x$ and $y$ satisfy the ordering (1).

Translated from Kibernetika, No. 3, pp. 74-77, May-June, 1991. Original article submitted February 15, 1988.

0011-4235/91/2703-0408$12.50 © Plenum Publishing Corporation