CLASSICAL AND PREDICTION APPROACHES TO ESTIMATING DISTRIBUTION FUNCTIONS FROM SURVEY DATA

Lynn Kuo, University of Connecticut
U-120, Storrs, CT 06268

KEY WORDS: auxiliary information, design-based approach, finite population distribution function, model-based approach, nonparametric regression.

Abstract

Both classical and superpopulation approaches to estimating finite population distribution functions are considered. For the superpopulation approach, nonparametric regression methodology is applied to predict the finite population distribution when auxiliary information is available. Some comparisons are made for the estimators by Monte Carlo methods.

1. Introduction

Estimation of the finite population distribution function from survey data with a design-based approach has received some attention recently. Sedransk and Sedransk (1979) illustrate the usefulness of the sample cumulative distribution functions (CDFs) for stratified designs in making comparisons among subpopulations. Cohen and Kuo (1985a and b) study the properties of the sample CDF from a decision theoretical point of view. They show that the sample CDF is admissible for estimating the population distribution function for a class of loss functions with any fixed size sample design. For each of the loss functions, they show that simple random sampling combined with a step function estimator is the minimax strategy. Francisco and Fuller (1986) study the large sample properties of the sample CDF from stratified cluster samples. Kuk (1988) evaluates the mean squared errors of the Horvitz-Thompson estimator for the distribution function and other related estimators.

The model-based approach to estimating a distribution function has also been studied. Binder (1982) proposes a nonparametric Bayesian approach to estimating the finite population distribution function for simple random sampling and stratified designs. Chambers and Dunstan (1986) propose an estimator when auxiliary information is available.
The variable of interest Y is assumed to be related to the auxiliary variable X by a regression function through the origin with heteroscedastic errors.

This paper focuses on the finite population distribution function when auxiliary information is available. The regression assumption used by Chambers and Dunstan is relaxed. Let us assume that the finite population consists of N ordered pairs (X_i, Y_i), i = 1, ..., N, generated from a bivariate distribution P. The finite population joint distribution function is defined by

\[ F(s,t) = N^{-1} \sum_{i=1}^{N} I[X_i \le s,\; Y_i \le t]. \]

The finite population distribution function (of the Y variable) is defined by

\[ F_Y(t) = N^{-1} \sum_{i=1}^{N} I[Y_i \le t]. \]

Let us assume we observe all the X variables {X_i}, i = 1, ..., N, and a sample of size n of the Y variables selected by a design. The objective is to estimate F(s,t) and F_Y(t) given the sample. Several nonparametric regression estimators are proposed. These regression-type estimators include the naive weighting, the kernel, and the nearest neighbor estimators.

The method proposed here makes use of a nonparametric superpopulation model. Consequently, it is a nonparametric model-based approach. The finite population F(s,t) is generated by a bivariate distribution P. The superpopulation P is usually the object of inference in nonparametric density estimation. Stone (1977) and Silverman (1985) provide more detailed discussion in this area, where a sample of size n, (X_i, Y_i), i = 1, ..., n, is chosen from the bivariate distribution P. Since we observe all the X variables in the finite population, we can assume that the {X_i}, i = n+1, ..., N, are random variables chosen from the X marginal distribution P_X. In addition, we observe the ordered pairs {X_i, Y_i}, i = 1, ..., n, from the bivariate distribution P. Cohen and Kuo (1988) derive the nonparametric generalized maximum likelihood estimator, nonparametric Bayesian estimator and histogram estimator of P.
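As a small illustration of the two finite population distribution functions defined in this section, the following Python sketch evaluates F(s,t) and F_Y(t) for a fully known toy population. The function names, the toy data, and the normalization by N are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np


def joint_cdf(x, y, s, t):
    """Finite population joint distribution function F(s, t).

    Returns the fraction of population units with X_i <= s and Y_i <= t.
    Normalizing by N is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((x <= s) & (y <= t)))


def marginal_cdf_y(y, t):
    """Finite population distribution function F_Y(t) of the Y variable."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(y <= t))


# Hypothetical toy population of N = 5 ordered pairs (X_i, Y_i).
x_pop = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pop = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

f_joint = joint_cdf(x_pop, y_pop, s=3.0, t=2.0)  # units 1 and 2 qualify
f_y = marginal_cdf_y(y_pop, t=3.0)               # units 1, 2 and 4 qualify
```

In the survey setting considered here, only n of the Y values are observed, so these population quantities cannot be computed directly and must be estimated or predicted from the sample together with the full set of X values.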
The predictor of F is studied in this paper by means of the nonparametric regression method. This method has at least three advantages. (1) It is nonparametric. Therefore, it relieves survey statisticians of the burden of selecting a parametric model for P. (2) It incorporates the information from the auxiliary variable X by means of the superpopulation P. (3) It adapts the amount of smoothing to the local density dP.

Our primary interest is to predict the finite population distribution function F and the marginal distribution F_Y. Other parameters of interest (for example, the population total \( Y = \sum_{i=1}^{N} Y_i \) or the ratio \( R = \sum_{i=1}^{N} Y_i / \sum_{i=1}^{N} X_i \)) can also be predicted using the predictors of F_Y(t) and F(s,t). These predictors are also studied in this paper. Nonparametric predictors of the distributions F, F_Y and their functionals Y and R are given in Section 2. Monte Carlo results are given in Section 3.

2. Nonparametric Regression Estimators

Let us recall that the data consist of n completely observed ordered pairs (X_i, Y_i), i = 1, ..., n, and additional X_i values, i = n+1, ..., N. Two predictors of F(s,t) can be obtained.