A tree based procedure for multivariate imputation
Una Procedura ad Albero per l'Imputazione Multivariata

Riccardo Borgoni
Southampton Statistical Sciences Research Institute, University of Southampton, borg1@socsci.soton.ac.uk

Ann Berrington
Division of Social Statistics, University of Southampton, amb6@socsci.soton.ac.uk

Abstract: In this paper we propose a multivariate imputation procedure based on the iterated application of sequences of tree models. An application to the risk of being a smoker illustrates the proposed procedure and shows how it can be combined with bootstrap and multiple imputation methods to estimate the variability of the estimates while accounting for the non-response process.

Keywords: Classification Tree, Multivariate Imputation, Bootstrap, Multiple Imputation

1. Introduction

In this paper we propose a tree based approach for multivariate imputation. We demonstrate through a case study how computer intensive methods can be used to address the extra variability that the imputation procedure introduces into regression coefficient estimates.¹

A tree model (Breiman et al., 1984) provides the conditional distribution of an outcome variable $Y$ given a vector of predictors $X$. Such models are fully described by the pair $(T, \boldsymbol{\theta})$, where $T$ is a tree with $R$ terminal nodes and $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_R\}$ is a vector of parameters such that $\theta_r$ is associated with leaf $r$, for $r = 1, \ldots, R$. If $X$ lies in the region corresponding to leaf $r$, then $Y \mid X$ is distributed according to a probability law $p(Y \mid \theta_r)$. A binary tree partitions the predictor space into $R$ subsets identified by the terminal nodes of the tree. A splitting rule is applied at each internal node, using the predictors to allocate every unit to its left or right child node. We consider the case where both the response and all of the predictors are categorical, which is a common situation in applications based on survey data.
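The description of a tree model as a pair of a tree and a vector of leaf parameters can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions, not the authors' implementation: the predictor names (`parent_smokes`, `sex`), the outcome categories and all probabilities are hypothetical and chosen for readability. Each internal node sends a unit to its left child when the value of the splitting predictor falls in a given subset of categories, and each leaf stores the parameter of the law of the outcome in that leaf, taken here to be a vector of category probabilities.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Union


@dataclass
class Leaf:
    """Terminal node r with its parameter theta_r."""
    label: str
    theta: Dict[str, float]  # outcome category -> probability (made-up values)


@dataclass
class Split:
    """Internal node: a unit goes left when x[predictor] is in left_categories."""
    predictor: str
    left_categories: FrozenSet[str]
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def find_leaf(node: Node, x: Dict[str, str]) -> Leaf:
    """Apply the splitting rules from the root down until a leaf is reached."""
    while isinstance(node, Split):
        go_left = x[node.predictor] in node.left_categories
        node = node.left if go_left else node.right
    return node


def leaves(node: Node) -> List[Leaf]:
    """List the terminal nodes; their number is R."""
    if isinstance(node, Leaf):
        return [node]
    return leaves(node.left) + leaves(node.right)


# Hypothetical tree T with R = 3 leaves; all names and numbers are illustrative.
tree = Split(
    predictor="parent_smokes",
    left_categories=frozenset({"yes"}),
    left=Split(
        predictor="sex",
        left_categories=frozenset({"male"}),
        left=Leaf("r1", {"smoker": 0.6, "non-smoker": 0.4}),
        right=Leaf("r2", {"smoker": 0.5, "non-smoker": 0.5}),
    ),
    right=Leaf("r3", {"smoker": 0.2, "non-smoker": 0.8}),
)

unit = {"parent_smokes": "yes", "sex": "female"}
leaf = find_leaf(tree, unit)
print(leaf.label, leaf.theta)
```

In this sketch the unit is routed left at the root and right at the second split, so it falls in leaf `r2`, and its outcome would be described by the probabilities stored there.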
¹ We thank Ray Chambers for his comments. Any errors remain the fault of the authors. We acknowledge the financial support of the Department of Health for England, grant number 0370020. The views expressed here are those of the authors and not necessarily those of the Department of Health. Data from the 1970 British Cohort Study were made available via the UK Data Archive.

In the categorical predictor case, the splitting rule classifies a subject $i$ into, say, the left child node if $X_i \in C$, where $C$ is a subset of the categories of the predictor. For a categorical response variable, each terminal node identifies a multinomial law with parameter $\boldsymbol{\theta}_r$, where $\boldsymbol{\theta}_r$ is the vector of multinomial probabilities $\theta_{rj} = \Pr\{y \in j \mid r\}$, $j \in \Im$, $\Im$ being the set