Computers and Chemical Engineering 26 (2002) 17–39

A method of robust multivariate outlier replacement

K.A. Hoo a,*, K.J. Tvarlapati b, M.J. Piovoso c, R. Hajare d

a Department of Chemical Engineering, Texas Tech University, PO Box 43121, Lubbock, TX 79409-3121, USA
b Department of Chemical Engineering, University of South Carolina, Columbia, SC 24208, USA
c School of Graduate Professional Studies, Penn State University, Malvern, PA 19355, USA
d Core Engineering Specialists, Exxon Chemical Company, Baytown, TX 77522-4900, USA

Received 25 October 2000; received in revised form 17 August 2001; accepted 17 August 2001

Abstract

Robust multivariate methods for dealing with problems caused by outliers in the data are essential, especially when process data are used to validate mechanistic models and develop regression models, and in applications such as controller design and process monitoring. Gross outliers are detected easily by simple methods such as range checking; a multivariate outlier, however, is very difficult to discern, and techniques that rely on data to generate empirical models may produce erroneous results. In this work, a methodology to perform multivariate outlier replacement in the score space generated by principal component analysis (PCA) is proposed. The objective was to find an accurate estimate of the covariance matrix of the data so that a PCA model might be developed that could then be used for monitoring and fault detection and identification. The methodology uses the concept of winsorization to provide robust estimates of the mean (location) and S.D. (scale) iteratively, yielding a robust set of data. The paper develops the approach, discusses the concept of robust statistics and winsorization, and presents the procedures for robust multivariate outlier filtering. One simulated and two industrial examples are provided to demonstrate the approach. © 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Principal component analysis; Multivariate outliers; Winsorizing; Location; Scale; MADM

www.elsevier.com/locate/compchemeng

1. Introduction

The data obtained from any process can be used for model development, monitoring and control, and parameter estimation, to name a few. Outliers, in general, represent data elements that are irrelevant, grossly erroneous, or abnormal in some other way compared with the majority of the data. They can be caused by sensor faults or failures that are inherent in any data set. These outliers may lead to incorrect conclusions if the data are analyzed without accounting for their effects. Hence, it is important that these erroneous observations be given less importance. Univariate techniques for outlier detection should not be applied to multivariate data because of the existence of correlations between the variables (Huber, 1981). Samples in a single variable that appear as outliers when analyzed univariately may not appear as outliers when the variability of the other variables is considered. Conversely, multivariate outliers sometimes cannot be detected when the variables are analyzed univariately.

The extensive use of computers has created data overload, and since information from these data can be used to enhance process knowledge, it is necessary to extract useful information in a robust way. Data from the chemical process industries are naturally correlated, and multivariate techniques that analyze these measurements have gained substantial importance. Principal component analysis (PCA), which is one such technique, has tremendous potential in many disciplines in a variety of applications. In the chemical industry, PCA has been used to develop models for applications such as process monitoring, fault detection and identification (Bharati & MacGregor, 1998; Dunia & Qin, 1998). Such models depend on an accurate estimate of the correlation matrix of the data.
However, the presence of outliers in the data can have tremendous influence on the resulting estimates (Martens & Naes, 1979).

* Corresponding author. E-mail address: khoo@coe.ttu.edu (K.A. Hoo).
0098-1354/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved. PII: S0098-1354(01)00734-7
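
The point made in the introduction, that a sample can look unremarkable in every single variable and still be inconsistent with the correlation between variables, can be illustrated with a small sketch. This is not the method proposed in the paper (which works in the PCA score space with winsorization); it is only an illustration that assumes a made-up two-variable data set, a conventional three-standard-deviation univariate check, and a squared Mahalanobis distance compared against the 0.99 chi-square quantile for two degrees of freedom. All names and thresholds below are hypothetical choices for the example.

```python
import math

# Hypothetical data: 41 samples in which x2 closely tracks x1, plus one
# sample that breaks the correlation while staying inside the range of
# each individual variable.
samples = [(i / 10, i / 10 + 0.1 * math.sin(i)) for i in range(-20, 21)]
samples.append((1.5, -1.5))  # multivariate outlier, index 41


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance (n - 1 in the denominator)."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


x1 = [s[0] for s in samples]
x2 = [s[1] for s in samples]
m1, m2 = mean(x1), mean(x2)
s11, s22, s12 = covariance(x1, x1), covariance(x2, x2), covariance(x1, x2)

# Univariate check (assumed rule): flag a sample if either variable lies
# more than 3 standard deviations from its own mean.
univariate_flags = [
    k for k, (a, b) in enumerate(samples)
    if abs(a - m1) > 3 * math.sqrt(s11) or abs(b - m2) > 3 * math.sqrt(s22)
]

# Multivariate check: squared Mahalanobis distance, written out with the
# explicit inverse of the 2x2 covariance matrix.
det = s11 * s22 - s12 ** 2


def mahalanobis_sq(a, b):
    da, db = a - m1, b - m2
    return (s22 * da * da - 2 * s12 * da * db + s11 * db * db) / det


# For 2 degrees of freedom the chi-square quantile has the closed form
# -2 * ln(1 - p); p = 0.99 is an assumed confidence level.
threshold = -2 * math.log(1 - 0.99)
multivariate_flags = [
    k for k, (a, b) in enumerate(samples) if mahalanobis_sq(a, b) > threshold
]

print("univariate flags:  ", univariate_flags)
print("multivariate flags:", multivariate_flags)
```

On this data the univariate rule flags nothing, while the distance-based rule flags only the appended sample. Note that the mean and covariance used here are the ordinary (non-robust) estimates computed from the contaminated data, so with more or larger outliers they would themselves be distorted, which is the sensitivity the text describes.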