Computers and Chemical Engineering 26 (2002) 17 – 39
A method of robust multivariate outlier replacement
K.A. Hoo a,*, K.J. Tvarlapati b, M.J. Piovoso c, R. Hajare d
a Department of Chemical Engineering, Texas Tech University, PO Box 43121, Lubbock, TX 79409-3121, USA
b Department of Chemical Engineering, University of South Carolina, Columbia, SC 24208, USA
c School of Graduate Professional Studies, Penn State University, Malvern, PA 19355, USA
d Core Engineering Specialists, Exxon Chemical Company, Baytown, TX 77522-4900, USA
Received 25 October 2000; received in revised form 17 August 2001; accepted 17 August 2001
Abstract
Robust multivariate methods for dealing with outliers in the data are essential, especially when process data
are used to validate mechanistic models, to develop regression models, and in applications such as controller design and process
monitoring. Gross outliers are detected easily by simple methods such as range checking; a multivariate outlier, however, is very
difficult to discern, and techniques that rely on data to generate empirical models may then produce erroneous results. In this work, a
methodology to perform multivariate outlier replacement in the score space generated by principal component analysis (PCA) is
proposed. The objective was to find an accurate estimate of the covariance matrix of the data so that a PCA model might be
developed that could then be used for monitoring and fault detection and identification. The methodology uses the concept of
winsorization to provide robust estimates of the mean (location) and S.D. (scale) iteratively, yielding a robust set of data. The
paper develops the approach, discusses the concept of robust statistics and winsorization, and presents the procedures for robust
multivariate outlier filtering. One simulated and two industrial examples are provided to demonstrate the approach. © 2002
Elsevier Science Ltd. All rights reserved.
Keywords: Principal component analysis; Multivariate outliers; Winsorizing; Location; Scale; MADM
1. Introduction
The data obtained from any process can be used for
model development, monitoring and control, and
parameter estimation, to name a few. Outliers, in general,
represent data elements that are irrelevant,
grossly erroneous, or abnormal in some other way
compared with the majority of the data. They can be
caused by sensor faults or failures, and they are inherent in
any data set. These outliers may lead to incorrect
conclusions if the data are analyzed without accounting
for their effects. Hence, it is important that these
erroneous observations be given less weight.
Univariate techniques for outlier detection should not be
applied to multivariate data because of the
correlations between the variables (Huber, 1981).
Samples of a single variable that appear as outliers
when analyzed univariately may not appear as outliers
when the variability of the other variables is considered.
Conversely, multivariate outliers sometimes cannot be
detected when the variables are analyzed one at a time.
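The point about univariate checks can be illustrated with a minimal sketch. The data below are hypothetical (not taken from the paper): 21 samples in which the second variable closely tracks the first, plus one sample that lies well inside the range of each variable but violates their correlation. The sketch uses the classical sample mean and covariance and a chi-square cutoff on the squared Mahalanobis distance, which is a standard multivariate check and an assumption of this illustration, not the robust procedure developed in the paper.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_cov(a, b):
    """Sample covariance with the usual n - 1 denominator."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

# Hypothetical data: 21 clean samples in which x2 tracks x1 closely ...
clean = [(-2.0 + 0.2 * i, -2.0 + 0.2 * i + 0.1 * (-1) ** i) for i in range(21)]
# ... plus one sample that is unremarkable in each variable
# but breaks the correlation between them.
suspect = (1.0, -1.0)
points = clean + [suspect]

x1 = [p[0] for p in points]
x2 = [p[1] for p in points]
m1, m2 = mean(x1), mean(x2)
s11, s22, s12 = sample_cov(x1, x1), sample_cov(x2, x2), sample_cov(x1, x2)

# Univariate view: z-score of the suspect sample in each variable separately.
z_suspect = (
    (suspect[0] - m1) / math.sqrt(s11),
    (suspect[1] - m2) / math.sqrt(s22),
)

# Multivariate view: squared Mahalanobis distance from the sample mean,
# written out with the closed-form inverse of the 2x2 covariance matrix.
det = s11 * s22 - s12 ** 2

def mahalanobis_sq(p):
    d1, d2 = p[0] - m1, p[1] - m2
    return (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det

d2_all = [mahalanobis_sq(p) for p in points]

# Approximate 99% quantile of the chi-square distribution with 2 degrees
# of freedom, used here as an illustrative cutoff.
CHI2_99_2DOF = 9.21
flagged = [i for i, v in enumerate(d2_all) if v > CHI2_99_2DOF]

print("z-scores of suspect sample:", z_suspect)
print("indices flagged by Mahalanobis distance:", flagged)
```

In this sketch a three-sigma rule applied to each variable would pass the suspect sample, while the multivariate distance singles it out. Because the mean and covariance used here are themselves computed from contaminated data, the check can be masked when outliers are more numerous or more extreme, which is the motivation for robust estimates of location and scale.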
The extensive use of computers has created data
overload, and because information from these data can be
used to enhance process knowledge, it is necessary to
extract useful information in a robust way. Data from
the chemical process industries are naturally correlated, and
multivariate techniques that analyze these measurements
have gained substantial importance. Principal
component analysis (PCA), one such technique,
has tremendous potential in a variety of
applications across many disciplines. In the chemical industry, PCA
has been used to develop models for applications such
as process monitoring and fault detection and identification
(Bharati & MacGregor, 1998; Dunia & Qin, 1998).
Such models depend on an accurate estimate of the
correlation matrix of the data. However, the presence
of outliers in the data can strongly influence
the resulting estimates (Martens & Naes, 1979).
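The sensitivity of the correlation estimate can be seen in a small sketch. The data and the single gross error are hypothetical and chosen only for illustration. For two variables the correlation matrix has eigenvalues 1 + r and 1 - r, so for r >= 0 the fraction of variance captured by the first principal component is (1 + r)/2; the sketch reports that fraction before and after contamination.

```python
import math

def pearson_r(a, b):
    """Classical (non-robust) Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

# Hypothetical clean data: x2 follows x1 with a small alternating error.
x1_clean = [-2.0 + 0.2 * i for i in range(21)]
x2_clean = [v + 0.1 * (-1) ** i for i, v in enumerate(x1_clean)]

# The same data with one hypothetical gross error appended,
# for example a sensor spike in the second variable.
x1_bad = x1_clean + [2.0]
x2_bad = x2_clean + [-8.0]

r_clean = pearson_r(x1_clean, x2_clean)
r_bad = pearson_r(x1_bad, x2_bad)

# For a 2x2 correlation matrix with r >= 0, the leading eigenvalue is 1 + r,
# so the first principal component explains (1 + r) / 2 of the total variance.
pc1_share_clean = (1.0 + r_clean) / 2.0
pc1_share_bad = (1.0 + r_bad) / 2.0

print("correlation, clean data:       ", round(r_clean, 3))
print("correlation, one gross outlier:", round(r_bad, 3))
print("PC1 variance share, clean:     ", round(pc1_share_clean, 3))
print("PC1 variance share, outlier:   ", round(pc1_share_bad, 3))
```

A single bad sample out of 22 is enough here to turn a nearly perfect correlation into a weak one, and a PCA model built on the contaminated estimate would assign much of the variance to a direction that reflects the outlier rather than the process.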
* Corresponding author.
E-mail address: khoo@coe.ttu.edu (K.A. Hoo).
0098-1354/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved.
PII: S0098-1354(01)00734-7