Chemical Engineering and Processing 38 (1999) 477–486
Considering precision of experimental data in construction of
optimal regression models
Mordechai Shacham a,*, Neima Brauner b
a Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
b School of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
Received 26 March 1999; accepted 12 April 1999
Abstract
Construction of optimal (stable and of the highest possible accuracy) regression models, comprising linear combinations of
independent variables and their non-linear functions, is considered. It is shown that estimates of the experimental error, which are
most often available to engineers and experimental scientists, are useful for identifying the set of variables to be included in an
optimal regression model. Two diagnostic indicators, which are based on experimental error estimates, are incorporated in an
orthogonalized-variable-based stepwise regression (SROV) procedure. The use of this procedure, followed by regression
diagnostics, is demonstrated in two examples. In the first example, a stable polynomial model for heat capacity is obtained, which is ten
times more accurate than the correlation published in the literature. In the second example, it is shown that omission of important
variables related to reaction conditions prevents reliable modeling of the product properties. © 1999 Elsevier Science S.A. All
rights reserved.
Keywords: Collinearity; Data; Noise; Precision; Stepwise regression
1. Introduction
Obtaining experimental data is often very expensive
and time consuming. However, the accuracy and reliability of process-related calculations critically depend
on the accuracy, validity and stability of the regression
models fitted to the experimental data.
Regression models used for physico-chemical, ther-
modynamic or rate data can be partially theory based
or completely empirical. In both cases, it is not known
a priori how many explanatory variables (independent
variables and/or their functions) and parameters
should be included in the model for obtaining an
optimal regression model. An insufficient number of
explanatory variables results in an inaccurate model
characterized by a large variance. Some independent
variables which may have critical effects on the depen-
dent variable under certain circumstances, may be omit-
ted. On the other hand, the inclusion of too many
explanatory terms renders an unstable model. The in-
stability is characterized by typical ill effects, whereby
adding or removing an experimental point from the
data set may drastically change the parameter values.
Also, the derivatives of the dependent variable are not
represented correctly and extrapolation outside the re-
gion, where the measurements were taken, yields absurd
results even for a small range of extrapolation. Brauner
and Shacham [1–3] have demonstrated some of the ill
effects of including too many terms in regression
models.
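The parameter instability caused by a superfluous explanatory term is easy to reproduce numerically. The following sketch (Python/NumPy; the data are hypothetical and not taken from this work) fits a model that includes a redundant variable, nearly identical to another one, and then perturbs a single measurement by 0.001: the individual parameters swing by hundreds, while the fitted values themselves barely move.

```python
import numpy as np

# Hypothetical data: two nearly identical explanatory variables.
# x2 equals x1 except for a tiny (1e-6) difference at one point.
x1 = np.arange(10.0)
x2 = x1.copy()
x2[0] += 1e-6
A = np.column_stack([np.ones_like(x1), x1, x2])

# "True" model: y = x1 + x2, i.e. parameters (0, 1, 1).
y = x1 + x2
b, *_ = np.linalg.lstsq(A, y, rcond=None)

# Perturb a single measurement by 0.001 and refit.
y2 = y.copy()
y2[1] += 1e-3
bp, *_ = np.linalg.lstsq(A, y2, rcond=None)

print("original fit: ", np.round(b, 3))
print("perturbed fit:", np.round(bp, 3))
# The individual parameters change by hundreds, yet the fitted
# values (and the sum b1 + b2) are nearly unchanged -- only the
# split between the two redundant terms is wildly unstable.
```

Note that the quantities the data actually determine, such as the predictions and the combined coefficient b1 + b2, remain stable; it is the individual parameter values that become meaningless.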
The most frequent causes of inaccuracy and/or ill-
conditioning in regression are the following:
1. Non-optimal or inadequate model (not all the influ-
ential explanatory variables are included in the
model and/or non-influential variables are
included).
2. Excessive errors in the data (as in the presence of
outlying measurements).
3. Presence of collinearity among the explanatory
variables.
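The third cause can be quantified before any fitting is attempted, for instance via the condition number of the design matrix. The sketch below (Python/NumPy; the measurement grid is hypothetical) compares a quadratic with a 9th-degree polynomial model over the same points: the monomial terms of the high-degree model are nearly collinear over a narrow interval, which shows up as an enormous condition number.

```python
import numpy as np

# Hypothetical measurement grid; the monomials x, x^2, ..., x^9
# over a narrow interval are nearly collinear explanatory variables.
x = np.linspace(1.0, 2.0, 20)

V_quad = np.vander(x, 3)    # columns: x^2, x, 1  (quadratic model)
V_high = np.vander(x, 10)   # columns: x^9, ..., x, 1  (9th degree)

c_quad = np.linalg.cond(V_quad)
c_high = np.linalg.cond(V_high)
print(f"cond(quadratic)  = {c_quad:.2e}")
print(f"cond(9th degree) = {c_high:.2e}")
```

A large condition number means that small perturbations in the measured dependent variable translate into large changes in the estimated parameters — precisely the instability described above.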
Dedicated to Professor Em. Dr.-Ing. Dr. h.c. mult. E.-U. Schlünder on the occasion of his 70th birthday.
* Corresponding author. Tel.: +972-7-6461481; fax: +972-7-6472916.
E-mail address: shacham@bgumail.bgu.ac.il (M. Shacham).
0255-2701/99/$ - see front matter © 1999 Elsevier Science S.A. All rights reserved.
PII: S0255-2701(99)00044-6