Chemical Engineering and Processing 38 (1999) 477–486

Considering precision of experimental data in construction of optimal regression models

Mordechai Shacham a,*, Neima Brauner b

a Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
b School of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel

Received 26 March 1999; accepted 12 April 1999

Abstract

Construction of optimal (stable and of the highest possible accuracy) regression models comprising linear combinations of independent variables and their non-linear functions is considered. It is shown that estimates of the experimental error, which are most often available to engineers and experimental scientists, are useful for identifying the set of variables to be included in an optimal regression model. Two diagnostic indicators, which are based on experimental error estimates, are incorporated in an orthogonalized-variable-based stepwise regression (SROV) procedure. The use of this procedure, followed by regression diagnostics, is demonstrated in two examples. In the first example, a stable polynomial model for heat capacity is obtained, which is ten times more accurate than the correlation published in the literature. In the second example, it is shown that omission of important variables related to reaction conditions prevents reliable modeling of the product properties. © 1999 Elsevier Science S.A. All rights reserved.

Keywords: Collinearity; Data; Noise; Precision; Stepwise regression

www.elsevier.com/locate/cep

1. Introduction

Obtaining experimental data is often very expensive and time consuming. However, the accuracy and reliability of process-related calculations critically depend on the accuracy, validity and stability of the regression models fitted to the experimental data.

Regression models used for physico-chemical, thermodynamic or rate data can be partially theory based or completely empirical.
In both cases, it is not known a priori how many explanatory variables (independent variables and/or their functions) and parameters should be included in the model for obtaining an optimal regression model. An insufficient number of explanatory variables results in an inaccurate model characterized by a large variance, and independent variables which may have critical effects on the dependent variable under certain circumstances may be omitted. On the other hand, the inclusion of too many explanatory terms renders an unstable model. The instability is characterized by typical ill effects, whereby adding or removing an experimental point from the data set may drastically change the parameter values. Also, the derivatives of the dependent variable are not represented correctly, and extrapolation outside the region where the measurements were taken yields absurd results even for a small range of extrapolation. Brauner and Shacham [1–3] have demonstrated some of the ill effects of including too many terms in regression models.

The most frequent causes of inaccuracy and/or ill-conditioning in regression are the following:
1. A non-optimal or inadequate model (not all the influential explanatory variables are included in the model, and/or non-influential variables are included).
2. Excessive errors in the data (as in the presence of outlying measurements).
3. Presence of collinearity among the explanatory variables.

Dedicated to Professor Em. Dr.-Ing. Dr. h.c. mult. E.-U. Schlünder on the occasion of his 70th birthday.

* Corresponding author. Tel.: +972-7-6461481; fax: +972-7-6472916. E-mail address: shacham@bgumail.bgu.ac.il (M. Shacham)
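The instability of an over-parameterized model can be illustrated with a minimal numerical sketch (illustrative only; the data, polynomial degrees, and NumPy-based fitting below are our assumptions, not the paper's SROV procedure or examples). Dropping a single point from a set of ten measurements barely moves the coefficients of a parsimonious quadratic fit, but can change the coefficients of a near-interpolating high-degree fit drastically:

```python
import numpy as np

# Hypothetical "measurements": a smooth function plus small experimental noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.exp(x) + rng.normal(0.0, 1e-3, x.size)

def coeff_shift(deg: int) -> float:
    """Relative change in fitted polynomial coefficients when one
    experimental point is removed from the data set."""
    full = np.polyfit(x, y, deg)
    dropped = np.polyfit(np.delete(x, 5), np.delete(y, 5), deg)
    return float(np.linalg.norm(full - dropped) / np.linalg.norm(full))

stable = coeff_shift(2)    # parsimonious model: coefficients barely move
unstable = coeff_shift(8)  # over-parameterized model: coefficients swing wildly
print(f"degree 2: {stable:.2e}   degree 8: {unstable:.2e}")
```

The sketch shows only the parameter-instability symptom described above; the extrapolation and derivative problems of such over-parameterized fits are equally severe.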