Pattern and Knowledge Extraction using
Process Data Analytics: A Tutorial

Yiting Tsai*, Qiugang Lu*, Lee Rippon*, Siang Lim*, Aditya Tulsyan**, Bhushan Gopaluni*

* Department of Chemical and Biological Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada (e-mail: yttsai@chbe.ubc.ca; qglu@chbe.ubc.ca; leeripp@chbe.ubc.ca; siang.lim@ubc.ca; bhushan.gopaluni@ubc.ca)
** Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: tulsyan@mit.edu)
Abstract: Traditional techniques employed by control engineers require a significant update
in order to handle the increasing complexity of modern processes. Conveniently, advances in
statistical machine learning and distributed computation have led to an abundance of techniques
suitable for advanced analysis. In this tutorial we introduce data analytics techniques and discuss
their theory and application to chemical processes. Although the focus here is on theory, the
applications will be explored more extensively in a follow-up journal paper. The ultimate goal is to
familiarize control engineers with how these techniques are used to extract valuable knowledge
from raw data, which can then be utilized to make smarter process control decisions.
Keywords: Machine learning, Classification, Regression, Neural networks, Gaussian processes
1. INTRODUCTION
Process control in the chemical industry is undergoing a
dramatic revolution, as the ability to extract knowledge
from data is becoming a reality. Immense amounts of
historical data are available from advanced sensing and
actuation devices. Due to computational and theoretical
limitations, past control strategies have relied on small,
cleanly structured datasets, dismissing a significant
amount of potentially valuable information. However, the
recent development of powerful computing tools has
allowed machine learning (ML) algorithms to be fully
utilized on big data. Engineers are thus equipped with the
ability to extract knowledge from data and use it to
improve control actions.
To optimally leverage both historical data and advanced
sensor data, control engineers can employ a new set of
intuitive and user-friendly tools. In this tutorial we intro-
duce basic background knowledge behind ML and model
validation, then explore the theory behind important
classification and regression algorithms. Finally, the paper
concludes by presenting techniques for dimensionality
reduction, advanced learning and the handling of dynamic data.
2. BACKGROUND AND BASIC TERMINOLOGY
The nomenclature found in standard ML texts such as
Bishop (2006) and Murphy (2012) represents the input data
matrix as X and the output data matrix as Y. Moreover, X
contains N rows of samples and $d_x$ columns of variables.
⋆ This work was supported in part by the Institute for Computing, Information and Cognitive Systems (ICICS) at UBC.
Input and output data are divided into training, valida-
tion and testing sections. For datasets with a few hun-
dred or thousand samples, the training/validation/testing
ratio ranges between 50/25/25 and 90/5/5. However, for
datasets with more than a million entries, a ratio between
80/10/10 and 98/1/1 is preferred according to Murphy
(2012) and Goodfellow et al. (2016). The purpose of each
data division is as follows:
• Training: Mathematical mappings between the input
and output data (known as models) are formed.
• Validation: Among the models formed during training,
the one (with the combination of parameters and
hyperparameters) that results in the lowest error is
selected, e.g., using k-fold cross-validation.
• Testing: The capability of the selected model to
generalize to new samples is measured by evaluating
its prediction accuracy.
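The three-way division above can be sketched in Python using only the standard library; the 70/15/15 ratio, the helper name split_dataset, and the fixed seed are illustrative assumptions, not choices made in this paper:

```python
# A minimal sketch of a train/validation/test split. The ratio, helper
# name and seed are illustrative assumptions.
import random

def split_dataset(X, Y, ratios=(0.70, 0.15, 0.15), seed=0):
    """Shuffle paired samples, then slice into train/validation/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)      # reproducible shuffle
    n_train = int(ratios[0] * len(X))
    n_val = int(ratios[1] * len(X))
    take = lambda ids: ([X[i] for i in ids], [Y[i] for i in ids])
    train = take(idx[:n_train])
    val = take(idx[n_train:n_train + n_val])
    test = take(idx[n_train + n_val:])
    return train, val, test

# Example: 100 scalar inputs with matching outputs.
X = list(range(100))
Y = [2 * x for x in X]
train, val, test = split_dataset(X, Y)
print(len(train[0]), len(val[0]), len(test[0]))  # 70 15 15
```

Shuffling before slicing matters for process data stored in time order; otherwise the test set would contain only the most recent operating regime.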
Note that the test set cannot influence the choice of
model structure, parameters or hyperparameters (Murphy,
2012). The training, validation and testing errors are
evaluated as either
$$\text{Error Fraction} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(\hat{y}_i \neq y_i) \quad \text{(Classification)}, \tag{1}$$

$$\text{Error Rate} = \frac{1}{n}\sum_{i=1}^{n} f(\hat{y}_i - y_i) \quad \text{(Regression)}, \tag{2}$$

where $\mathbf{1}$ represents the indicator function and $\hat{y}_i$ the estimated output for the $i$-th sample using the selected model. The error is calculated as the fraction of mismatched samples in the case of classification, or as an average sum of errors in the case of regression, where $f$ is the case-
Preprints, 10th IFAC International Symposium on Advanced Control of Chemical Processes, Shenyang, Liaoning, China, July 25-27, 2018. Copyright © 2018 IFAC.
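The error metrics in Eqs. (1) and (2) can be sketched as follows; the function names and the choice of squared error for $f$ are illustrative assumptions:

```python
# A minimal sketch of Eqs. (1) and (2). The names and the squared-error
# choice of f are illustrative, not prescribed by the paper.

def error_fraction(y_hat, y):
    """Eq. (1): fraction of mismatched predictions."""
    return sum(1 for yh, yi in zip(y_hat, y) if yh != yi) / len(y)

def error_rate(y_hat, y, f=lambda e: e ** 2):
    """Eq. (2): average of f applied to the residuals; here f(e) = e^2."""
    return sum(f(yh - yi) for yh, yi in zip(y_hat, y)) / len(y)

print(error_fraction([1, 0, 1, 1], [1, 1, 1, 0]))  # 2 mismatches in 4 -> 0.5
print(error_rate([1.0, 2.0], [1.5, 1.0]))          # (0.25 + 1.0) / 2 = 0.625
```

Any penalty on the residual (absolute error, Huber loss, etc.) can be passed in as f, which is why Eq. (2) leaves it unspecified.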