Pattern and Knowledge Extraction using Process Data Analytics: A Tutorial

Yiting Tsai *, Qiugang Lu *, Lee Rippon *, Siang Lim *, Aditya Tulsyan **, Bhushan Gopaluni *

* Department of Chemical and Biological Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada (e-mail: yttsai@chbe.ubc.ca; qglu@chbe.ubc.ca; leeripp@chbe.ubc.ca; siang.lim@ubc.ca; bhushan.gopaluni@ubc.ca)
** Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: tulsyan@mit.edu)

Abstract: Traditional techniques employed by control engineers require a significant update in order to handle the increasing complexity of modern processes. Conveniently, advances in statistical machine learning and distributed computation have led to an abundance of techniques suitable for advanced analysis. In this tutorial we introduce data analytics techniques and discuss their theory and application to chemical processes. Although the focus here is on theory, the applications will be explored more widely in a follow-up journal paper. The ultimate goal is to familiarize control engineers with how these techniques can be used to extract valuable knowledge from raw data, which can then be utilized to make smarter process control decisions.

Keywords: Machine learning, Classification, Regression, Neural networks, Gaussian processes

1. INTRODUCTION

Process control in the chemical industry is undergoing a dramatic revolution, as the ability to extract knowledge from data is becoming a reality. Immense amounts of historical data are available from advanced sensing and actuation devices. Due to computational and theoretical limitations, past control strategies have been limited to the use of small and cleanly structured data, dismissing a significant amount of potentially valuable information. However, the recent development of powerful computing tools has allowed machine learning (ML) algorithms to be fully utilized on big data.
Engineers are thus equipped with the ability to extract knowledge from data and use it to improve control actions. To optimally leverage both historical data and advanced sensor data, control engineers can employ a new set of intuitive and user-friendly tools. In this tutorial we introduce the basic background knowledge behind ML and model validation, then explore the theory behind important classification and regression algorithms. Finally, the paper concludes after presenting techniques for dimensionality reduction, advanced learning and the handling of dynamic data.

2. BACKGROUND AND BASIC TERMINOLOGY

The nomenclature found in standard ML texts such as Bishop (2006) and Murphy (2012) represents the input data matrix as X and the output data matrix as Y. Moreover, X contains N rows of samples and d_x columns of variables.

⋆ This work was supported in part by the Institute for Computing, Information and Cognitive Systems (ICICS) at UBC.

Input and output data are divided into training, validation and testing sections. For datasets with a few hundred or thousand samples, the training/validation/testing ratio ranges between 50/25/25 and 90/5/5. However, for datasets with more than a million entries, a ratio between 80/10/10 and 98/1/1 is preferred according to Murphy (2012) and Goodfellow et al. (2016). The purpose of each data division is as follows:

• Training: Mathematical mappings between the input and output data (known as models) are formed.
• Validation: For each model formed during training, the one (with the combination of parameters and hyperparameters) that results in the lowest error is selected, using k-fold cross-validation.
• Testing: The capability of the selected model to generalize to new samples is measured by evaluating its prediction accuracy. Note that the test set cannot influence the choice of model structure, parameters or hyperparameters (Murphy, 2012).
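The three-way split and the k-fold index generation described above can be sketched in a few lines of NumPy. The 70/15/15 ratio, the random seed and k = 5 are illustrative choices for this sketch, not values prescribed by the text:

```python
import numpy as np

def split_data(X, Y, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly partition (X, Y) into training, validation and test sets.

    `ratios` and `seed` are illustrative defaults, not from the tutorial.
    """
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], Y[train]), (X[val], Y[val]), (X[test], Y[test])

def kfold_indices(n, k=5):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation:
    each of the k folds serves as the validation set exactly once."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```

During validation, each candidate model would be refit on every `train_idx` and scored on the corresponding `val_idx`, with the averaged score used to pick the model; the held-out test set from `split_data` is touched only once, at the end.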
The training, validation and testing errors are evaluated as either

\[ \text{Error Fraction} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(\hat{y}_i \neq y_i) \quad \text{(Classification)}, \tag{1} \]

\[ \text{Error Rate} = \frac{1}{n}\sum_{i=1}^{n} f(\hat{y}_i - y_i) \quad \text{(Regression)}, \tag{2} \]

where \(\mathbb{1}\) represents the indicator function and \(\hat{y}_i\) the estimated output for the i-th sample using the selected model. The error is calculated as the fraction of mismatched samples in the case of classification, or as an average sum of errors in the case of regression, where f is the case-

Preprints, 10th IFAC International Symposium on Advanced Control of Chemical Processes, Shenyang, Liaoning, China, July 25-27, 2018. Copyright © 2018 IFAC
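As a concrete illustration (not from the paper), Eqs. (1) and (2) can be implemented directly. Since the loss f in Eq. (2) is left generic here, the sketch below assumes the common squared-error choice f(e) = e²:

```python
import numpy as np

def error_fraction(y_true, y_pred):
    """Eq. (1): fraction of misclassified samples (mean of the indicator)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true != y_pred)

def error_rate(y_true, y_pred, f=lambda e: e ** 2):
    """Eq. (2): average loss f applied to each residual.

    Squared error is an assumed default; the paper leaves f case-dependent.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(f(y_pred - y_true))
```

For example, `error_fraction([0, 1, 1, 0], [0, 1, 0, 0])` returns 0.25 (one mismatch in four samples), and with the squared-error default `error_rate` reduces to the familiar mean squared error.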