Quantitative Structure-Activity Relationships: Linear Regression Modelling and Validation Strategies by Example

Sorana D. BOLBOACĂ *
* "Iuliu Haţieganu" University of Medicine and Pharmacy Cluj-Napoca, Department of Medical Informatics and Biostatistics, 6 Louis Pasteur, 400349 Cluj-Napoca. E-mail: sbolboaca@umfcluj.ro

Lorentz Jäntschi **
** Technical University of Cluj-Napoca, Department of Physics and Chemistry, 103-105 Muncii Bvd., 400641 Cluj-Napoca, Romania. E-mail: lorentz.jantschi@gmail.com

Abstract: Quantitative structure-activity relationships are mathematical models built on the hypothesis that the structure of chemical compounds is related to their biological activity. A linear regression model is often used to estimate and predict the nature of the relationship between a measured activity and a set of measured or calculated descriptors. Linear regression helps to answer three main questions: does the biological activity depend on structural information; if so, is the relationship linear; and if yes, how well does the model predict the biological activity of new compounds? This manuscript presents the steps of linear regression analysis, moving from theoretical knowledge to an example conducted on sets of endocrine disrupting chemicals.

Keywords: robust regression; validation; diagnostic; predictive power; quantitative structure-activity relationships (QSARs)

I. BRIEF HISTORY OF LINEAR REGRESSION

Linear regression analysis is used in life science research to describe the strength of the association between an outcome and factors of interest, to adjust data for covariates or confounders, to identify predictors (factors that affect the outcome), and/or to predict the outcome [1]. Sir Francis Galton can be considered to have provided the initial inspiration that led to correlation and regression.
The fundamentals of correlation were discussed by Bravais [2], who presented the correlation of two and three variables. Galton improved the notation by introducing the "Galton function" for the correlation coefficient (r); this function can be found in Bravais' work, but not as a single symbol. Edgeworth indicated in 1892 how to extend Bravais' method to higher degrees of correlation [3] and expressed his results in terms of "Galton's function". Galton used regression to understand heredity and suggested a slope of 0.33 to describe the relationship between extremely large or small mother pea seeds and their less extreme daughter seeds [4], [5]. Galton seems to have built regression analysis on the work of Adolphe Quetelet, who is known as the first scientist to apply statistical methods to humans in a systematic way [6]. Furthermore, Quetelet showed normal distributions in diverse aggregated data [6]. Galton was able to fit all data to a single line, and he abbreviated the slope of this line as "r" [7]; this symbol was later used to stand for the correlation coefficient [8]. Pearson demonstrated in 1896 that the optimum values of the slope and the correlation coefficient could be calculated from the product-moment [8]. At the same time, George Yule refined regression analysis [9], [10], [11], solving his regression problem by minimizing the sum of squared errors [9], [10], a method first presented by Legendre in 1805 [12].

II. LINEAR REGRESSION ON QSAR ANALYSIS

Quantitative structure-activity relationships (QSARs) are mathematical models that link chemical structure and pharmacological activity/property in a quantitative manner for a series of compounds [13]. The approach is based on the assumption that the structure of a chemical compound (its geometric, topological, steric, and electronic properties, etc.) contains features responsible for its physical, chemical, and/or biological properties [14].
This assumption can be summarized as "similar compounds have similar properties" [15]. The two main fields where linear regression analysis has found applicability are drug discovery [16], [17] and toxicology prediction [18], [19]. In both of these fields, linear regression is used mainly to predict rather than to estimate (the model is used to quickly determine the