QSARINS: software for the development, analysis and validation of MLR models, and QSARINS-Chem: Insubria datasets and QSA(P)R models for environmental pollutants

Paola Gramatica, Nicola Chirico, Alessandro Sangion and Stefano Cassani
QSAR Research Unit in Environmental Chemistry and Ecotoxicology, Department of Theoretical and Applied Sciences, University of Insubria, Varese, Italy.
Contact: paola.gramatica@uninsubria.it, Web: www.qsar.it

REFERENCES
[1] P. Gramatica, N. Chirico, S. Kovarich, S. Cassani, E. Papa. J. Comput. Chem. 2013, 34, 2121-2132.
[2] R. L. Haupt, S. E. Haupt. Practical Genetic Algorithms, 2nd ed.; Wiley-Interscience, 2004.
[3] N. Chirico, P. Gramatica. J. Chem. Inf. Model. a) 2011, 51, 2320-2335; b) 2012, 52, 2044-2058.
[4] P. Gramatica, P. Pilutti, E. Papa. J. Chem. Inf. Comput. Sci. 2004, 44, 1794-1802.
[5] H. Zhu, A. Tropsha, D. Fourches, A. Varnek, E. Papa, P. Gramatica, T. Öberg, P. Dao, A. Cherkasov, I. V. Tetko. J. Chem. Inf. Model. 2008, 48, 766-784.
[6] H. R. Keller, D. L. Massart, J. P. Brands. Chemom. Intell. Lab. Syst. 1991, 11, 175.
[7] P. Gramatica, S. Cassani, N. Chirico. J. Comput. Chem. 2014, 35, 1036-1044.
[8] E. Papa, P. Gramatica. Green Chem. 2010, 12, 836-843. Selected as Hot Article.

This work was supported by the European Union through the project CADASTER FP7-ENV-2007-1-212668.

1 - DATA IMPORT and ANALYSIS

2 - DATA SETUP AND DESCRIPTOR SELECTION
- All subset models: the models corresponding to all combinations of descriptors are calculated.
- Genetic Algorithm (GA) [2]: applied in QSARINS to develop models without exploring all the possible combinations of descriptors. Models are selected by maximizing (or minimizing) a chosen fitness function. The GA can be tuned by varying the population size, the mutation rate and the number of generations.

3 - MODEL ANALYSIS AND VALIDATION
A population of several models, with similarly good performance and based on different descriptors, can be obtained by GA.
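The GA-based descriptor selection described in section 2 can be illustrated with a minimal sketch. It is not the QSARINS implementation: the fitness function (leave-one-out Q²), the crossover and mutation operators, and all names (`solve`, `ols_fit`, `q2_loo`, `ga_select`) are assumptions chosen for illustration. Only the tuning parameters (population size, mutation rate, number of generations) are taken from the text.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-10:
            raise ValueError("singular system (collinear descriptors?)")
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ols_fit(rows, y):
    """OLS coefficients (intercept first) from the normal equations X'X b = X'y."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

def q2_loo(X, y, cols):
    """Leave-one-out Q2 = 1 - PRESS/TSS for the descriptor subset `cols`."""
    sub = [[row[c] for c in cols] for row in X]
    y_mean = sum(y) / len(y)
    tss = sum((v - y_mean) ** 2 for v in y)
    press = 0.0
    for i in range(len(y)):
        try:
            b = ols_fit(sub[:i] + sub[i + 1:], y[:i] + y[i + 1:])
        except ValueError:
            return float("-inf")  # model cannot be fitted: worst possible fitness
        pred = b[0] + sum(bj * v for bj, v in zip(b[1:], sub[i]))
        press += (y[i] - pred) ** 2
    return 1.0 - press / tss

def ga_select(X, y, max_size=2, pop_size=12, generations=20,
              mutation_rate=0.2, seed=0):
    """Return (Q2_LOO, descriptor indices) pairs for the final population, best first."""
    rng = random.Random(seed)
    n_desc = len(X[0])
    cache = {}

    def fitness(cols):
        if cols not in cache:
            cache[cols] = q2_loo(X, y, cols)
        return cache[cols]

    def random_model():
        k = rng.randint(1, max_size)
        return tuple(sorted(rng.sample(range(n_desc), k)))

    population = [random_model() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: max(2, pop_size // 2)]  # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))  # crossover: draw from both parents
            child = set(rng.sample(pool, rng.randint(1, min(max_size, len(pool)))))
            if rng.random() < mutation_rate:  # mutation: swap one descriptor
                child.discard(rng.choice(sorted(child)))
                child.add(rng.randrange(n_desc))
            children.append(tuple(sorted(child)))
        population = parents + children
    ranked = sorted(set(population), key=fitness, reverse=True)
    return [(fitness(m), m) for m in ranked]
```

Calling `ga_select(X, y)` with `X` as a list of descriptor rows and `y` as a list of responses returns the distinct models of the final population, which mirrors the idea of obtaining a population of models rather than a single one. The sketch assumes `max_size` does not exceed the number of descriptors and `pop_size` is at least 2.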
In QSARINS you can:
- Visualize all the models or only a selected portion
- Sort models according to: fitting, robustness, stability, lowest correlation among descriptors, highest correlation with the response, RMSE, Q²LOO and external predictivity indices [3a,b]
- Assess the relative importance of the modeling descriptors by counting their occurrences among all models

One of the main aims of QSAR practice is to obtain a model able to make good predictions for new chemicals, including those not yet synthesized (chemical design). In QSARINS the user can:
- Store any MLR-OLS model (also developed with other software) for its validation and later application
- Calculate predictions for new chemicals, automatically applying the model

- Import the matrix containing the studied chemicals, the corresponding descriptor values and the experimental responses
- Descriptors can be pre-filtered (e.g. by pair-wise correlation), also taking into account a pre-existing splitting of the molecules

QSARINS [1] WORKFLOW
- Data import
  - Pre-filtering by: constant values, pair-wise correlation
- Data analysis and setup
  - Dataset profiles
  - Structural PCA
  - Data selection
  - Data splitting: random, ordered by response or by structure (PCA)
- QSAR model
  - Descriptor selection: all subset, Genetic Algorithm (GA), filtering rules (QUIK)
- Model analysis and validation
  - Performance analysis (R², Q²LOO, RMSE, Q²LMO, Y-scr)
  - Ranking of the models
- Graphical inspection
  - Scatter plots, Williams plot and Insubria graph for AD, validation plots
- External validation and model selection
  - Validation by different criteria (Q²Fn, CCC, etc.) on external prediction set
  - Multi-Criteria Decision Making (MCDM)
- Consensus modeling
  - Multiple selection of models by PCA based on residuals or descriptors

ABSTRACT
QSA(P)R models, when correctly developed and rigorously validated, are highly useful for screening and prioritizing chemicals without experimental data, even before their synthesis, in line with the “benign by design” approach of green chemistry. Their use in regulation is strongly encouraged by REACH, the European legislation on chemicals, in particular to reduce animal testing and experimental costs. Recently, particular attention has been devoted to the validation of QSAR models, and the “OECD principles for the validation of QSAR models for their application in regulation” have been established to increase the reliability of data predicted by QSAR models. We here propose the new software QSARINS (QSAR-INSUBRIA) for the development of Multiple Linear Regression (MLR) models by Ordinary Least Squares (OLS), with a Genetic Algorithm (GA) for variable selection. The program focuses mainly on internal and external validation of models by different statistical parameters, and is a user-friendly platform for QSAR modeling in agreement with the OECD Principles and for the analysis of the reliability of predictions. Additional features include tools for explorative analysis of the datasets by Principal Component Analysis (PCA), dataset splitting, Applicability Domain analysis (e.g., detection of outliers and of interpolated or extrapolated predictions), consensus modelling, selection of the best model by Multi-Criteria Decision Making (MCDM) and various informative plots.
QSARINS-Chem, a specific module of QSARINS, includes several datasets of environmental pollutants with their chemical structures (HyperChem and MDL MOL formats) and the corresponding endpoints (physico-chemical properties and biological activities), modeled by the Insubria group during the last fifteen years. The chemicals and the related available data can be accessed in different ways (by CAS RN, SMILES, name, etc.) and their 3D structures can be visualized. Additionally, some QSAR models based on 0-2D molecular descriptors calculated by the free, open-source software PaDEL-Descriptor are implemented in QSARINS-Chem. Among them is the Insubria Persistent, Bioaccumulative and Toxic (PBT) Index model for predicting the cumulative behavior of new chemicals as PBTs. The new PaDEL-Descriptor models can easily be applied to chemicals without experimental data, checking the Applicability Domain. The QMRF of each of these PaDEL-Descriptor models is available. QSARINS-Chem can also be used as a management tool for personal datasets and models, and additional chemometric analyses can be done by PCA and MCDM for screening and ranking chemicals in order to prioritize the most dangerous.

Data can be inspected through various graphs that display, for example, the variable distribution in the dataset, the distribution of the compounds for each descriptor and the correlation among data. Principal Component Analysis (PCA) can also be applied to the dataset (SCORE and LOADINGS plots).
PCA SCORE PLOT: to check for clusters of molecules and/or to detect strong structural outliers

- Selection of the molecular descriptors to be included in the variable selection procedure
- Selection of the response (endpoint) to be modeled
- Selection of the status of the molecules under study (training, prediction, excluded)

Graphical inspection of the model calculations
- Plot of Predicted vs. Experimental responses

Graphical inspection of the Applicability Domain of the models
- Williams Plot: plot of std. residuals vs. hat diagonal values (for chemicals with exp. data)
- Insubria Graph: plot of predicted values vs. hat diagonal values (for chemicals without exp. data)

Check of model robustness and “not-by-chance” correlation
- Q²LMO plot
- Y-Scrambling plot

4 - DATABASE OF CHEMICALS AND MODELS / QSARINS-Chem
A module called QSARINS-Chem [7] contains Insubria datasets and new QSAR/QSPR models for environmental pollutants.
- 46 datasets: 2780 minimized chemicals (AM1 method, HyperChem software) with CAS, name, SMILES and the related 6067 experimental data
- 23 QSAR and QSPR models developed with descriptors freely calculated in PaDEL-Descriptor software; among these, there is also the PBT Index model [8]
- 3D visualization of chemicals
- Search the chemicals by:
  -CAS Registry Number
  -SMILES
  -Empirical formula
  -Name
- Available experimental data

The PBT Index QSAR model, developed within our research group [8], predicts the potential PBT (Persistent, Bioaccumulative and Toxic) cumulative behavior of organic chemicals solely on the basis of their molecular structure.
23 available QSAR/QSPR models, with their QMRF
[Figure labels: PBT / Not PBT]
Ranking tools based on PCA and MCDM are also included in the QSARINS-Chem module, in order to highlight the most hazardous chemicals.

CONSENSUS MODELING
It is possible to generate a consensus model from a list of chosen models, in order to improve the accuracy of predictions [4,5].
[Figure: models M1, M2 and M3 combined into a Consensus Model by averaging predictions]
[Figure annotation: Kxy is lost]

It is essential to check various aspects of model performance: fitting, stability by cross-validation, the absence of chance correlation, and the capability to predict new chemicals (external validation). These are measured by the appropriate formulas/methods of the various validation criteria [3a,b].
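As an illustration of the external validation criteria named on this poster (Q²Fn and CCC), the sketch below uses their commonly published definitions. The function names and the toy numbers are hypothetical and the code is not taken from QSARINS. `y_train` holds the training responses, `y_ext` the experimental responses of the prediction set and `y_pred` the corresponding predictions.

```python
def _mean(values):
    return sum(values) / len(values)

def _press(y_ext, y_pred):
    """Sum of squared prediction errors on the external set."""
    return sum((o - p) ** 2 for o, p in zip(y_ext, y_pred))

def q2_f1(y_train, y_ext, y_pred):
    """Q2_F1: external deviations are taken from the training-set mean."""
    m = _mean(y_train)
    return 1.0 - _press(y_ext, y_pred) / sum((o - m) ** 2 for o in y_ext)

def q2_f2(y_ext, y_pred):
    """Q2_F2: external deviations are taken from the external-set mean."""
    m = _mean(y_ext)
    return 1.0 - _press(y_ext, y_pred) / sum((o - m) ** 2 for o in y_ext)

def q2_f3(y_train, y_ext, y_pred):
    """Q2_F3: mean squared external error scaled by the training-set variance."""
    m = _mean(y_train)
    tss_train = sum((o - m) ** 2 for o in y_train)
    return 1.0 - (_press(y_ext, y_pred) / len(y_ext)) / (tss_train / len(y_train))

def ccc(y_obs, y_pred):
    """Concordance correlation coefficient between observed and predicted values."""
    mo, mp = _mean(y_obs), _mean(y_pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    so = sum((o - mo) ** 2 for o in y_obs)
    sp = sum((p - mp) ** 2 for p in y_pred)
    return 2.0 * cov / (so + sp + len(y_obs) * (mo - mp) ** 2)

# Toy example (made-up numbers, for illustration only)
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_ext = [2.0, 4.0]
y_pred = [2.5, 3.5]
scores = {
    "Q2F1": q2_f1(y_train, y_ext, y_pred),
    "Q2F2": q2_f2(y_ext, y_pred),
    "Q2F3": q2_f3(y_train, y_ext, y_pred),
    "CCC": ccc(y_ext, y_pred),
}
```

In this toy case the training and external means coincide, so Q²F1 and Q²F2 give the same value; they differ whenever the two sets are centered differently, which is one reason for inspecting several criteria rather than a single one.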
It is possible to apply Multi-Criteria Decision Making (MCDM) [6] to rank the “best” models selected by the user, exploiting at the same time all the available information regarding the model validation.
The selection of the most diverse models to be averaged can be done by the PCA of the residuals in prediction [4] or of the modeling descriptors.
[Figure: MCDM-ext vs MCDM-fit plot]

MP 112
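The consensus approach mentioned on this poster (averaging the predictions of several chosen models) can be sketched as follows. This is a plain arithmetic mean written for illustration. The function name and the numbers are hypothetical, and QSARINS may offer other averaging schemes.

```python
def consensus_predict(predictions_by_model):
    """Average, compound by compound, the predictions of several models."""
    n = len(predictions_by_model[0])
    if any(len(p) != n for p in predictions_by_model):
        raise ValueError("all models must predict the same compounds")
    k = len(predictions_by_model)
    return [sum(p[i] for p in predictions_by_model) / k for i in range(n)]

# Predictions of three hypothetical models (M1, M2, M3) for three compounds
m1 = [1.0, 2.0, 3.0]
m2 = [2.0, 2.0, 4.0]
m3 = [3.0, 2.0, 2.0]
consensus = consensus_predict([m1, m2, m3])  # [2.0, 2.0, 3.0]
```

Averaging is most useful when the individual models make different errors, which is why the models to be combined are chosen to be as diverse as possible.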