A New Efficient Approach for Variable Selection Based on Multiregression: Prediction of Gas Chromatographic Retention Times and Response Factors

Bono Lučić* and Nenad Trinajstić
The Rugjer Bošković Institute, P.O. Box 1016, HR-10001 Zagreb, Croatia

Sulev Sild,†,‡ Mati Karelson, and Alan R. Katritzky*
Center for Heterocyclic Compounds, Department of Chemistry, University of Florida, P.O. Box 117200, Gainesville, Florida 32611-7200, and Department of Chemistry, University of Tartu, Jakobi Street 2, EE 2400 Tartu, Estonia

Received November 11, 1998

The selection of the most relevant variables is a frequent problem in the analysis of chemical data, especially now, given the large amounts of data created by increased computer power and analytical resolution. A novel procedure for variable selection based on multiregression (MR) analysis is developed and applied to the quantitative structure-property relationship (QSPR) modeling of gas chromatographic retention times t_R and Dietz response factors RF of 152 diverse chemical compounds. Using 296 descriptors generated by the CODESSA program, “absolutely the best” linear MR models containing from 1 to 5 descriptors were first selected (2 × 10¹⁰ models were checked), and then “the best” linear stepwise MR models with six and seven descriptors were obtained through “i by i” stepwise selection. In this paper i was varied from 1 to 4, so that in each successive step i descriptors were added to the previously selected descriptors. Nonlinear models were developed by the inclusion of cross-products of the initial descriptors. We selected as the most important descriptors for t_R the number of C-H and C-X bonds, connectivity indices of order 3, the highest normal mode vibrational frequency, and the rotational entropy of the molecule at 300 K.
In the case of RF modeling, the most important descriptors are those related to the relative number and weight of effective C atoms, the orbital electronic population, and the bond order and valency of C and H atoms. Comparison with the best six-descriptor models obtained by the normal CODESSA procedure shows that the nonlinear seven-descriptor MR models obtained here achieve 30% (0.3520 vs 0.5032) and 12% (0.0472 vs 0.0530) lower standard errors of estimate for t_R and RF, respectively. Our novel procedure of selecting a small number of the most important descriptors from a data set allows us to extract a larger amount of useful information than the procedure implemented in CODESSA. Thus, our new procedure enables the selection of the best possible MR models from 10¹⁰ possibilities. Through the introduction of cross-product terms, we obtained nonlinear MR models that are superior to the corresponding linear models.

INTRODUCTION

The most important aim of mathematical and statistical methods in chemistry is to provide the maximum information about a selected molecular property by analyzing chemical data. The quality of a method is reflected in its ability to extract the most relevant information starting from a standard data set. In quantitative structure-property or structure-activity modeling on a given set of molecules, a large number of descriptors is sometimes produced in the first step. Nowadays, computer programs are available that can generate several hundred descriptors for modestly sized molecules. Among them, the ADAPT 1-7 and CODESSA programs 8-19 are often used. The use of these programs in structure-property modeling very often results in more descriptors than there are molecules in the data set. A much studied problem is how many descriptors should be used in the final model.
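As an illustration only, the selection scheme summarized in the abstract (an exhaustive search over small models, followed by an “i by i” stepwise extension and optional cross-product terms) might be sketched as below. This is not the authors' implementation: the function names, the use of NumPy least squares, the choice of the standard error of estimate as the ranking criterion, the exclusion of squared terms from the cross-products, and the synthetic data are all assumptions made for this sketch.

```python
from itertools import combinations
from math import comb

import numpy as np


def n_models(n_desc, max_size):
    """Number of distinct models with 1..max_size descriptors out of n_desc."""
    return sum(comb(n_desc, k) for k in range(1, max_size + 1))


def std_error(X, y):
    """Standard error of estimate of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = len(y) - A.shape[1]  # observations minus fitted parameters
    return float(np.sqrt(resid @ resid / dof))


def best_subset(X, y, k):
    """Exhaustive search: evaluate every k-descriptor model, keep the best."""
    return min(combinations(range(X.shape[1]), k),
               key=lambda cols: std_error(X[:, list(cols)], y))


def extend_i_by_i(X, y, selected, i):
    """One 'i by i' step: add the best group of i new descriptors
    to an already selected (and kept fixed) tuple of descriptors."""
    rest = [j for j in range(X.shape[1]) if j not in selected]
    extra = min(combinations(rest, i),
                key=lambda cols: std_error(X[:, list(selected + cols)], y))
    return selected + extra


def with_cross_products(X):
    """Append pairwise products x_a * x_b (a < b) as extra columns.
    Assumption: squared terms are left out."""
    n = X.shape[1]
    prods = [X[:, a] * X[:, b] for a in range(n) for b in range(a + 1, n)]
    return np.column_stack([X] + prods)


# Synthetic stand-in data (hypothetical): 40 "molecules", 8 "descriptors".
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.01 * rng.normal(size=40)

best2 = best_subset(X, y, 2)           # exhaustive two-descriptor search
best4 = extend_i_by_i(X, y, best2, 2)  # then add two descriptors at once
print(n_models(296, 5))                # models with 1 to 5 of 296 descriptors
print(best2, best4)
```

The count returned by `n_models(296, 5)` is on the order of the 2 × 10¹⁰ models quoted in the abstract, which shows why a brute-force loop like `best_subset` is only practical for very small models and why the larger models are built stepwise from the descriptors already selected.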
20 A related question (which stems from Ockham’s Razor; 21 to prefer the model realized with the fewest descriptors, other things being equal) is how to select a small number of the most important descriptors from a large data set. Generally, the selection of a small number of important descriptors from a large initial set can be carried out by (i) selecting the relatively small number of descriptors that contain the maximum retention time mapping information (for such a method we use the term “inductive”), (ii) reducing the total pool of descriptors by removing those that contain the maximum amount of redundant information (the “deductive” method), or (iii) using a combination of the deductive and inductive methods. Most popular program packages solve the problem of selecting the most important descriptors inductively, in a stepwise manner, selecting one descriptor at a time (“one by one” stepwise selection). 8 However, several recently published algorithms for variable selection are essentially deductive 22,23 or utilize a combination of the deductive and inductive approaches. 8,24,25 One of these is CODESSA. 8

† University of Florida. ‡ University of Tartu.

This program can generate a large number