Journal of Chromatography A, 988 (2003) 261–276
www.elsevier.com/locate/chroma

Classification and regression tree analysis for molecular descriptor selection and retention prediction in chromatographic quantitative structure–retention relationship studies

R. Put (a), C. Perrin (a, b), F. Questier (a), D. Coomans (a, c), D.L. Massart (a), Y. Vander Heyden (a, *)

(a) ChemoAC, Department of Pharmaceutical and Biomedical Analysis, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium
(b) Laboratoire de Chimie Analytique, Faculté de Pharmacie, Université Montpellier 1, 15 avenue Charles Flahault, BP 14 491, 34093 Montpellier Cedex 5, France
(c) Statistics and Intelligent Data Analysis Group, School of Mathematical and Physical Sciences, James Cook University, Townsville Q4814, Australia

Received 11 November 2002; received in revised form 20 December 2002; accepted 20 December 2002

Abstract

The use of the classification and regression tree (CART) methodology was studied in a quantitative structure–retention relationship (QSRR) context on a data set consisting of the retentions of 83 structurally diverse drugs on a Unisphere PBD column, using isocratic elutions at pH 11.7. The response (dependent variable) in the tree models consisted of the predicted retention factor (log k_w) of the solutes, while a set of 266 molecular descriptors was used as explanatory variables in the tree building. Molecular descriptors related to the hydrophobicity (log P and Hy) and the size (TPC) of the molecules were selected out of these 266 descriptors in order to describe and predict retention. Besides the above mentioned, CART was also able to select hydrogen-bonding and molecular complexity descriptors. Since these variables are expected from QSRR knowledge, it demonstrates the potential of CART as a methodology to understand retention in chromatographic systems.
The potential of CART to predict retention and thus occasionally to select an appropriate system for a given mixture was also evaluated. Reasonably good prediction, i.e. only 9% serious misclassification, was observed. Moreover, some of the misclassifications probably are inherent to the data set applied.

© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Molecular descriptors; Retention prediction; Regression analysis; Structure–retention relationships

*Corresponding author. Tel.: +32-2-477-4723; fax: +32-2-477-4735. E-mail address: yvanvdh@fabi.vub.ac.be (Y. Vander Heyden).

0021-9673/03/$ – see front matter. doi:10.1016/S0021-9673(03)00004-9

1. Introduction

High-performance liquid chromatography (HPLC) is the most widely used separation technique in pharmaceutical analysis. Its ability to analyse a wide polarity range of acidic, basic and neutral compounds, and its high separative capabilities combined with automation, make HPLC the most efficient technique for the analytical characterisation of the continuously growing number of samples, produced at the different stages of drug development [1]. Related to the application of combinatorial