Application of ALOGPS 2.1 to Predict log D Distribution Coefficient for Pfizer Proprietary Compounds

Igor V. Tetko* and Gennadiy I. Poda

Biomedical Department, Institute of Bioorganic and Petroleum Chemistry, Ukrainian Academy of Sciences, Murmanskaya 1, Kyiv, 02094, Ukraine, and Structural and Computational Chemistry, Pfizer Global Research and Development, 700 Chesterfield Parkway West, Chesterfield, Missouri 63017

Received June 22, 2004

Abstract: Evaluation of the ALOGPS, ACD Labs LogD, and PALLAS PrologD suites for calculating the log D distribution coefficient gave a high root-mean-squared error (RMSE) of 1.0-1.5 log units for two in-house Pfizer log D data sets of 17 861 and 640 compounds. Inaccuracy in log P prediction was the limiting factor for the overall log D estimation by these algorithms. The self-learning feature of ALOGPS (LIBRARY mode) remarkably improved the accuracy of log D prediction, and an RMSE of 0.64-0.65 was calculated for both data sets.

Oral bioavailability of chemicals is a very important pharmacokinetic parameter in drug development. To reach the target enzyme in the human body, drugs have to cross barriers by passive diffusion or carrier-mediated uptake. The 1-octanol-water partition coefficient, log P, is well-known as one of the principal parameters for estimating the lipophilicity (or solubility in lipids) of chemical compounds and, to a large degree, determines their pharmacokinetic properties. The log P is also one of the standard properties identified by Lipinski in the “rule of 5” for druglike molecules. 1 By definition, log P refers to neutral molecules. If a molecule contains basic or acidic groups, it becomes ionized and its distribution between octanol and water becomes pH-dependent.
The pH-dependent distribution coefficient, log D, was shown to correlate with a number of biological parameters, such as the effective permeability in human jejunum, 2 blood-brain barrier (BBB) permeability, 3 plasma protein binding, 4 CYP450 oxidation, 5 and volume of distribution ($V_D$). 6,7 To be absorbed by passive diffusion through the gut wall, oral drugs should have a lipophilicity within a given range (usually between 1 and 4 on the log D scale). Both coefficients, log P and log D, are very important parameters in drug development, 8 and thus there is a need to develop new methods to calculate them accurately from chemical structures.

Currently, publicly available experimental log P data cover tens of thousands of compounds. 9 These resources have stimulated the development of a number of programs to calculate log P. 10-15 The problem of predicting log D is more complicated. As a rule, it is computed from log P and pKa using eq 1, which assumes that only the neutral form partitions into the organic phase. 12,16 In eq 1, $\Delta_i = \{1, -1\}$ for acids and bases, respectively. If several groups can be ionized, the equation is modified accordingly to incorporate correction terms for all of them. Thus, log D prediction can accumulate the errors of both the log P and pKa predictions. Development of computational approaches is further complicated by the absence of large, publicly available data sets with experimental log D values. As a result, only a few programs are available to estimate log D. 12 A recent evaluation of two commercial programs found a root-mean-squared error (RMSE) of 1.4-1.9 log units for a data set of about 20 000 compounds, 17 which is not accurate enough for practical use. Therefore, large pharmaceutical companies such as Pfizer and AstraZeneca have established their own techniques to determine log D experimentally for their proprietary compounds.
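As an illustration of eq 1, the following minimal Python sketch computes log D for a molecule with a single ionizable group. The function name `log_d` and its parameters are assumptions made for this example and are not part of ALOGPS, ACD Labs LogD, or PALLAS PrologD; the sign convention (Δ = 1 for acids, -1 for bases) follows the text, and molecules with several ionizable groups would need additional correction terms that are not shown here.

```python
import math


def log_d(log_p, pka, ph=7.4, acid=True):
    """Estimate log D from log P and pKa for ONE ionizable group (eq 1).

    Assumes that only the neutral form partitions into octanol.
    `acid=True` uses delta = 1; `acid=False` (a base) uses delta = -1.
    """
    delta = 1.0 if acid else -1.0
    # Correction term: log10 of (1 + ratio of ionized to neutral form)
    ionization_penalty = math.log10(1.0 + 10.0 ** ((ph - pka) * delta))
    return log_p - ionization_penalty


# Hypothetical inputs, for illustration only (not data from the study)
acid_example = log_d(log_p=3.0, pka=4.4, ph=7.4, acid=True)
base_example = log_d(log_p=3.0, pka=9.4, ph=7.4, acid=False)
print(acid_example, base_example)
```

Because the correction term is subtracted from log P, any error in the predicted log P carries over directly into log D, while an error in pKa matters most when the group is substantially ionized at the pH of interest.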
The ALOGPS program 18-20 (http://www.vcclab.org) was developed using the associative neural network (ASNN) method. 21,22 In the so-called LIBRARY mode (hereafter LIBRARY), the ASNN makes it possible to add new data to the memory of the neural networks without retraining the networks themselves. 19 The LIBRARY dramatically improved the ALOGPS log P predictions for in-house data sets from BASF, 21 Pfizer, 23 and AstraZeneca. 24,25 The current study demonstrates that ALOGPS is also able to reliably predict the pH-dependent distribution coefficient, log D.

The octanol-water partition data used in this study were collected at two Pfizer sites and form two data sets. The first data set included 669 legacy Pharmacia compounds with log D values measured by a medium-throughput method using a nitrogen detector (called the NlogD set). A typical experimental error in log D measurements is about 0.3-0.5 log units. The second data set (ElogD set) included 18 889 compounds measured using the ElogD method. 26,27 An inspection of the compounds indicated that the two sets did not overlap. For compounds that had multiple measurements, average values were used. Also, because the ALOGPS method does not take into account stereoselectivity, average values were used for stereoisomers. After removal of structural duplicates and stereoisomers, the numbers of compounds decreased to 640 and 17 861 for the NlogD and ElogD data sets, respectively. For comparison, ACD Labs LogD v.7.19 28 and PALLAS PrologD software 29 were used to calculate log D values at pH 7.4 for the ElogD and NlogD data sets.

The stand-alone graphical-interface versions of ALOGPS and ASNN were used to analyze the compounds using three protocols. In the first protocol, the ALOGPS program was used “as is” to make blind predictions for the molecules in each data set.
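The data preparation and the error measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the study: `structure_key` is a hypothetical stereo-free identifier (for example, a canonical structure string with stereo labels removed), so that replicate measurements and stereoisomers fall into the same group, and both function names are invented for this example.

```python
import math
from collections import defaultdict
from statistics import mean


def average_by_structure(records):
    """Average log D over records sharing a (hypothetical) structure key.

    `records` is an iterable of (structure_key, measured_log_d) pairs.
    Replicates and stereoisomers are assumed to share the same key.
    """
    grouped = defaultdict(list)
    for structure_key, value in records:
        grouped[structure_key].append(value)
    return {key: mean(values) for key, values in grouped.items()}


def rmse(predicted, measured):
    """Root-mean-squared error between paired predictions and measurements."""
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(mean(squared_errors))


# Hypothetical values, for illustration only (not data from the study)
raw = [("cmpd_A", 1.0), ("cmpd_A", 2.0), ("cmpd_B", 3.5)]
averaged = average_by_structure(raw)
blind_predictions = {"cmpd_A": 2.0, "cmpd_B": 3.0}
keys = sorted(averaged)
error = rmse([blind_predictions[k] for k in keys], [averaged[k] for k in keys])
print(averaged, error)
```

Averaging before evaluation means each unique structure contributes one value to the RMSE, which matches the reduced compound counts reported for the two data sets.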
\[ \log D(\mathrm{pH}) = \log P - \log\left(1 + 10^{(\mathrm{pH} - \mathrm{p}K_a)\Delta_i}\right) \qquad (1) \]

* To whom correspondence should be addressed. Address: Institute for Bioinformatics GSF, Forschungszentrum für Umwelt und Gesundheit, GmbH, Ingolstädter Landstrasse 1, D-85764 Neuherberg, Germany. Phone: +49-89-3187-3575. Fax: +49-89-3187-3585. E-mail: itetko@vcclab.org.
Institute of Bioorganic and Petroleum Chemistry.
Pfizer Global Research and Development.

J. Med. Chem. 2004, 47, 5601-5604. 10.1021/jm049509l CCC: $27.50 © 2004 American Chemical Society. Published on Web 10/05/2004

In the second protocol, the self-learning feature implemented as a “LIBRARY” mode of ALOGPS 2.1 was