Support Vector Machine and the Heuristic Method to Predict the Solubility of Hydrocarbons in Electrolyte

Weiping Ma,† Xiaoyun Zhang,* Feng Luan,† Haixia Zhang,† Ruisheng Zhang,‡ Mancang Liu,† Zhide Hu,† and B. T. Fan§

Department of Chemistry, Lanzhou University, Lanzhou 730000, China, Department of Computer Science, Lanzhou University, Lanzhou 730000, China, and Université Paris 7-Denis Diderot, ITODYS, 1 Rue Guy de la Brosse, 75005 Paris, France

Received: January 10, 2005; In Final Form: March 3, 2005

A new method, the support vector machine (SVM), and the heuristic method (HM) were used to develop nonlinear and linear models, respectively, between the solubility in electrolyte containing sodium chloride and three molecular descriptors of 217 nonelectrolytes. The molecular descriptors representing the structural features of the compounds include two topological descriptors and one electrostatic descriptor. The three molecular descriptors selected by HM in CODESSA were used as inputs for SVM. The results obtained by both HM and SVM were satisfactory. The HM model gave a correlation coefficient (R) of 0.980 and a root-mean-square (RMS) error of 0.219 for the test set. The same descriptors were also employed to build a model in pure water, and the prediction results were consistent with the experimental solubilities. Furthermore, a predictive correlation coefficient R = 0.988 and an RMS error of 0.170 for the test set were obtained by SVM. The prediction results are in very good agreement with the experimental values. This paper provides a new and effective method for predicting the solubility in electrolyte and gives some insight into the structural features that are related to the solubility of the nonelectrolytes.

1. Introduction

It is well-known that saturated hydrocarbons are important constituents of petroleum products. Anthropogenic activity associated with the use of these compounds in the chemical industry and in energy generation releases hydrocarbons into the environment.
1 The aqueous solubility of these compounds is an important molecular property, playing a large role in the behavior of compounds in many areas of interest. In modeling the environmental impact of a contaminant, along with the soil-water absorption coefficient, the solubility is a key term in the understanding of transport mechanisms and distribution in water. The petroleum and petrochemical industries require this information for estimating the partition of hydrocarbons between aqueous and organic phases 2,3 and for minimizing the presence of hazardous solutes in aqueous effluents. 4 Environmental chemistry and engineering also need the data for modeling the transport and fate of hydrocarbon pollutants in the environment 5,6 and for the remediation of sites contaminated by petroleum spills. 7,8 The environmental risk of using these compounds should be assessed because they are often the most long-lived environmental contaminants, owing to their comparatively low biodegradability relative to oxygen- or nitrogen-containing compounds. However, experimental solubility data are rather scarce for saturated hydrocarbons with 10 or more carbon atoms. Whereas a general equation would be of greatest use, the present study is limited to hydrocarbons, which was expected to be advantageous in obtaining a significant correlation: eliminating compounds that undergo specific interactions with water, such as hydrogen bonding, simplifies the nature of the interactions that must be accounted for. Knowledge of the properties of hydrocarbons in water saturated with salt is useful when they come into contact with seawater. Given the importance of solubility, a theoretical method for predicting it is desirable, as there are many compounds for which the solubility is simply not available.
* Corresponding author. Tel.: +86-931-891-2578. Fax: +86-931-891-2582. E-mail address: xyzhang@lzu.edu.cn.
† Department of Chemistry, Lanzhou University.
‡ Department of Computer Science, Lanzhou University.
§ Université Paris.
J. Phys. Chem. A 2005, 109, 3485-3492. DOI: 10.1021/jp0501446. © 2005 American Chemical Society. Published on Web 03/30/2005.

Quantitative structure-property relationship (QSPR) studies have been demonstrated to be an effective computational tool for understanding the correlation between the structure of molecules and their properties. 9-11 In a QSPR study, one seeks to find a mathematical relationship between the property and one or more descriptors. Thus, such a study can indicate which structural factors may play an important role in determining a property. Furthermore, its advantage over other methods lies in the fact that the descriptors used can be calculated from the structure alone and are not dependent on any experimental properties. However, the main problems encountered in this kind of research are still the description of the molecular structure using appropriate molecular descriptors and the selection of suitable modeling methods. At present, many types of molecular descriptors, such as constitutional, topological, geometrical, electrostatic, and quantum chemical descriptors, have been proposed to describe the structural features of molecules. 12-14 Matching the diversity of molecular descriptors, many different chemometrics and chemoinformatics methods, such as multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS), different types of artificial neural networks (ANN), and genetic