Support Vector Machine and the Heuristic Method to Predict the Solubility of
Hydrocarbons in Electrolyte
Weiping Ma,† Xiaoyun Zhang,*,† Feng Luan,† Haixia Zhang,† Ruisheng Zhang,‡ Mancang Liu,†
Zhide Hu,† and B. T. Fan§
Department of Chemistry, Lanzhou University, Lanzhou 730000, China, Department of Computer Science,
Lanzhou University, Lanzhou 730000, China, and Université Paris 7-Denis Diderot, ITODYS, 1
Rue Guy de la Brosse, 75005 Paris, France
Received: January 10, 2005; In Final Form: March 3, 2005
A new method, the support vector machine (SVM), and the heuristic method (HM) were used to develop
nonlinear and linear models relating the solubility in an electrolyte containing sodium chloride to three
molecular descriptors of 217 nonelectrolytes. The molecular descriptors representing the structural features
of the compounds include two topological descriptors and one electrostatic descriptor. The three molecular
descriptors selected by HM in CODESSA were used as inputs for SVM. The results obtained by both HM and
SVM were satisfactory. The HM model gives a correlation coefficient (R) of 0.980 and a root-mean-square
(RMS) error of 0.219 for the test set. The same descriptors were also employed to build the model in pure
water, and the prediction results were consistent with the experimental solubilities. Furthermore, a
predictive correlation coefficient R = 0.988 and an RMS error of 0.170 for the test set were obtained by SVM.
The prediction results are in very good agreement with the experimental values. This paper provides a new
and effective method for predicting the solubility in electrolyte and reveals some insight into the
structural features that are related to the nonelectrolytes.
1. Introduction
It is well-known that saturated hydrocarbons are important
constituents of petroleum products. Anthropogenic activity
associated with the use of these compounds in the chemical industry
and in energy generation releases hydrocarbons into the
environment [1]. The aqueous solubility of these compounds is
an important molecular property that plays a large role in their
behavior in many areas of interest. In modeling the environmental
impact of a contaminant, the solubility, along with the soil-water
absorption coefficient, is a key term in understanding
transport mechanisms and distribution in water.
The petroleum and petrochemical industries require this information
for estimating the partition of hydrocarbons between
aqueous and organic phases [2,3] and for minimizing the presence
of hazardous solutes in aqueous effluents [4]. Environmental
chemistry and engineering also need the data for modeling
the transport and fate of hydrocarbon pollutants in the environment [5,6]
and for the remediation of sites contaminated by
petroleum spills [7,8].
The environmental risk of using these
compounds should be assessed because they are often the
most long-lived of environmental contaminants, owing to their
comparatively low biodegradability relative to oxygen- or
nitrogen-containing compounds. However, experimental solubility
data are rather
scarce for saturated hydrocarbons with 10 or more carbon atoms.
Whereas a general equation would be of greatest use, the present
study is limited to hydrocarbons. This restriction was expected to be
advantageous in obtaining a significant correlation, because
eliminating compounds that undergo specific interactions
with water, such as hydrogen bonding, simplifies the nature of
the interactions that must be accounted for. The properties of
hydrocarbons in water saturated with salt are relevant to their
behavior on contact with seawater. Given the importance of solubility, a
theoretical method for predicting it is
desired, as there are many compounds for which solubility data
simply are not available.
Quantitative structure-property relationship (QSPR) studies
have been demonstrated to be an effective computational tool
for understanding the correlation between the structure of
molecules and their properties [9-11].
In a QSPR study, one seeks
to find a mathematical relationship between the property and
one or more descriptors. Such a study can thus indicate which
structural factors may play an important role in
determining a property. Furthermore, its advantage over
other methods lies in the fact that the descriptors used can be
calculated from the structure alone and are not dependent on
any experimental properties. However, the main problems
encountered in this kind of research are still the description of
the molecular structure using appropriate molecular descriptors
and the selection of suitable modeling methods. At present, many
types of molecular descriptors, such as constitutional, topological,
geometrical, electrostatic, and quantum chemical descriptors,
have been proposed to describe the structural features of
molecules [12-14].
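As a rough illustration of the kind of relationship a QSPR study seeks, the following minimal sketch fits a linear model between a property and three descriptors by ordinary least squares and then reports the correlation coefficient R and the RMS error. It is not the heuristic method of CODESSA or the SVM used in this work; the descriptor values, the property values, and the function names (fit_linear, predict, r_and_rms) are all hypothetical and were made up for illustration.

```python
import math

def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(x) for x in X]           # prepend intercept column
    n = len(rows[0])
    # Augmented system (A^T A | A^T y)
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(n):
        p = max(range(c, n), key=lambda k: abs(M[k][c]))
        M[c], M[p] = M[p], M[c]
        for k in range(n):
            if k != c:
                f = M[k][c] / M[c][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]  # [intercept, b1, b2, b3]

def predict(coef, X):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], x)) for x in X]

def r_and_rms(y_true, y_pred):
    """Pearson correlation coefficient and root-mean-square error."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred))
    st = math.sqrt(sum((a - mt) ** 2 for a in y_true))
    sp = math.sqrt(sum((b - mp) ** 2 for b in y_pred))
    rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)
    return cov / (st * sp), rms

# Hypothetical descriptor values (three per compound); NOT data from the paper.
X = [(1.0, 2.0, 0.1), (2.0, 1.0, 0.3), (3.0, 4.0, 0.2),
     (4.0, 3.0, 0.5), (5.0, 6.0, 0.4), (6.0, 5.0, 0.7)]
# Synthetic property values: a known linear rule plus small fixed perturbations.
noise = [0.05, -0.03, 0.02, -0.04, 0.03, -0.02]
y = [1.0 - 0.5 * a + 0.2 * b + 2.0 * c + e for (a, b, c), e in zip(X, noise)]

coef = fit_linear(X, y)
r, rms = r_and_rms(y, predict(coef, X))
print("coefficients:", coef)
print("R =", r, "RMS =", rms)
```

For brevity the sketch evaluates the fit on the same synthetic compounds it was trained on, whereas the statistics quoted in the abstract refer to a separate test set.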
Matching the diversity of molecular descriptors,
many different chemometrics and chemoinformatics methods,
such as multiple linear regression (MLR), principal
component regression (PCR), partial least squares (PLS),
different types of artificial neural networks (ANN), and genetic
* Corresponding author. Tel.: +86-931-891-2578. Fax: +86-931-891-2582. E-mail address: xyzhang@lzu.edu.cn.
† Department of Chemistry, Lanzhou University.
‡ Department of Computer Science, Lanzhou University.
§ Université Paris.
J. Phys. Chem. A 2005, 109, 3485-3492
10.1021/jp0501446 CCC: $30.25 © 2005 American Chemical Society
Published on Web 03/30/2005