JIAO Long LEI Bin QU Le LI Rui,c YAN Chun-Hu LI Hong
a (College of Chemistry and Chemical Engineering, Xi’an Shiyou University, Xi’an 710065, China)
b (Shaanxi Cooperative Innovation Center of Unconventional Oil and Gas Exploration and Development, Xi’an Shiyou University, Xi’an 710065, China)
c (College of Petroleum Engineering, Xi’an Shiyou University, Xi’an 710065, China)
ABSTRACT Hologram quantitative structure-activity relationship (HQSAR) was combined with consensus modeling to build quantitative structure-property relationship (QSPR) models for calculating the aqueous hydroxyl radical oxidation reaction rate constants (kOH) of organic micropollutants (OMPs). First, individual HQSAR models were established using the standard HQSAR method. The optimal individual HQSAR model was obtained with fragment distinction set to “B” and fragment size set to “3~6”. Second, a consensus HQSAR model was established by regressing kOH on the hologram descriptors with the consensus partial least-squares (cPLS) approach. The individual and consensus HQSAR models were validated with a randomly selected external test set. The external validation shows that both the individual and the consensus HQSAR models can be used to predict the kOH of OMPs. Compared with the optimal individual HQSAR model, the consensus HQSAR model shows higher prediction accuracy and robustness. The combination of HQSAR and consensus modeling is therefore a practicable and promising method for studying and predicting the kOH of OMPs.
Keywords: QSPR, hologram QSAR, consensus modeling, organic micropollutants, hydroxyl radical, rate constant; DOI: 10.14102/j.cnki.0254–5861.2011–3083
Hologram quantitative structure-activity relationship (HQSAR) is an efficient quantitative structure-property relationship (QSPR) technique that uses a specialized fragment fingerprint, known as the molecular hologram (MH), as the structural descriptor for building a QSPR model. Because an HQSAR model is easier to build than many other QSPR models while offering comparable prediction accuracy, it has been successfully applied to QSPR studies in biology[1,2], pharmacology[3-6], chemistry[7,8], environmental science[9,10], etc. Traditionally, the HQSAR method builds individual regression models between molecular properties and hologram descriptors by partial least-squares (PLS) regression. Individual regression models, however, are prone to underfitting or overfitting[11]. Consensus modeling can largely overcome this shortcoming by integrating several individual models[12-14], which improves the predictive accuracy and robustness of the regression. As a powerful and reliable modeling strategy, consensus modeling has been successfully applied in many research fields, such as QSPR modeling, spectral analysis, machine learning and artificial intelligence[15-17]. It is therefore worthwhile to introduce consensus modeling into HQSAR modeling in order to build more accurate and robust models.
Organic micropollutants (OMPs), a group of compounds covering a wide array of physical-chemical properties, have been identified as emerging contaminants because of their possible threats to ecological environments. In recent decades, contamination of surface water by OMPs has received increasing scientific and public attention[18]. OMPs are characterized by low concentration and high toxicity, and can cause direct or potential harm to aquatic ecosystems and human health. Therefore, technology is needed to remove these pollutants from wastewater effluents before the wastewater is discharged to the environment. Recently, ozone has been used to treat OMPs in wastewater. According to Hoigne and Bader[19], ozone reacts with organic pollutants in water in two ways: (1) direct reactions; (2) indirect reactions of the hydroxyl radicals produced by ozone decomposition. The rate constants of the direct reactions can be easily determined by experiments[20]. However, owing to the complexity of the analytical methods, experimentally determining the aqueous oxidation reaction rate constants of hydroxyl radicals with OMPs is time-consuming, costly and difficult[21-25]. Hence, the QSPR method has been extensively used to predict the hydroxyl radical rate constants of contaminants by relating this property to their molecular structures[26-30]. Several 2D-QSPR models based on quantum chemical or topological descriptors have been proposed for the hydroxyl radical rate constant[31-33]. However, the modeling processes of these models are time-consuming and complex, and improving their accuracy and robustness remains worthwhile.
Thus, in this work, a QSPR model for the aqueous hydroxyl radical oxidation reaction rate constants (kOH) of OMPs was studied on the basis of HQSAR and consensus modeling.
The experimental aqueous hydroxyl radical oxidation reaction rate constants of the 83 investigated OMPs were collected from reference[34]. The 83 OMPs were randomly divided into a training set and a test set at a ratio of 2:1. The training set, used to establish and optimize the HQSAR model, includes 55 samples. The test set, used to assess the prediction performance of the developed QSPR models, comprises the other 28 samples.
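A random 2:1 partition of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `split_two_to_one`, the use of integer sample indices, and the fixed seed are all hypothetical, and the actual random assignment used in this work is not reproduced.

```python
import random

def split_two_to_one(sample_ids, seed=0):
    """Randomly split samples into a larger and a smaller set at a 2:1 ratio.

    Hypothetical helper: the seed and the rounding rule are assumptions.
    """
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_large = round(len(shuffled) * 2 / 3)
    return shuffled[:n_large], shuffled[n_large:]

# 83 compounds, identified here only by their index 0..82
train_ids, test_ids = split_two_to_one(range(83))
print(len(train_ids), len(test_ids))  # 55 28
```

Applying the same helper to the 55 training samples yields subsets of 37 and 18 samples.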
All computations were carried out on a personal computer with an i5-4258U processor and 4 GB of RAM. The computations related to HQSAR modeling were performed in SYBYL-X 2.0 software (Certara, U.S.). Other computations were performed with a program developed by our research team.
HQSAR is a 2.5D-QSPR approach proposed by Hurst et al.[41,42] which combines the advantages of both 2D-QSPR and 3D-QSPR methods. Its notable advantage is that it can rapidly and automatically process large data sets with high prediction accuracy and statistical quality. Unlike 3D-QSPR methods, HQSAR requires neither conformation optimization nor molecular alignment. In essence, HQSAR is a combination of molecular hologram descriptors and PLS regression.
MH is an extended form of the molecular fingerprint, a fragment-based descriptor that translates chemical structure representations into binary bit strings. MH can encode more structural information than a traditional 2D molecular fingerprint, such as stereochemistry and branched and cyclic fragments. All possible molecular fragments within a molecule, including linear, branched, cyclic, and overlapping features, can be contained in MH. Rather than a bit string, MH is actually an array containing counts of molecular fragments. In MH, the molecular fragments are described with Sybyl Line Notation (SLN), a specification for explicitly characterizing molecular fragments, structures, structural libraries, reactions, formulations, and molecular and reaction queries by using short ASCII strings.
Two parameters, fragment distinction and fragment size, are used to set the type and length of MH descriptors. The fragment distinction parameter defines the type of fragments, including atoms (A), bonds (B), connections (C), hydrogen atoms (H), chirality (Ch), and donor and acceptor atoms (DA)[43,44]. Different types of fragments can be combined. For example, the default setting of fragment distinction in SYBYL is “A/B/C”. The fragment size parameter specifies the length of fragments. All possible fragments containing S atoms are generated[45,46]. Here, S is an integer between M and N. The value of M should be larger than 2 and smaller than N. The value of N is usually larger than 12 and does not exceed the number of atoms in the molecule. This parameter is set to “4~7” by default in SYBYL.
After fragment distinction and fragment size have been set, each fragment is assigned a unique integer in the range of 0~2^31 using a cyclic redundancy check (CRC) algorithm[47]. Each integer corresponds to a bin in an integer array of fixed length L, which is the length of MH. In the HQSAR module of SYBYL, L is usually one of the 12 prime numbers ranging from 53 to 401. The initial settings of L are 97, 151, 199, 257, 307 and 353. Many entries of a molecular bit-string fingerprint are “0”, which usually carry no useful information. In the subsequent PLS modeling step, the computation time may increase dramatically with fingerprint length. More importantly, these null values may hinder the follow-up computation of the PLS model. Therefore, an effective method is needed to reduce the length of the fingerprint. This reduction is achieved through a process called “hashing”, which allocates multiple fragments to the same location in a fingerprint[48].
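The mapping from fragments to hologram bins can be sketched as follows. This is a minimal illustration, not SYBYL's implementation. It enumerates only linear fragments, distinguishes them by atom type alone, and uses Python's `zlib.crc32` as a stand-in for the CRC algorithm. The function names `linear_fragments` and `hologram`, the dash-joined fragment labels, and the 1-butanol example are assumptions introduced here.

```python
import zlib

def linear_fragments(atoms, bonds, m, n):
    """Enumerate linear (path) fragments containing m..n atoms, with m >= 2.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Simplification (assumption): branched and cyclic fragments are ignored.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    found = []

    def extend(path):
        # Every path is reached from both ends; keep it once (start < end).
        if m <= len(path) <= n and path[0] < path[-1]:
            forward = "-".join(atoms[k] for k in path)
            backward = "-".join(atoms[k] for k in reversed(path))
            found.append(min(forward, backward))  # direction-independent label
        if len(path) == n:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return found

def hologram(fragments, length):
    """Hash fragment labels into a count array of the given length."""
    bins = [0] * length
    for frag in fragments:
        # CRC-32 masked to 31 bits gives an integer in 0 .. 2**31 - 1
        frag_id = zlib.crc32(frag.encode("ascii")) & 0x7FFFFFFF
        bins[frag_id % length] += 1   # several fragments may share a bin
    return bins

# Heavy-atom skeleton of 1-butanol: C-C-C-C-O
atoms = ["C", "C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
frags = linear_fragments(atoms, bonds, m=3, n=4)
holo = hologram(frags, length=97)
print(frags)
print(sum(holo), "fragment counts stored in", len(holo), "bins")
```

Because L is much smaller than the number of possible fragment integers, different fragments can fall into the same bin. Hashing accepts these collisions in exchange for a short descriptor.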
The idea behind consensus modeling is to build a series of models, namely member models, on different training subsets, each consisting of samples randomly selected from one training set, and to combine the eligible member models according to the consensus rules. A consensus model always contains member models with different prediction characteristics. The most significant advantage of consensus modeling is that it resists underfitting and overfitting to a certain extent, and can thus improve the robustness and predictability of a regression model. The flow chart of consensus modeling is shown in Fig.1.
Fig.1. Flow chart of consensus modeling
Krogh and Vedelsby[49] proposed the prediction error decomposition theory of consensus models and expressed the theory as follows:
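In its commonly cited form, the decomposition separates the error of the combined model into the average member error and the average disagreement among members. The symbols below are chosen here for illustration and are assumptions: $f_i$ denotes the $i$-th member model, $w_i$ its weight, $\bar{f}$ the consensus prediction and $y$ the true value.

```latex
\bar{f}(x) = \sum_i w_i f_i(x), \qquad w_i \ge 0, \qquad \sum_i w_i = 1

\bigl(\bar{f}(x) - y\bigr)^2
  = \sum_i w_i \bigl(f_i(x) - y\bigr)^2
  - \sum_i w_i \bigl(f_i(x) - \bar{f}(x)\bigr)^2

E = \bar{E} - \bar{A}
```

Here $E$ is the error of the consensus model, $\bar{E}$ the weighted average error of the member models, and $\bar{A}$ the weighted average ambiguity, i.e. the spread of the member predictions around the consensus prediction. Since $\bar{A} \ge 0$, the consensus error never exceeds the average member error, and the gain grows with the diversity of the member models.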
Consensus partial least squares (cPLS) is a commonly used consensus modeling method[12,14,50,51]. Its basic idea is to perturb the training set by random sampling, establish a series of individual PLS models, and select appropriate member models from these individual PLS models to jointly predict unknown samples. The main steps of cPLS are:
(1) Setting the training subset and inspection set;
(2) Setting the total number of individual PLS models;
(3) Building the individual PLS models with the training subset, and predicting the inspection set with the obtained individual PLS models;
(4) Determining whether the individual PLS model established in step (3) could be accepted as member model of the consensus model, according to the prediction results of inspection set;
(5) Repeating steps (2)~(4) to find enough eligible member models;
(6) Combining the prediction results of all the member models according to the fusion criterion, such as calculating the mean value, to build the cPLS model.
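The steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the program used in this work. The PLS member model is a textbook single-response NIPALS implementation, and each member is trained on a random 80% draw from the training subset. A member is accepted when its inspection-set RMSE does not exceed the median over all candidates, and the fusion rule is the plain mean. The acceptance rule, the sampling fraction, all function names and the synthetic data are hypothetical.

```python
import math
import random
import statistics

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                # centred y
    components = []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:              # no covariance left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # deflate X and y before extracting the next component
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components

def pls1_predict(model, x):
    """Predict one sample by replaying the deflation steps of the fit."""
    x_mean, y_mean, components = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * lj for ej, lj in zip(e, load)]
    return y_hat

def rmse(model, X, y):
    sq = sum((pls1_predict(model, x) - v) ** 2 for x, v in zip(X, y))
    return math.sqrt(sq / len(y))

def cpls_fit(X_sub, y_sub, X_insp, y_insp, n_models=50, n_components=2,
             sample_fraction=0.8, seed=0):
    """Steps (1)-(5): build candidate PLS models and keep the eligible ones."""
    rng = random.Random(seed)
    n_draw = min(len(X_sub), max(n_components + 1,
                                 int(sample_fraction * len(X_sub))))
    candidates = []
    for _ in range(n_models):
        idx = rng.sample(range(len(X_sub)), n_draw)   # perturb the training subset
        model = pls1_fit([X_sub[i] for i in idx], [y_sub[i] for i in idx],
                         n_components)
        candidates.append((rmse(model, X_insp, y_insp), model))
    # Assumed acceptance rule: inspection RMSE no worse than the median.
    cutoff = statistics.median([err for err, _ in candidates])
    return [model for err, model in candidates if err <= cutoff]

def cpls_predict(members, x):
    """Step (6): fuse the member predictions by their mean."""
    return sum(pls1_predict(m, x) for m in members) / len(members)

# Toy demonstration on synthetic data (not the kOH data set).
rng = random.Random(42)
X = [[rng.uniform(0, 5) for _ in range(4)] for _ in range(55)]
y = [0.8 * a - 0.3 * b + 0.1 * c + rng.gauss(0, 0.05) for a, b, c, _ in X]
members = cpls_fit(X[:37], y[:37], X[37:], y[37:])
print(len(members), "member models; prediction:", cpls_predict(members, X[0]))
```

In the toy demonstration, the first 37 samples play the role of the training subset and the remaining 18 that of the inspection set. Other acceptance thresholds or weighted fusion rules could be substituted without changing the overall structure.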
Table 1. Statistics of the kOH Models with Different Fragment Distinctions
Table 2. Statistical Results of the kOH Models with Different Fragment Sizes
Table 3. Statistics of the Individual and Consensus HQSAR Model
Fig.2. Predicted kOH versus experimental kOH: (a) individual HQSAR model, (b) consensus HQSAR model. “▲” indicates the samples of the training set, and “▼” those of the test set
In this section, cPLS was employed to build the consensus HQSAR model. Correspondingly, the 55 samples of the training set were randomly divided into a training subset and an inspection subset at a ratio of 2:1. The training subset, used to build the regression models, comprises 37 samples, and the inspection subset, used to optimize the number of member models, includes the remaining 18 samples.
The QSPR models for predicting the aqueous oxidation reaction rate constants of organic micropollutants with hydroxyl radical were successfully established by using the HQSAR approach combined with consensus modeling. The external test set validation indicates that both the individual HQSAR model and the consensus HQSAR model are practicable for describing the quantitative relationship between the structural information and kOH of the investigated organic micropollutants. Compared with the individual HQSAR model, the consensus HQSAR model has higher prediction accuracy and robustness. Consensus HQSAR modeling is thus a practicable and promising approach for improving the accuracy and robustness of HQSAR models, and the established consensus HQSAR model is an easy-to-use and accurate model for studying and predicting the aqueous kOH of organic micropollutant oxidation reactions.