Zhu JIANG*, Hui-yan WANG, Wen-wu SONG
School of Energy and Environment, Xihua University, Chengdu 610039, P. R. China
Discharge estimation based on machine learning
To overcome the limitations of traditional stage-discharge models in describing the dynamic characteristics of a river, a machine learning method of non-parametric regression, the locally weighted regression method, was used to estimate discharge. To improve the precision and efficiency of river discharge estimation, a novel machine learning method is proposed: the clustering-tree weighted regression method. First, the training instances are clustered. Second, the k-nearest neighbor method is used to assign new stage samples to the best-fit cluster. Finally, the daily discharge is estimated. In the estimation process, the interference of irrelevant information can be avoided, so that the precision and efficiency of daily discharge estimation are improved. Observed data from the Luding Hydrological Station were used for testing. The simulation results demonstrate that the precision of this method is high. This provides a new effective method for discharge estimation.
stage-discharge relationship; discharge estimation; locally weighted regression; clustering-tree weighted regression; k-nearest neighbor method
The commonly used discharge measurement technology is complicated and expensive, and discharge cannot be observed continuously. On the other hand, it is relatively easy to observe the average stage continuously. Therefore, discharge is usually calculated on the basis of stage and the stage-discharge relationship. The stage-discharge relationship curve describes the relationship between the water stage of a cross-section and the discharge passing through the cross-section. In the process of planning, design, and construction of hydraulic engineering projects, the estimation and prediction of stage and discharge have always been considered an important subject (Lu 2006; Feng and Chen 2004; Feng et al. 1996).
The precision of hydrological computation, forecast, and analysis depends on the quality of discharge data. For this reason, the stage-discharge relationship model has always received attention. A given stage or discharge is not completely reliable due to the influence of fluctuating backwater, flood fluctuations, and other factors. Errors in individual data points may be very large. A common method of seeking a pattern in noisy data is to establish relational expressions among them (Dai et al. 2010), which can adequately reflect the stage-discharge relationships in natural rivers. Scholars have proposed many traditional stage-discharge models, such as the exponential type and polynomial type (French et al. 1992).
Propagation of flow becomes more and more complex with the degree of randomness. Although it can be described by deterministic equations in specific conditions, a river is a nonlinear, strongly correlated, and highly complicated dynamic system, due to the combined effects of precipitation and human activities (Behzad et al. 2009). Therefore, traditional stage-discharge models may not be suitable for representing modern flow characteristics. In recent years, research on new modeling methods based on new scientific theories has become one of the important aspects of estimation and prediction. As a non-parametric learning method, machine learning does not require an explicit assumption that defines a complete objective function over the entire sample space. Instead, it can locally approximate each sample by building different objective functions (Zhu 2002). Therefore, a machine learning method of non-parametric regression, the locally weighted regression method, is used to establish the relationship between the discharge and stage, and then the discharge can be estimated. In order to improve the precision of estimation, a novel algorithm is also proposed: the clustering-tree weighted regression method. After the training instances are clustered, the k-nearest neighbor method is used to assign new stage samples (for testing) to the best-fit cluster. Following this, the daily discharge is estimated. Finally, the observed data are used to compare the locally weighted regression method with the clustering-tree weighted regression method.
2.1 Stage-discharge relationships
Stage-discharge relationships refer to the empirical relationships between the discharge $y$ passing through a cross-section and the corresponding stage $x$. Different scholars have proposed different expressions for stage-discharge relationships. In general, the relationships have two types. One is the power law type (French et al. 1992):

$$y = a x^{b} \qquad (1)$$

where $a$ is a coefficient, and $b$ is an exponent.
The linear relationship between the logarithms of discharge and average stage can be obtained through natural logarithm transformation of Eq. (1):

$$\ln y = \ln a + b \ln x \qquad (2)$$
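The log transformation of the power law $y = a x^{b}$ turns the fit into a straight-line regression on $(\ln x, \ln y)$. The following is a minimal sketch of that idea using only the Python standard library. The function name `fit_power_law` and the synthetic values are illustrative assumptions and are not taken from the Luding data.

```python
import math

def fit_power_law(stages, discharges):
    """Fit y = a * x**b by ordinary least squares on (ln x, ln y)."""
    lx = [math.log(x) for x in stages]
    ly = [math.log(y) for y in discharges]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                       # slope of the log-log line is the exponent b
    a = math.exp(mean_y - b * mean_x)   # intercept of the log-log line is ln a
    return a, b

# Synthetic, noise-free data generated from y = 2.5 * x**1.8 (illustrative values only)
stages = [1.0, 1.5, 2.0, 3.0, 4.0]
discharges = [2.5 * x ** 1.8 for x in stages]
a, b = fit_power_law(stages, discharges)
```

With noise-free input the fit recovers the generating coefficient and exponent. With real observations, least squares in log space minimizes relative rather than absolute discharge errors.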
The other stage-discharge relational expression has the commonly used polynomial form:

$$y = c_0 + c_1 x + c_2 x^{2} + \cdots + c_p x^{p} \qquad (3)$$

where $c_0, c_1, c_2, \ldots, c_p$ are undetermined coefficients, $x^{q}$ ($q = 1, 2, \ldots, p$) is the $q$th power of the value of the stage minus a constant, and $p$ is the highest order of the polynomial.
The polynomial form has been widely used, because the fitted curve is consistent with hydrological features at most observation stations (Dai et al. 2010).
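As an illustration of how the undetermined coefficients of the polynomial form can be obtained, the sketch below solves the least-squares normal equations with Gaussian elimination in plain Python. The helper names `fit_polynomial` and `evaluate_polynomial` are hypothetical, the data are synthetic, and in practice a numerical library routine would normally be preferred.

```python
def fit_polynomial(xs, ys, p):
    """Least-squares coefficients c[0..p] of y = c0 + c1*x + ... + cp*x**p."""
    n = p + 1
    # Normal equations A c = rhs with A[r][q] = sum(x**(r+q)), rhs[r] = sum(y * x**r)
    A = [[sum(x ** (r + q) for x in xs) for q in range(n)] for r in range(n)]
    rhs = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for q in range(col, n):
                A[r][q] -= factor * A[col][q]
            rhs[r] -= factor * rhs[col]
    # Back substitution
    c = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(A[r][q] * c[q] for q in range(r + 1, n))
        c[r] = (rhs[r] - tail) / A[r][r]
    return c

def evaluate_polynomial(c, x):
    """Evaluate c0 + c1*x + ... + cp*x**p with Horner's rule."""
    result = 0.0
    for coeff in reversed(c):
        result = result * x + coeff
    return result

# Synthetic data from y = 1 + 2x + 0.5x^2, where x is the stage minus a constant
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]
coeffs = fit_polynomial(xs, ys, p=2)
```

Subtracting a constant from the stage before fitting, as the definition of $x$ suggests, keeps the powers of $x$ small and the normal equations better conditioned.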
For these reasons, machine learning is used to establish the non-parametric relationships between the discharge and stage. In this study the discharge was estimated to provide reliable data for hydrological calculation, prediction, and analysis.
2.2 Locally weighted regression
Locally weighted regression (LWR) is a non-parametric regression algorithm, first proposed by Cleveland (1979). Cleveland and Devlin (1988) extended it to multi-variable cases. The basic idea of this method is to use weighted least squares to locally fit a polynomial function at each point in the independent variable space, and to use this polynomial function as the estimate of the regression function at this point. LWR thus calculates the relationships between variables in an open, data-driven way rather than through a ready-made formula. The fitted curve can properly present small changes in the relationship between different variables (Cleveland 1979).
When we analyze the relationship between discharge and stage, the LWR method can be achieved through the following steps:
(1) A space including water stage points is formed. The width of the space is described by

$$z = l m \qquad (4)$$

where $z$ is the number of observation parameters participating in the process of the regression, $l$ is the proportion of the number of observation parameters taking part in the regression to the total number of the observation parameters, and $m$ is the number of observational data.
(2) The weights of all stage points in the space are defined. The weight of any point is the height of a weight function at this point. The weight function of locally weighted regression has many types. The choice of the weight function does not affect the calculation precision materially, so a common weight function is selected:
where u is a weight function variable. The weight for a point (yi, xi) is
(3) A polynomial is fitted to each point in an independent variable space using the weighted least squares algorithm.
(4) The estimated value of discharge is acquired.
The process of the LWR method is completed after the four procedures, and the estimated value of discharge can be obtained at last.
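The four steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight function is assumed to be the tricube function commonly used with LWR (Cleveland 1979), the local polynomial is assumed to be first order, and the names `tricube` and `lwr_estimate` are hypothetical. The window size follows step (1), with the product of `l` and `m` rounded up to an integer.

```python
import math

def tricube(u):
    """Assumed weight function: (1 - |u|^3)^3 for |u| < 1, zero otherwise."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def lwr_estimate(x0, stages, discharges, l=0.5):
    """Estimate discharge at stage x0 with a locally weighted linear fit."""
    m = len(stages)
    z = max(2, min(m, math.ceil(l * m)))        # step (1): points in the local window
    dists = sorted(abs(x - x0) for x in stages)
    h = dists[z - 1] or 1e-12                   # window half-width (z-th nearest point)
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(stages, discharges):
        d = x - x0
        w = tricube(d / h)                      # step (2): weight of each stage point
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    if s0 == 0.0:
        # Every window point lies exactly on the window edge: use their plain mean.
        nearest = sorted(zip(stages, discharges), key=lambda p: abs(p[0] - x0))[:z]
        return sum(y for _, y in nearest) / z
    denom = s0 * s2 - s1 * s1
    if denom <= 1e-12:
        return t0 / s0                          # degenerate window: weighted mean
    # steps (3)-(4): weighted least squares line y = b0 + b1 * (x - x0); estimate is b0
    return (t0 * s2 - s1 * t1) / denom

# Synthetic linear data (illustrative values only)
stages = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
discharges = [3.0 * x + 1.0 for x in stages]
estimate = lwr_estimate(2.5, stages, discharges, l=0.5)
```

Because the line is fitted in the shifted variable $x - x_0$, the intercept of the local fit is directly the discharge estimate at $x_0$.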
2.3 Clustering-tree weighted regression
Cluster analysis is a set of methodologies for automatic classification of samples into several groups using a measure of association, so that the samples in the same group are similar, while samples in different groups are quite discrepant from one another (Castro et al. 2004). The essence of the clustering process is an optimization, namely, helping the objective function of the system to reach a minimum value using a rapid algorithm. The clustering process is mainly focused on dividing large numbers of samples into several classes on the basis of similarities; in the meantime, the prediction of a specific sample in an unknown domain is not involved. Therefore, it is not restrained by people’s prior knowledge, and the original information of the data collection can be obtained eventually.
The clustering methods mainly include the division method, hierarchical method, density-based method, and grid-based method (Shi et al. 2007). The basic idea of the hierarchical clustering method (Mitchell 2003) is to determine two classes that are most alike by establishing and updating the distance coefficient matrix (or a similar coefficient matrix) gradually, and to merge the two classes into one class. Since the entire process can be expressed as a binary hierarchical tree, the hierarchical clustering method is also a clustering-tree method, with advantages such as clarity in the polymerization process, a high degree of visualization, independence from the initial arrangement of samples, and stability in the clustering results. Nevertheless, it is undeniable that this method also suffers from difficulty in class reconstruction.
Aimed at improving the precision and efficiency of discharge estimation, this paper proposes a novel machine learning method, the clustering-tree weighted regression method. The specific process of estimating the stage-discharge relationships using this algorithm can be summarized as follows:
where $l_{ij}$ is the distance from the point $s_j$ to the point $r_{fi}$, and $s_j$ represents the $j$th cluster center of $G_f$.
Once the objective function is determined, the following tasks can be completed: (a) $k$ cluster centers are randomly selected from $n$ training instances. (b) The nearest cluster center to each point $r_{fi}$ (river stage and discharge) is sought, and then $r_{fi}$ is put in the cluster. (c) The objective function $E$ is computed. If the value of $E$ does not change, the results of the clustering are stable. In that case, the process should proceed to step (e), or else to step (d). (d) The best clustering center is generated by a fixed $E$, and then the process returns to step (b). (e) The distance coefficient matrix is established and gradually updated, every two nodes with maximum similarity are determined, and then they are merged into a new node until only one node is left.
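Tasks (a) to (d) can be sketched as the loop below. Two assumptions are made loudly here: the objective function $E$ is taken to be the sum of squared Euclidean distances from each instance to its cluster center, and the center update in (d) is taken to be the mean of the cluster members. The hierarchical merging of task (e) is not covered. The function names and the sample points are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cluster_instances(points, k, seed=0, max_iter=100, tol=1e-12):
    """Iterate tasks (a)-(d) until the objective E stops changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # (a) random initial centers
    labels, prev_e = [], None
    for _ in range(max_iter):
        # (b) put each instance in the cluster of its nearest center
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # (c) objective E (assumed form: sum of squared distances to the centers)
        e = sum(sq_dist(p, centers[c]) for p, c in zip(points, labels))
        if prev_e is not None and abs(prev_e - e) <= tol:
            break                                   # E unchanged: clustering is stable
        prev_e = e
        # (d) regenerate each center (assumed: mean of the cluster members)
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels

# Illustrative (stage, discharge) instances forming two well-separated groups
points = [(1.0, 10.0), (1.1, 12.0), (0.9, 11.0),
          (3.0, 100.0), (3.1, 104.0), (2.9, 102.0)]
centers, labels = cluster_instances(points, k=2)
```

Since stage and discharge have very different magnitudes, scaling both attributes before clustering may be worth considering, although the source does not say whether this was done.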
Step 2: The new stage samples are assigned to the most appropriate cluster using the k-nearest neighbor method. The value of $k$ can be determined by several experiments, and the value that gives the minimum classification error rate is selected. The k-nearest neighbor method determines the $k$ samples which are the closest to the unknown sample in the multi-dimensional space, and it judges the category of the unknown instance according to the characteristics of these $k$ samples (Behzad et al. 2009). When a new stage sample is given, the k-nearest neighbor classifier uses the Euclidean distance to search for the $k$ stage samples closest to the new sample from the clusters. The Euclidean distance between two stage samples $x_i$ and $x_j$ is defined as follows:

$$d(x_i, x_j) = \sqrt{\sum_{r} \left[ a_r(x_i) - a_r(x_j) \right]^{2}}$$

where $a_r(x_i)$ is the value of the $r$th attribute of the stage sample $x_i$, and the sum runs over all attributes.
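A minimal sketch of the assignment in Step 2 is given below: the Euclidean distance is computed over all attributes, the $k$ closest training samples are found, and the new sample takes the cluster held by the majority of them. The names `euclidean` and `assign_cluster`, the majority-vote rule, and the sample values are assumptions for illustration.

```python
import math
from collections import Counter

def euclidean(xi, xj):
    """Euclidean distance over all attributes of two samples (tuples)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def assign_cluster(new_sample, train_samples, train_labels, k=3):
    """Return the cluster label held by the majority of the k nearest samples."""
    ranked = sorted(range(len(train_samples)),
                    key=lambda i: euclidean(new_sample, train_samples[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Illustrative one-attribute stage samples with known cluster labels
train_samples = [(1.0,), (1.1,), (0.9,), (3.0,), (3.1,), (2.9,)]
train_labels = [0, 0, 0, 1, 1, 1]
low_stage_cluster = assign_cluster((1.2,), train_samples, train_labels, k=3)
high_stage_cluster = assign_cluster((2.8,), train_samples, train_labels, k=3)
```

An odd value of `k` avoids ties when there are two clusters. With more clusters, ties are broken here by whichever label was counted first.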
Step 3: After each new stage sample is assigned to a cluster, the river discharge is estimated using the LWR method (which is trained with data only from the relevant cluster).
3.1 Experimental data
The Dadu River is the largest tributary of the Minjiang River. It is very important to understand its hydrology in order to fully realize the whole benefit of the Dadu River Basin cascade development. Determination of the stage-discharge model is the first task of hydrological data compilation. The Luding Hydrological Station is one of the national basic stations, and it is located on the main stream of the Dadu River. The daily mean stage and discharge data at the Luding hydrological station from 2007 to 2010 were used, as shown in Fig. 1. The data in the first three years were used as the training samples, and the data in 2010 were used as the testing samples.
3.2 Estimation of discharge for stage-discharge model
When using the clustering-tree weighted regression method, we first need to determine $k$ cluster centers. The process of determining the value of $k$ is as follows: First, the hierarchical clustering method is used to get a pedigree chart (dendrogram). Second, the value of $k$ is roughly determined from the chart. Finally, a few values near the $k$ value are tested to select the best one. In this study, the value of $k$ is equal to 4. Four cluster centers were selected from the training samples through simulation experiments, as shown in Fig. 2.
Fig. 1 Observed daily stage and discharge data
Fig. 2 Four cluster centers
The performance of each approach is assessed using the root mean square error (RMSE):
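$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_{e,i} - y_{o,i} \right)^{2}}$$

where subscripts e and o denote estimated and observed data, respectively, $y$ is the daily discharge, and $N$ is the number of testing samples.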
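For reference, the RMSE between estimated and observed discharge series can be computed as in the short sketch below. The function name `rmse` and the sample values are illustrative assumptions and do not come from the Luding data.

```python
import math

def rmse(estimated, observed):
    """Root mean square error between estimated and observed values."""
    if len(estimated) != len(observed) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((e - o) ** 2 for e, o in zip(estimated, observed))
    return math.sqrt(squared / len(observed))

# Illustrative values only
error = rmse([2.0, 4.0], [0.0, 0.0])   # sqrt((4 + 16) / 2) = sqrt(10)
```

A lower RMSE indicates closer agreement between estimated and observed discharge, and large individual deviations are penalized more heavily because the errors are squared.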
Fig. 3 shows the RMSE values in estimating the daily discharge with the LWR method and the clustering-tree weighted regression method, which demonstrates that the precision of the latter method, proposed in this paper, is superior to that of the former.
Fig. 3 RMSE values of two methods
The estimated results of daily discharge using these two methods are shown in Fig. 4. From the estimated results in Fig. 4(a) and the RMSE values of the LWR method shown in Fig. 3, it can be seen that there is a large deviation between the estimated daily discharge and the observed one when the discharge is low. On the other hand, the estimated results improve when the discharge increases. Fig. 4(b) and Fig. 3 show that the clustering-tree weighted regression method can effectively reduce the large deviation between the estimated and observed daily discharge under normal climate conditions (e.g., when there are no floods, no heavy rainfall, etc.) and without special water storage requirements. This was achieved by first clustering the hydrological data of the river as the training samples, and then assigning the new stage samples to the proper classes with the k-nearest neighbor method. In this way, the interference of other irrelevant information can be avoided, and thus the efficiency and precision of daily discharge estimation can be improved. It can be seen from Fig. 3 that when the discharge was very large, the estimation accuracy of the proposed method was lower than that of the LWR method. This is because the proposed method relies on clusters of training samples: heavy rainfall rarely occurred, so the training samples included few large discharge values, and the accuracy of the proposed model decreased in that range. Without extreme weather events, the proposed method has practical significance in accurately capturing the hydrological changes of the river.
Fig. 4 Estimated discharge with LWR and clustering-tree weighted regression methods
The movement of water is a complicated process involving a large number of random factors. The traditional stage-discharge models were established on the basis of empirical regression, which is not suitable for complicated river flow characteristics.
Both the LWR method and a novel clustering-tree weighted regression method were used to estimate discharge from stage. The observed data of the Luding Hydrological Station were used to verify these two methods. The RMSE values and predicted results of daily discharge both show that under normal climate conditions the clustering-tree weighted regression method has a higher accuracy than the LWR method, and can accurately capture changing characteristics of dynamic flow.
In our future work, other non-parametric techniques that can also be applicable to parametric calibration will be investigated. To further validate the proposed approach in this paper, future tests will be carried out on data collected from different rivers.
Behzad, M., Asghari, K., Eazi, M., and Palhang, M. 2009. Generalization performance of support vector machines and neural networks in runoff modeling. Expert Systems with Applications, 36(4), 7624-7629. [doi:10.1016/j.eswa.2008.09.053]
Castro, R. M., Coates, M. J., and Nowak, R. D. 2004. Likelihood based hierarchical clustering. IEEE Transactions on Signal Processing, 52(8), 2308-2321. [doi:10.1109/TSP.2004.831124]
Cleveland, W. S. 1979. Robust locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 74(368), 829-836.
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403), 596-610. [doi:10.2307/2289282]
Dai, L. Q., Dai, H. C., Jiang, D. G., Li, H., and Chen, X. Y. 2010. Calculation of stage-discharge relationship curve based on least square method. Yellow River, 32(9), 37-39. (in Chinese)
Feng, G. Z., Wang, S. Y., and Wei, H. Y. 1996. Application of the multivariate autoregressive model to low flow forecast. Journal of Natural Resources, 11(2), 184-186. (in Chinese)
Feng, H. Z., and Chen, Y. Y. 2004. A new method for non-linear classify and non-linear regression, II: Application of support vector machine to weather forecast. Journal of Applied Meteorological Science, 15(3), 355-365. (in Chinese)
French, M. N., Krajewski, W. F., and Cuykendall, R. R. 1992. Rainfall forecasting in space and time using a neural network. Journal of Hydrology, 137(1-4), 1-31. [doi:10.1016/0022-1694(92)90046-X]
Lu, M. 2006. Research on the SVM application of runoff forecast. China Rural Water and Hydropower, (2), 47-49. (in Chinese)
Mitchell, T. M. 2003. Machine Learning. Beijing: China Machine Press. (in Chinese)
Shi, K. P., Mu, G., Li, T., and Lü, L. 2007. Empirical mode decomposition based clustering-tree method and its application in coherency identification of generating sets. Power System Technology, 31(22), 21-25. (in Chinese)
Zhu, M. 2002. Data Mining. Hefei: University of Science and Technology of China Press. (in Chinese)
(Edited by Yan LEI)
This work was supported by the Key Fund Project of the Sichuan Provincial Department of Education (Grant No. 11ZA009), the Fund Project of Sichuan Provincial Key Laboratory of Fluid Machinery (Grant No. SBZDPY-11-5), and the Key Scientific Research Project of Xihua University (Grant No. Z1120413).
*Corresponding author (e-mail: HILL5525@163.com)
Received Dec. 5, 2011; accepted Jun. 9, 2012
Water Science and Engineering, 2013, Issue 2