

        Identification of important factors influencing nonlinear counting systems

        2022-02-18

        Xinmin ZHANG, Jingbo WANG, Chihang WEI, Zhihuan SONG

        State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering,Zhejiang University, Hangzhou 310027, China

        Abstract: Identifying factors that exert more influence on the system output from data is one of the most challenging tasks in science and engineering. In this work, a sensitivity analysis of the generalized Gaussian process regression (SA-GGPR) model is proposed to identify important factors of nonlinear counting systems. In SA-GGPR, the GGPR model with Poisson likelihood is adopted to describe the nonlinear counting system. The GGPR model with Poisson likelihood inherits the merits of nonparametric kernel learning and the Poisson distribution, and can handle complex nonlinear counting systems. Nevertheless, understanding the relationships between model inputs and output in the GGPR model with Poisson likelihood is not readily accessible due to its nonparametric and kernel structure. SA-GGPR addresses this issue by providing a quantitative assessment of how different inputs affect the system output. Application results on a simulated nonlinear counting system and a real steel casting-rolling process demonstrate that the proposed SA-GGPR method outperforms several state-of-the-art methods in identification accuracy.

        Key words: Important factors; Nonlinear counting system; Generalized Gaussian process regression; Sensitivity analysis; Steel casting-rolling process

        1 Introduction

        Evaluating factors that have an impact on the system output is critical for decision-makers to identify critical control points (CCPs). For example, the steel industry is committed to reducing defects in steel products based on the defect analysis critical control point (DACCP) system. One step in the DACCP system is to determine CCPs where defect management efforts can be focused.

        The study of important factor identification from observational data is usually based on a supervised learning model. Supervised learning is a class of methods that determine a predictive model using labeled data (Mohri et al., 2018). Linear regression is a supervised learning technique typically used for predicting and finding relationships among quantitative data (Talabis et al., 2014; Sugiyama, 2015). A popular linear regression model is partial least squares (PLS) regression, which has been widely used in various fields (Wold et al., 2001; Abdi, 2010; Kano and Ogawa, 2010; Shao and Tian, 2015; Ge et al., 2017; Zhang et al., 2017, 2019, 2020a; Ge, 2018). In PLS regression, PLS-Beta and PLS-VIP have been widely used to identify important factors (Wang et al., 2015). In PLS-Beta, the identification of important factors is based on the regression coefficients of the PLS model. PLS-VIP is based on the variable importance in the projection (VIP) score. Nevertheless, simple parametric models lack expressive power for complex nonlinear processes. Compared with simple parametric models, nonparametric regression models, such as random forest (RF) (Cutler et al., 2012) and Gaussian process regression (GPR) (Rasmussen and Williams, 2006), are more powerful in handling complex nonlinear processes. RF is a nonlinear ensemble learning method that constructs a number of decision trees on various subsamples of the dataset and uses averaging to improve the prediction accuracy. The identification of important factors in RF can be realized by the permutation importance criterion and out-of-bag (OOB) error estimates (referred to as RF-PI) (Biau, 2012). GPR is a kernel-based nonlinear regression method. Because GPR uses implicit feature mapping through the kernel function, the evaluation of important factors in GPR is not easily accessible. To solve this issue, Blix et al. (2017) proposed a Gaussian process sensitivity analysis and applied it to the oceanic chlorophyll problem. Zhang et al. (2020b) proposed an identification method based on GPR and the Hilbert-Schmidt independence criterion. However, the GPR model is designed for continuous real-valued outputs under a Gaussian assumption, which does not hold in some engineering applications. For example, causal analysis of defects in steel products aims to discover the factors that affect the number of defects, which is a count-type output; the Gaussian assumption is invalid and the GPR model cannot be directly applied.

        In this work, a novel method, called the sensitivity analysis of the generalized Gaussian process regression (SA-GGPR) model, is proposed to identify important factors of the nonlinear counting system. In SA-GGPR, the GGPR model with Poisson likelihood is adopted to describe the nonlinear counting system. The GGPR model with Poisson likelihood inherits the merits of nonparametric kernel learning and the Poisson distribution, and can deal with complex nonlinear counting systems. Nevertheless, for the GGPR model with Poisson likelihood, the identification of model inputs that have a significant effect on the system output is not easily accessible due to its nonparametric and kernel structure. SA-GGPR deals with this issue by providing a quantitative assessment of how different inputs affect the system output in terms of a sensitivity measure. The proposed method is first validated on a simulated nonlinear counting system and then applied to a real steel casting-rolling process. The results demonstrate the feasibility and reliability of the proposed SA-GGPR method.

        2 Conventional methods

        In this section, brief descriptions of PLS-Beta, PLS-VIP, and RF-PI are presented. PLS-Beta and PLS-VIP are widely used to identify important factors of linear systems, while RF-PI is widely used for nonlinear systems.

        2.1 PLS-Beta

        PLS regression is a popular supervised learning method that predicts the system output from a set of inputs by constructing a latent variable model (Abdi, 2010). Consider a training dataset with input $X \in \mathbb{R}^{N\times M}$ and output $\boldsymbol{y} \in \mathbb{R}^{N}$, where $N$ and $M$ represent the number of samples and the number of input variables, respectively. In PLS regression, $X$ and $\boldsymbol{y}$ are decomposed as
$$X = TP^{\mathrm{T}} + E, \qquad \boldsymbol{y} = T\boldsymbol{q} + \boldsymbol{f},$$
        where $T \in \mathbb{R}^{N\times R}$ is the score matrix with $R$ latent variables, $P$ and $\boldsymbol{q}$ are the loading matrix and loading vector, and $E$ and $\boldsymbol{f}$ are residuals. The fitted model can be written in regression form as $\boldsymbol{y} = X\boldsymbol{\beta}_{\mathrm{pls}} + \boldsymbol{e}$, where $\boldsymbol{\beta}_{\mathrm{pls}} \in \mathbb{R}^{M}$ is a regression coefficient vector, indicating the importance of each input in describing the output. The absolute value of $\boldsymbol{\beta}_{\mathrm{pls}}$ is employed in PLS-Beta to identify important factors.

        2.2 PLS-VIP

        PLS-VIP identifies important factors in terms of the VIP score, which measures the importance of each input in the projection used in a PLS model (Wang et al., 2015). The VIP score of the $m$th variable is expressed as
$$\mathrm{VIP}_{m}=\sqrt{\frac{M\sum_{r=1}^{R}\left(q_{r}^{2}\,\boldsymbol{t}_{r}^{\mathrm{T}}\boldsymbol{t}_{r}\right)\left(w_{m,r}/\|\boldsymbol{w}_{r}\|\right)^{2}}{\sum_{r=1}^{R}q_{r}^{2}\,\boldsymbol{t}_{r}^{\mathrm{T}}\boldsymbol{t}_{r}}},$$
        where $\boldsymbol{t}_{r}$ and $\boldsymbol{w}_{r}$ denote the $r$th column vectors of $T$ and the weight matrix $W$, respectively, $q_{r}$ represents the $r$th element of $\boldsymbol{q}$, and $w_{m,r}$ represents the $m$th element of $\boldsymbol{w}_{r}$.

        2.3 RF-PI

        RF (Cutler et al., 2012) is a nonlinear ensemble learning method describing the input-output relationship by constructing a set of decision trees. Each tree is built on a bootstrap subset of the dataset. During the tree-growing process, the best split of each node is calculated from a randomly selected subset of the total input variables.

        RF uses the permutation importance criterion (referred to as RF-PI) to identify important factors of nonlinear systems (Cutler et al., 2012). The idea of RF-PI is that if one input variable is not important, the model accuracy will not deteriorate when the values of that input variable are permuted. Mathematically, the importance score $\mathrm{VI}_{m}$ for the $m$th input variable is calculated by averaging the difference in OOB errors before and after the permutation over all trees (Bühlmann, 2012). Let $\hat{f}_{b}(\cdot)$ denote the tree grown on the $b$th bootstrap subset ($b=1,2,\ldots,B$) and let $\mathrm{OOB}_{b}$ denote the OOB observations corresponding to the $b$th bootstrap subset. A step-by-step procedure for calculating $\mathrm{VI}_{m}$ is presented as follows:

        1. Compute the OOB error of $\hat{f}_{b}(\cdot)$ on $\mathrm{OOB}_{b}$, denoted $e_{b}$.

        2. Randomly permute the values of the $m$th input variable in $\mathrm{OOB}_{b}$ and recompute the OOB error on the permuted data, denoted $e_{b,m}^{\mathrm{perm}}$.

        3. Compute the difference $e_{b,m}^{\mathrm{perm}}-e_{b}$.

        4. Repeat steps 1-3 for $b=2,3,\ldots,B$.

        5. Calculate $\mathrm{VI}_{m}$ as
$$\mathrm{VI}_{m}=\frac{1}{B}\sum_{b=1}^{B}\left(e_{b,m}^{\mathrm{perm}}-e_{b}\right).$$
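The procedure above can be approximated with scikit-learn's `permutation_importance`. Note that scikit-learn permutes each variable on a held-out test set rather than on the per-tree OOB samples used in RF-PI, so this is only an approximation of the criterion; the data are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical nonlinear data: only x0 and x2 drive the output
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 4))
y = np.sin(X[:, 0]) + X[:, 2] ** 2 + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestRegressor(n_estimators=200, random_state=1)
rf.fit(X_tr, y_tr)

# Permute each column on held-out data and measure the drop in score;
# averaging over n_repeats plays the role of averaging over trees in RF-PI
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=1)
vi = result.importances_mean
```

The two informative variables (columns 0 and 2) receive clearly larger importance scores than the two irrelevant ones.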

        3 Sensitivity analysis of generalized Gaussian process regression

        In this section, a new method called SA-GGPR is presented to identify important factors of the nonlinear counting system. In SA-GGPR, the GGPR model with Poisson likelihood is adopted to describe the nonlinear counting system. The GGPR model with Poisson likelihood inherits the merits of nonparametric kernel learning and the Poisson distribution, and can deal with complex nonlinear counting systems. However, it is not intuitive to understand the relationship between model inputs and output in GGPR with Poisson likelihood due to its nonparametric kernel structure. To solve this problem, SA-GGPR is proposed in this work. SA-GGPR determines the factors that have a significant effect on the system output in terms of the sensitivity measure.

        3.1 Generalized Gaussian process regression

        GGPR constructs flexible nonparametric Bayesian models in which the observation likelihood is parameterized by an exponential family distribution (EFD) and the latent function is related to the output distribution via a link function (Chan and Dong, 2011). Specifically, GGPR consists of the following three components:

        1. Random component

        The output variable $y$ follows an EFD, with a probability density function (or probability mass function) taking the form of
$$p(y\mid\theta,\phi)=\exp\left\{\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right\},$$
        where $\theta$ is the natural parameter, $\phi$ is the dispersion parameter, and $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ are known functions that determine the specific member of the family.

        2. Systematic component

        A latent function $f(\boldsymbol{x})$ is assigned a Gaussian process prior, $f(\boldsymbol{x})\sim\mathcal{GP}\left(0,k(\boldsymbol{x},\boldsymbol{x}')\right)$, where $k(\cdot,\cdot)$ is the kernel (covariance) function.

        3. Link function

        A link function $g(\cdot)$ relates the expected output to the latent function via $g(\mathrm{E}[y])=f(\boldsymbol{x})$.

        3.2 SA-GGPR

        Although GGPR is a flexible nonparametric Bayesian regression model, the evaluation of important factors in GGPR is not easily accessible due to the implementation of implicit feature mapping. SA-GGPR is proposed to solve this problem. SA-GGPR determines how different values of an input variable affect the output. Mathematically, the measure of the sensitivity of variable $m$ is given as
$$s_{m}=\int\left(\frac{\partial\varphi(\boldsymbol{x})}{\partial x_{m}}\right)^{2}p(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x},$$

        where $\varphi(\boldsymbol{x})$ denotes the objective function and $p(\boldsymbol{x})$ is the probability density function of the inputs. The calculation of sensitivity involves taking the partial derivative of $\varphi(\boldsymbol{x})$ with respect to the input factor $x_{m}$. In this work, the predictive mean function $\mu_{f}$ of GGPR is specified as the objective function $\varphi(\boldsymbol{x})$. To simplify the calculation, the objective function is rewritten as
$$\varphi_{\eta}(\boldsymbol{x})=\boldsymbol{k}_{q}\left(K+\widetilde{W}\right)^{-1}\widetilde{\boldsymbol{t}}=\boldsymbol{k}_{q}\boldsymbol{\alpha}_{\eta}=\sum_{p=1}^{N}\alpha_{\eta,p}\,k(\boldsymbol{x}_{p},\boldsymbol{x}_{q}),$$

        where $\boldsymbol{\alpha}_{\eta}=\left(K+\widetilde{W}\right)^{-1}\widetilde{\boldsymbol{t}}$ represents a weight vector. Then, an empirical estimate of $s_{m}$ can be calculated by
$$\hat{s}_{m}=\frac{1}{N}\sum_{q=1}^{N}\left[\sum_{p=1}^{N}\alpha_{\eta,p}\,\frac{x_{p,m}-x_{q,m}}{\lambda^{2}}\,k(\boldsymbol{x}_{p},\boldsymbol{x}_{q})\right]^{2},$$
        where $\lambda$ here denotes the length-scale parameter of the squared exponential kernel.

        It is worth noting that the values of the positive definite diagonal matrix $\widetilde{W}$ and the target vector $\widetilde{\boldsymbol{t}}$ in Eq. (20) depend on the choice of the type of the observation likelihood. Because the focus of this work is on the identification of important factors of the counting system, the Poisson likelihood is selected. As a discrete probability distribution, the Poisson likelihood is suitable for applications that involve counting the number of occurrences of random events (Hutchinson and Holtman, 2005; Coxe et al., 2009). The probability mass function of the Poisson likelihood is defined as
$$p(y\mid\lambda)=\frac{\lambda^{y}\mathrm{e}^{-\lambda}}{y!},\quad y=0,1,2,\ldots,$$

        where $\lambda$ denotes the mean number of events (also known as the rate parameter).

        As mentioned in Eq. (6), the exponential family generalizes a wide variety of distributions by changing the likelihood parameters. For GGPR with Poisson likelihood, the parameter $\theta=\ln\lambda$, the dispersion $\phi=1$, and the parameter functions in the exponential family form are
$$a(\phi)=1,\quad b(\theta)=\mathrm{e}^{\theta}=\lambda,\quad c(y,\phi)=-\ln(y!).$$
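This exponential-family parameterization follows directly from rewriting the Poisson probability mass function:

```latex
p(y \mid \lambda) = \frac{\lambda^{y}\,\mathrm{e}^{-\lambda}}{y!}
                  = \exp\bigl( y\ln\lambda - \lambda - \ln(y!) \bigr),
```

so matching against the exponential-family template $\exp\{(y\theta-b(\theta))/a(\phi)+c(y,\phi)\}$ gives $\theta=\ln\lambda$, $b(\theta)=\mathrm{e}^{\theta}=\lambda$, $a(\phi)=1$, and $c(y,\phi)=-\ln(y!)$.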

        According to the Taylor approximation inference of GGPR with Poisson likelihood (Nickisch and Rasmussen, 2008), the target elements $\tilde{t}_{i}$ and diagonal elements $\tilde{w}_{i}$ can be calculated as
$$\tilde{t}_{i}=\ln(y_{i}+c),\quad \tilde{w}_{i}=y_{i}+c,$$
        where $c\geq 0$ (e.g., $c=0.001$) is a constant to prevent taking the logarithm of zero.

        A step-by-step procedure for implementing the proposed SA-GGPR algorithm is summarized in Algorithm 1. The model hyperparameters (kernel parameters) in Algorithm 1 are optimized by maximizing the marginal likelihood using the GPML toolbox (Rasmussen and Nickisch, 2010). The SA-GGPR code can be downloaded from https://github.com/IBD-CSE/SAGGPR.

        Algorithm 1  SA-GGPR
        Input: input data matrix $X\in\mathbb{R}^{N\times M}$ and output data vector $\boldsymbol{y}\in\mathbb{R}^{N}$; the kernel (covariance) function is the squared exponential kernel and the observation likelihood function is the Poisson likelihood
        Output: the importance scores $\mathrm{VI}_{m}$ ($m=1,2,\ldots,M$)
        1: Construct the GGPR model and obtain the objective function $\varphi_{\eta}(\boldsymbol{x})=\boldsymbol{k}_{q}(K+\widetilde{W})^{-1}\widetilde{\boldsymbol{t}}=\boldsymbol{k}_{q}\boldsymbol{\alpha}_{\eta}=\sum_{p=1}^{N}\alpha_{\eta,p}\,k(\boldsymbol{x}_{p},\boldsymbol{x}_{q})$
        2: for $m=1,2,\ldots,M$ do
        3:     Calculate the measure of sensitivity of variable $m$: $\hat{s}_{m}=\frac{1}{N}\sum_{q=1}^{N}\left[\sum_{p=1}^{N}\alpha_{\eta,p}\,\frac{x_{p,m}-x_{q,m}}{\lambda^{2}}\,k(\boldsymbol{x}_{p},\boldsymbol{x}_{q})\right]^{2}$
        4: end for
        5: Calculate the importance score of variable $m$: $\mathrm{VI}_{m}=\hat{s}_{m}\big/\sum_{j=1}^{M}\hat{s}_{j}$
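The sensitivity and normalization steps of Algorithm 1 can be sketched in NumPy for the squared exponential kernel. The weight vector `alpha` is assumed to come from a fitted GGPR model; here a random stand-in is used purely for illustration, so only the mechanics (not the values) are meaningful:

```python
import numpy as np

def se_kernel(A, B, ell, sf2=1.0):
    """Squared exponential kernel k(a, b) = sf2 * exp(-||a-b||^2 / (2*ell^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sf2 * np.exp(-d2 / (2 * ell**2))

def saggpr_sensitivity(X, alpha, ell):
    """Normalized importance scores VI_m from the empirical sensitivity s_m.

    s_m = (1/N) * sum_q [ sum_p alpha_p * (x_{p,m} - x_{q,m}) / ell^2
                          * k(x_p, x_q) ]^2
    `alpha` is the GGPR weight vector alpha_eta = (K + W~)^{-1} t~ (assumed
    given); `ell` is the kernel length-scale.
    """
    N, M = X.shape
    K = se_kernel(X, X, ell)                  # k(x_p, x_q), shape (N, N)
    s = np.empty(M)
    for m in range(M):
        diff = (X[:, m][:, None] - X[:, m][None, :]) / ell**2
        grad = (alpha[:, None] * diff * K).sum(axis=0)   # inner sum over p
        s[m] = np.mean(grad**2)                          # outer average over q
    return s / s.sum()                                   # VI_m, summing to one

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
alpha = rng.normal(size=50)    # stand-in weights for illustration only
vi = saggpr_sensitivity(X, alpha, ell=1.0)
```

The sum over `p` inside `grad` is exactly the analytic derivative of the kernel expansion with respect to the $m$th coordinate of the query point, which is what makes the sensitivity cheap to evaluate once `alpha` is known.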

        4 Case study

        In this section, we apply the proposed SA-GGPR method to a simulated nonlinear counting system and a real steel casting-rolling process. The application results are compared with those of the PLS-Beta, PLS-VIP, RF-PI, and SA-GPR methods in terms of identification accuracy. In SA-GPR, the standard GPR with Gaussian likelihood is adopted.

        4.1 Numerical example

        4.1.1 Data generation

        Data is generated from the following nonlinear counting system:

        where $x_{1}$-$x_{5}$ are input variables, $\mu$ denotes the mean of the output variable distribution, $y$ is the output variable, $t$ is uniformly distributed within $[-2,2]$, and $\varepsilon$ is Gaussian measurement noise with zero mean and a standard deviation of 0.1.

        Note that the output variable $y$ is discrete count data, and that the important factors or variables affecting the output variable are $x_{1}$, $x_{4}$, and $x_{5}$.
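Since the paper's exact simulation equations are not reproduced here, the following is only a hypothetical generator of the same flavor: a Poisson count output whose mean depends nonlinearly on $x_1$, $x_4$, and $x_5$, while $x_2$ and $x_3$ are irrelevant. All functional forms and coefficients below are our assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 2000
t = rng.uniform(-2, 2, size=N)            # latent driver on [-2, 2]
noise = lambda: 0.1 * rng.normal(size=N)  # Gaussian noise, std 0.1

x1 = t + noise()
x2 = rng.normal(size=N)                   # irrelevant inputs
x3 = rng.normal(size=N)
x4 = np.sin(t) + noise()
x5 = t**2 + noise()

# Hypothetical nonlinear mean; exp keeps the Poisson rate positive
mu = np.exp(0.5 * x1 + 0.8 * x4 + 0.3 * x5)
y = rng.poisson(mu)                       # discrete count output
```

An identification method applied to such data should rank $x_1$, $x_4$, and $x_5$ above $x_2$ and $x_3$, mirroring the ground truth stated for the paper's simulation.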

        4.1.2 Performance measure

        To evaluate the prediction performance of each method, the root mean squared error (RMSE) and the correlation coefficient $R$ are used. RMSE and $R$ are calculated as
$$\mathrm{RMSE}=\sqrt{\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\left(y_{i}-\hat{y}_{i}\right)^{2}},$$
$$R=\frac{\sum_{i=1}^{N_{t}}\left(y_{i}-\bar{y}\right)\left(\hat{y}_{i}-\bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N_{t}}\left(y_{i}-\bar{y}\right)^{2}\sum_{i=1}^{N_{t}}\left(\hat{y}_{i}-\bar{\hat{y}}\right)^{2}}},$$
        where $y_{i}$ and $\hat{y}_{i}$ represent the actual observed value and the predicted value, respectively, $\bar{y}$ and $\bar{\hat{y}}$ represent the mean values of $y_{i}$ and $\hat{y}_{i}$, respectively, and $N_{t}$ represents the number of testing samples.
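A straightforward NumPy implementation of the two criteria, with toy numbers for illustration:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error over the N_t testing samples."""
    return np.sqrt(np.mean((np.asarray(y, float) - np.asarray(y_hat, float)) ** 2))

def corr_coef(y, y_hat):
    """Pearson correlation coefficient R between observed and predicted values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    yc, yhc = y - y.mean(), y_hat - y_hat.mean()
    return np.sum(yc * yhc) / np.sqrt(np.sum(yc**2) * np.sum(yhc**2))

y_true = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
y_pred = np.array([2.5, 1.5, 3.5, 1.5, 5.5])
e = rmse(y_true, y_pred)      # 0.5: every prediction is off by exactly 0.5
r = corr_coef(y_true, y_pred)
```

A lower RMSE and an $R$ close to 1 both indicate better prediction accuracy.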

        To evaluate the identification performance of each method for the important variables, a confusion matrix (also known as an error matrix) is employed. The confusion matrix reports information about the predicted and actual classes. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The first three variables in the order of variable importance predicted by each method are classified as important variables. The remaining two variables at the bottom are classified as unimportant variables. Table 1 shows a confusion matrix, in which $c_{1}$-$c_{4}$ denote the numbers of variables identified by each method in each group. Three metrics that are calculated from the confusion matrix are commonly employed to evaluate the identification performance of each method quantitatively. They are defined as
$$\mathrm{Accuracy}=\frac{c_{1}+c_{4}}{c_{1}+c_{2}+c_{3}+c_{4}},\quad \mathrm{Recall}=\frac{c_{1}}{c_{1}+c_{2}},\quad \mathrm{Selectivity}=\frac{c_{4}}{c_{3}+c_{4}},$$
        where $c_{1}$ and $c_{2}$ denote the numbers of important variables predicted as important and as unimportant, respectively, and $c_{3}$ and $c_{4}$ denote the numbers of unimportant variables predicted as important and as unimportant, respectively.
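Under the standard mapping of $c_1$-$c_4$ to true/false positives and negatives (our assumed labeling, since Table 1 itself is not reproduced here), the three metrics can be computed as:

```python
def confusion_metrics(c1, c2, c3, c4):
    """Metrics from a 2x2 confusion matrix for important-variable identification.

    Assumed mapping: c1 = important predicted important (TP),
    c2 = important predicted unimportant (FN),
    c3 = unimportant predicted important (FP),
    c4 = unimportant predicted unimportant (TN).
    """
    accuracy = (c1 + c4) / (c1 + c2 + c3 + c4)
    recall = c1 / (c1 + c2)          # fraction of truly important vars found
    selectivity = c4 / (c3 + c4)     # fraction of unimportant vars kept out
    return accuracy, recall, selectivity

# Perfect identification in the numerical example: 3 important, 2 unimportant
acc, rec, sel = confusion_metrics(3, 0, 0, 2)
```

Perfect identification yields accuracy, recall, and selectivity all equal to 1.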

        4.1.3 Results and discussion

        Using the above data generation process (simulation system), 2000 samples are generated. The whole dataset is divided into training and testing datasets according to the 10-fold cross-validation criterion. That is, the dataset is randomly divided into ten parts, nine of which are used for training and the remaining one for testing. This process is repeated 10 times, and the testing data used is different each time. Table 2 shows the mean prediction accuracy of each method in terms of the RMSEP (RMSE of prediction) and $R$ criteria. In PLS, the number of latent variables used is set at 3, which is determined by cross-validation. In RF, the number of trees is set at 500, which is determined by the OOB error criterion. In GPR and GGPR, the model hyperparameters are optimized by maximizing the marginal likelihood using the GPML toolbox (Rasmussen and Nickisch, 2010). From Table 2, it can be seen that GGPR is the most accurate model among all the methods. Thus, the implementation of SA-GGPR is feasible.
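The 10-fold scheme described above can be sketched with scikit-learn's `KFold` (the placeholder data stand in for the 2000 simulated samples):

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten rotations: each uses nine folds for training and one for testing
X = np.arange(2000).reshape(-1, 1).astype(float)   # placeholder samples
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
for train_idx, test_idx in kf.split(X):
    fold_sizes.append((len(train_idx), len(test_idx)))
    # fit the model on X[train_idx] and evaluate on X[test_idx] here
```

With 2000 samples, each rotation trains on 1800 samples and tests on the remaining 200, and every sample is tested exactly once across the ten rotations.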

        Table 3 shows the results of SA-GGPR in identifying the important factors for the above nonlinear counting system in terms of the confusion matrix criterion. For comparison, the identification results of PLS-Beta, PLS-VIP, RF-PI, and SA-GPR are also provided. In Table 3, the identification result is the average of 50 repeated experiments. From Table 3, it can be seen that PLS-Beta, PLS-VIP, RF-PI, and SA-GPR yield poor identification performance with low accuracy, recall, and selectivity. In comparison, the proposed SA-GGPR achieves the best identification performance with the highest accuracy, recall, and selectivity. The detailed identification results are given in Fig. 1, where the results are shown visually in boxplots. In Fig. 1, the green boxes represent important variable IDs and the white boxes represent unimportant variable IDs. The importance of variables is normalized so that the sum is one. As shown in Fig. 1, PLS-Beta, PLS-VIP, RF-PI, and SA-GPR cannot fully identify all the important variables ($x_1$, $x_4$, and $x_5$). In contrast, the proposed SA-GGPR successfully identifies $x_1$, $x_4$, and $x_5$ as important variables, which is consistent with the experimental design.

        Fig.1 Identification results of important factors by different methods in the numerical example: (a)PLS-Beta;(b) PLS-VIP; (c) RF-PI; (d) SA-GPR; (e) SA-GGPR. References to color refer to the online version of this figure

        4.2 Steelmaking process

        In this subsection, we apply the proposed SA-GGPR method to solve a practical engineering problem and identify the most influential operating process variables that affect the number of defects in the steel plate.

        The defect data contains 5000 samples and 71 process variables, and is collected from an industrial casting-rolling process. The input variables include the casting speed, rolling temperature, cooling temperature, and so on. The output variable is the number of surface defects in the steel plate, which is a count-type output. According to the knowledge and experience of experts, the important and unimportant variables are listed in Table 4. Similar to the numerical example, the confusion matrix criterion is employed to evaluate the identification performance of each method quantitatively. From Table 4, it can be seen that the actual classes include 28 important variables and 43 unimportant variables. For the predicted classes, the first 28 variables in the order of variable importance predicted by each method are classified as important variables, and the remaining 43 variables at the bottom are classified as unimportant variables. Based on Eqs. (30)-(32), three metrics (accuracy, recall, and selectivity) can be calculated.

        Before implementing SA-GGPR, the accuracy of the GGPR model first needs to be evaluated. We randomly split the whole dataset into two parts: a training dataset with 4500 samples and a testing dataset with 500 samples. The training dataset was used to train the model, and the built model was then evaluated using the testing dataset. The above procedure was repeated 20 times. Table 5 summarizes the average prediction error of each model. In PLS, the number of latent variables used was set at 35. In RF, the number of trees was set at 500. In GPR and GGPR, the model hyperparameters were optimized by maximizing the marginal likelihood using the GPML toolbox (Rasmussen and Nickisch, 2010). As shown in Table 5, GGPR is the most accurate model with the smallest RMSEP and the largest $R$ among all the methods. Thus, the implementation of SA-GGPR is feasible.

        Table 4 Importance based on the knowledge and experience of experts

        Table 6 shows the identification results of important factors by different methods in terms of the confusion matrix criterion. PLS-Beta, PLS-VIP, RF-PI, and SA-GPR exhibited low accuracy, recall, and selectivity. In contrast, the proposed SA-GGPR had the best performance with the highest accuracy, recall, and selectivity. More detailed identification results of each method are shown in Fig. 2. The green boxes represent important variable IDs and the white boxes represent unimportant variable IDs. It can be seen that the proposed SA-GGPR distinguished the important variables from the other variables more accurately and clearly than the other methods.

        To investigate the computational cost of each method, Table 7 presents the comparison results in the same computing environment: a desktop computer with Windows 10 (64 bit), an Intel(R) Core(TM) i7-9700 CPU, 16 GB RAM, and MATLAB R2019b. From Table 7, it can be seen that PLS-Beta and PLS-VIP require less computing time than the other methods. The computing time of RF-PI is longer than those of PLS-Beta and PLS-VIP, but shorter than those of SA-GPR and SA-GGPR. The computing time of SA-GGPR is shorter than that of SA-GPR. As a result, the proposed SA-GGPR achieves the highest identification accuracy without incurring the highest computational cost. It should be emphasized that in many cases, accuracy is more important than speed. Therefore, the proposed SA-GGPR can be widely applied to important factor identification tasks.

        Table 5 Prediction results by different methods in the casting-rolling process

        Table 6 Important factors identified by different methods in the casting-rolling process

        Table 7 Computational time comparison of different methods

        Fig. 2 Identification results of important factors by different methods in the casting-rolling process: (a) PLSBeta; (b) PLS-VIP; (c) RF-PI; (d) SA-GPR; (e) SA-GGPR. References to color refer to the online version of this figure

        5 Conclusions

        In this research, the sensitivity analysis of the generalized Gaussian process regression (SA-GGPR) model is proposed to identify important factors of the nonlinear counting system. On one hand, the GGPR model with Poisson likelihood is adopted to describe the nonlinear counting system. The GGPR model with Poisson likelihood inherits the merits of nonparametric kernel learning and the Poisson distribution, and can handle complex nonlinear counting systems. On the other hand, the identification of important factors for the nonlinear counting system is introduced using SA-GGPR. SA-GGPR implements a quantitative assessment of how different inputs affect the system output based on the sensitivity measure. The usefulness and advantages of SA-GGPR are verified by its application to a simulated nonlinear counting system and a real steel casting-rolling process. The application results show that the proposed SA-GGPR method is feasible and more accurate in identifying important factors of the nonlinear counting system compared with several state-of-the-art methods.

        Contributors

        Xinmin ZHANG designed the research, processed the data, and drafted the manuscript. Jingbo WANG, Chihang WEI, and Zhihuan SONG revised and finalized the paper.

        Compliance with ethics guidelines

        Xinmin ZHANG, Jingbo WANG, Chihang WEI, and Zhihuan SONG declare that they have no conflict of interest.
