Qiming Zhao, Kexin Bi, Tong Qiu
1 Department of Chemical Engineering, Tsinghua University, Beijing 100084, China
2 Beijing Key Laboratory of Industrial Big Data Systems and Applications, Tsinghua University, Beijing 100084, China
3 School of Chemical Engineering, Sichuan University, Sichuan 610065, China
4 Department of Bioprocess Engineering, Institute of Biotechnology, Technische Universität Berlin, Berlin 10623, Germany
Keywords: Mathematical modeling; Data-driven modeling; Process systems; Steam cracking; Clustering; Multivariate adaptive regression spline
ABSTRACT: Steam cracking is the dominant technology for producing light olefins, which are regarded as the foundation of the chemical industry. Predictive models of the cracking process can boost production efficiency and profit margin. Rapid advancements in machine learning research have recently enabled data-driven solutions to usher in a new era of process modeling. Meanwhile, their practical application to steam cracking is still hindered by the trade-off between prediction accuracy and computational speed. This research presents a framework for data-driven intelligent modeling of the steam cracking process. Industrial data preparation and feature engineering techniques provide computation-ready datasets for the framework, and feedstock similarities are exploited using k-means clustering. We propose LArge-Residuals-Deletion Multivariate Adaptive Regression Spline (LARD-MARS), a modeling approach that explicitly generates output formulas and eliminates potentially outlying instances. The framework is further validated by the presentation of clustering results, the explanation of variable importance, and the testing and comparison of model performance.
Olefins are high-value-added raw materials manufactured from fuels such as ethane, propane, butane, liquefied petroleum gas, naphtha, and hydrocracking tail oil [1]. Ethylene and propylene are light olefins serving as the main building blocks for the petrochemical industry, and polymers produced from them are in widespread use [2]. The primary process for manufacturing light olefins is steam cracking [3]. The olefins industry faces significant challenges in the post-pandemic era due to rapidly rising production and a flattening worldwide ethylene cost curve [4]. New construction and expansion of olefin factories are intensifying competition [5]. In the meantime, intelligent manufacturing techniques promote the digital transformation of the olefins industry through the investigation and application of various modeling, simulation, control, and optimization-based approaches. Therefore, decision-makers are concentrating on intelligent manufacturing technologies to increase profit margins and endure this period of rising competition.
Due to the complex reaction mechanisms in the steam cracking process, it is essential to develop an accurate and broadly applicable predictive model. There are two primary types of modeling techniques: rigorous and surrogate models. Rigorous models are constructed from first principles, are usually transparent, and explicitly characterize input-output relationships [6]. In most olefin plants worldwide, software packages and toolkits such as Spyro [7], Coilsim [8], and EcSOS [9] are frequently utilized. Nevertheless, the multiple differential equations of mass, momentum, and heat transfer complicate predictive modeling for complex production processes, making it time-consuming to obtain accurate solutions [10].
In recent decades, surrogate models based on machine learning techniques have gained popularity and are often regarded as potential alternatives to traditional rigorous models [11]. Power law [12], support vector regression [13], and neural network [14] models are typical candidates. However, when applied in real-world production scenarios, surrogate models may have limited interpretability, poor generalizability, low transparency, and severe data sensitivity [11]. Knowledge representation and embedding methods, such as variable selection, feature extraction, data clustering, and the selection of appropriate statistical models, are crucial to resolving these issues.
Prior knowledge is utilized in variable selection and feature extraction to transform the variable space, enhancing model interpretability and raising the transparency of the algorithms involved in process modeling. Bikmukhametov and Jäschke [15] proposed a method to generate physically meaningful, interpretable, and domain-specific features with knowledge of fluid mechanics. Plehiers et al. [16] applied an additional level of abstraction in deep learning structures to extract features that contain more relevant information, resulting in a substantial acceleration of detailed yield predictions for the steam cracking process with minimal loss of precision. Regarding prediction accuracy and ease of interpretation, both models outperformed the ones based directly on raw measurement data.
Data clustering is another valuable technique for grouping samples according to feedstock properties or operational conditions. These algorithms partition the entire industrial dataset into subsets with similar patterns, which facilitates the processing of high-dimensional data and contributes to the subsequent modeling of the steam cracking process. Han et al. [17] employed an affinity propagation clustering algorithm for dimension reduction and redundant information filtering in petrochemical production data, thereby unlocking the potential for energy savings. Moghadasi et al. [18] applied density-based spatial clustering of applications with noise (DBSCAN) to the operational states of a gas sweetening process, with results fed into gradient boosting machines to reduce energy consumption by 2%. Gong et al. [19] used the k-means algorithm to cluster the working conditions of ethylene production, and the clustering results were validated in a subsequent energy efficiency evaluation. Cluster analysis constructs a multi-level model emphasizing the differences between groups, capturing the big picture in industrial datasets.
Black-box models can approximate functions optimally, but explanations are often post-hoc, and generalization capabilities cannot be guaranteed. For example, deep learning neural networks exhibit low average prediction errors [16], but training can be computationally expensive. Support vector regression (SVR), a popular statistical learning technique, can generalize well with limited data, but its performance degrades in the presence of noise [20]. To address these drawbacks, this study explores the multivariate adaptive regression spline (MARS) proposed by Friedman [21]. According to several studies in other research domains [22-24], MARS is an effective statistical method for generating explicit expressions from a subset of features. It is nevertheless vulnerable to the combined effects of redundancy and noise. An iterative outlier removal process should be added for model-specific anomaly detection to complement the model-agnostic techniques applied during data preparation. Together with the data preparation and clustering methods mentioned above, the removal step helps resolve the well-known problem of obtaining a trustworthy and highly adaptive model from raw, noisy industrial data [25-27].
This paper proposes an integrated modeling framework incorporating feature engineering, clustering, and data-driven modeling for steam cracking. After industrial data preparation, feature engineering techniques using domain knowledge are introduced to handle the feedstock properties and operational variables, boosting interpretability and knowledge expression. The k-means clustering algorithm is performed on feedstock-related features. We propose a novel statistical learning approach for model construction, LArge-Residuals-Deletion (LARD) MARS. The algorithm iteratively removes samples with large residuals from the training set, which provides a model-specific strategy for outlier removal. The above steps are compiled into a system for industrial dataset modeling, whose performance is evaluated on multiple datasets and compared to similar approaches. The resulting expressions are relatively straightforward, relevant to the underlying mechanism, and precise enough for later control and optimization.
Industry digitization has made it simpler and more convenient to access industrial data via monitoring systems such as the distributed control system (DCS). We present a framework for data-driven intelligent modeling using industrial datasets to develop a model for widespread application. A flowchart of the framework is depicted in Fig. 1. During the offline phase, industrial datasets are acquired using digitalized techniques from actual processes. The industrial data are then preprocessed, which includes data cleansing, feature selection and transformation, and data normalization. The core prediction algorithm is LArge-Residuals-Deletion Multivariate Adaptive Regression Spline (LARD-MARS). After validation on more datasets, the whole computation project is compiled and deployed to the end user for online inference. During the online phase, real-time data are similarly preprocessed before model prediction.
Fig. 1. Flowchart of the data-driven intelligent modeling framework for industrial datasets: (a) offline modeling stage; (b) online application stage.
Our intelligent framework is adaptable to variations in dataset quality, feedstock adjustments, and minor disturbances to furnace operation. The models are built to maximize prediction accuracy and generalization while minimizing computational complexity and human involvement.
Industrial datasets often contain noise and missing values. Standard data cleansing techniques, including missing value elimination and outlier removal [28,29], are performed to ensure the completeness and consistency of industrial datasets. A row-wise deletion is carried out, discarding samples with missing values. A normality test then detects values exceeding 3 standard deviations (3σ) from the variable average. Instances with any flagged variable are regarded as statistical outliers and removed.
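For illustration, a minimal pandas sketch of this cleansing step might look as follows (the single-pass 3σ rule applied to all-numeric columns is an assumption; the paper does not give implementation details):

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame, sigma: float = 3.0) -> pd.DataFrame:
    """Row-wise deletion of missing values, then 3-sigma outlier removal."""
    df = df.dropna(axis=0)                     # discard samples with missing values
    z = (df - df.mean()) / df.std()            # per-variable z-scores
    return df[(z.abs() <= sigma).all(axis=1)]  # drop rows with any flagged variable
```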
Another essential stage for knowledge expression and interpretable modeling is knowledge-based feature engineering, the generation of physically relevant features that are simplified yet preserve adequate significance [15]. Embedded process knowledge may reduce variable dimensions, tackle nonlinear effects, and enhance the clustering and regression procedures [30].
Depending on their physical meaning or units, variables have different scales. A z-score normalization [31] process is applied to the whole dataset to enhance the performance of the feedstock clustering algorithm.
The composition of feedstock directly determines its cracking performance. After extracting physically meaningful features, clustering algorithms can subdivide the dataset into smaller subsets to facilitate accurate modeling. There are four frequently used clustering types: connectivity-based clustering (also known as hierarchical clustering), centroid-based clustering, distribution-based clustering, and density-based clustering [32]. Centroid-based clustering appears to be the most appropriate approach to identifying similar feedstocks. K-means clustering [33], a representative centroid-based clustering method, is employed in the proposed framework. It attempts to maximize the total distance between points in different clusters. Because the sum of distances between all points in the entire dataset is constant, the objective function can be transformed into the minimization of the pairwise distances among samples within the same cluster, which is formally expressed as:

$$\mathop{\arg\min}_{S_1,\,S_2,\,\cdots,\,S_K}\ \sum_{k=1}^{K} \frac{1}{2\,|S_k|} \sum_{\mathrm{obs}_i,\,\mathrm{obs}_j \in S_k} \left\lVert \mathrm{obs}_i - \mathrm{obs}_j \right\rVert^2 \tag{1}$$

where N is the number of samples in the dataset; K is the number of clusters; obs_1, obs_2, ···, obs_N are observations of specific samples; and S_1, S_2, ···, S_K are specific clusters.
The optimal K is generally determined using a silhouette plot [34], which can validate the consistency of clustering. The silhouette coefficient, which lies between -1 and 1, measures the similarity of each instance to its own cluster relative to other clusters. Predominantly positive and high silhouette coefficients indicate good clustering performance.
The features associated with the feedstock serve as the inputs for the clustering algorithm, implemented using the scikit-learn package [35] available in Python 3.10. Subsequent regression models are then constructed independently for each cluster.
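As a sketch of how this stage could be reproduced with scikit-learn (the file name and column layout are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumed file holding the feedstock-related features (P/I, P+I, A, VABP, SL)
X = np.loadtxt("feedstock_features.csv", delimiter=",")
X_scaled = StandardScaler().fit_transform(X)  # z-score normalization

# Scan candidate K and report the average silhouette coefficient for each
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    print(k, silhouette_score(X_scaled, labels))
```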
Production yields are predicted from observational data by the regression model, which must be precise, generalizable, and rapid to construct for advanced process control and real-time optimization. Our framework adopts a LArge-Residuals-Deletion multivariate adaptive regression spline (LARD-MARS) approach derived from the original MARS algorithm.
MARS is an adaptive regression method initially proposed by Friedman [21], which can be considered a generalized version of stepwise linear regression. The modeling strategy tries to capture local patterns in the data with piecewise linear basis functions (hinge functions):

$$(X - t)_+ = \max(0,\, X - t), \qquad (t - X)_+ = \max(0,\, t - X) \tag{2}$$

Each knot t may generate such a pair of functions, called a reflected pair. A basis function B(X) can be a constant, a hinge function, or the product of multiple hinge functions. The maximum order of interaction among input features determines the model degree. The following equation gives the final form of the MARS model:

$$\hat{f}(X) = \alpha_0 + \sum_{n=1}^{N_B} \alpha_n B_n(X) \tag{3}$$

with the coefficients α_n given by standard linear regression on the selected basis functions B_n(X).
Each iteration of the two-step model-building process contains a forward and a backward pass. Starting from a constant term (a zeroth-order model), the forward pass adds, from the collection of candidates, the functions that achieve the maximum decrease in training error. The procedure terminates when the fit is satisfactory or when the number of knots involved exceeds a predetermined value n_k, typically producing a complex model that overfits the data. Consequently, the backward deletion procedure prunes unimportant terms using the generalized cross-validation (GCV) criterion:

$$\mathrm{GCV} = \frac{\mathrm{RSS}/N}{\left(1 - M/N\right)^2} \tag{4}$$

where N is the total number of observations (the cluster size), M is the effective number of parameters, and RSS is the residual sum of squares. By correcting for model complexity, the GCV approximates the cross-validation RSS as a measure of model generalization. A smaller GCV suggests a more accurate and generalizable model.
Following the construction of the MARS model, training samples with large residuals are removed. An inaccurate input-output co-distribution is likely to cause anomalous predictions that deviate significantly from nearby samples. These noisy instances are adaptively removed so that MARS can concentrate on a high-quality, relevant part of the dataset [25]. This procedure introduces a parameter a_resid to regulate the threshold of large-residuals deletion: samples are removed if their residuals exceed a_resid × σ(resid), where σ(resid) represents the standard deviation of the original model's residuals. Because σ(resid) accommodates varying output scales, this threshold determination method is generally applicable.
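A minimal sketch of this deletion loop, assuming an sklearn-style MARS estimator (such as py-earth's Earth) is available; the authors' implementation uses the R earth package and may differ in detail:

```python
import numpy as np

def lard_fit(model_factory, X, y, a_resid=3.0, max_iter=10):
    """Iteratively refit a MARS model, deleting training samples whose
    absolute residuals exceed a_resid * sigma(resid) of the current fit."""
    X, y = np.asarray(X), np.asarray(y)
    for _ in range(max_iter):
        model = model_factory().fit(X, y)  # fresh MARS-style estimator
        resid = y - model.predict(X)
        keep = np.abs(resid) <= a_resid * resid.std()
        if keep.all():                     # converged: no large residuals left
            break
        X, y = X[keep], y[keep]            # delete flagged samples and refit
    return model, X, y
```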
LARD-MARS has several advantages over previous rigorous and surrogate models: (i) the algorithm is effective at identifying local correlations, integrating seamlessly with k-means clustering; (ii) the basis function selection facilitates the interpretation of variable importance; (iii) a prediction model with explicit expressions enhances model transparency and the practicality of control and optimization; and (iv) the model scales well on large and high-dimensional datasets.
The LARD-MARS approach is implemented using the earth package [36] in R 4.1.0. Conventional model performance measurements include the coefficient of determination (R²), root-mean-square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
Generalized R² (GRSq), a MARS-specific model performance indicator, balances goodness of fit and model complexity. Similar to R², it standardizes the GCV for simple interpretation across models. The value is determined by:

$$\mathrm{GRSq} = 1 - \frac{\mathrm{GCV}}{\mathrm{GCV}_0} \tag{5}$$

where GCV_0 represents the GCV value of the zeroth-order model (a constant prediction of the mean output value).
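For concreteness, the two criteria translate directly into code (a sketch following Eqs. (4) and (5)):

```python
def gcv(rss: float, n: int, m: float) -> float:
    """Generalized cross-validation: RSS/N corrected for model complexity."""
    return (rss / n) / (1.0 - m / n) ** 2

def grsq(gcv_model: float, gcv_null: float) -> float:
    """GRSq standardizes GCV against the zeroth-order (constant) model."""
    return 1.0 - gcv_model / gcv_null
```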
Several arguments in the function call may affect the convergence and outcomes of LARD-MARS, hence the need for proper parameter tuning. Because training is highly efficient, an exhaustive grid search is adequate. The tuned parameters are the residual threshold controller a_resid and the maximum number of knots n_k. An appropriate value of a_resid produces a focused model, while n_k controls model complexity. The procedure employs 5-fold cross-validation (CV) so that the averaged performance measurements provide unbiased evaluations of the fitted model [20].
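A sketch of such a grid search, reusing the lard_fit wrapper above (make_mars is a hypothetical factory producing a MARS estimator limited to n_k knots; X and y are the prepared training arrays):

```python
from itertools import product
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

param_grid = product([30, 40, 50, 60],      # n_k: maximum number of knots
                     [2.0, 2.5, 3.0, 3.5])  # a_resid: residual threshold

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for n_k, a_resid in param_grid:
    scores = []
    for tr, te in cv.split(X):
        model, _, _ = lard_fit(lambda: make_mars(max_terms=n_k),
                               X[tr], y[tr], a_resid=a_resid)
        scores.append(r2_score(y[te], model.predict(X[te])))
    print(n_k, a_resid, np.mean(scores))    # average CV R² per setting
```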
In the steam cracking process, the cracking furnace is the central device, in which the pyrolysis of raw materials produces olefins. The distribution of the final products is mainly determined by the feedstock properties and the cracking severity of the furnace [37]. Fig. 2 is a diagram of a typical cracking furnace. The feedstock is preheated, mixed with dilution steam, and evaporated in the furnace's convection section. Pyrolysis occurs in the radiation section, driven by the heat released from fuel gas combustion in the burners. Depending on the type of tubular reactor, the pyrolysis reactions generally conclude within 0.1-1 s. Finally, the products are transported to the transfer line exchanger (TLE) section and post-processed for downstream production [38].
Fig. 2. Diagram of a typical cracking furnace (pyrolysis mainly occurs in the radiation section).
A complete industrial system simultaneously monitors critical operating parameters and the compositions of feedstocks and products. The main operation parameters in Fig. 2 include the flow rates (F) of the feedstock and dilution steam, as well as the temperatures (T) and pressures (P) of the crossover section and coil outlet. The compositions of feedstock and products are obtained by online analysis.
Data samples have been collected from an operating plant's DCS, constituting a high-volume dataset spanning more than a year. The online instrumental analysis results for feedstock and products are collected synchronously using an online Fourier transform infrared (FT-IR) spectrometer. Downsampling is performed to produce an hourly dataset with 9145 samples, effectively mitigating slight operational fluctuations, accidental measurement errors, and algorithmic complexity issues during modeling. The original dataset is referred to as D0. Two smaller validation datasets, D1 and D2, are collected similarly.
The following is a list of selected variables and generated features (a sketch of computing them in code follows the list):
(i) Feedstock-related:
(1) The mass ratio of n-paraffins to iso-paraffins (P/I);
(2) The total weight percentage of n-paraffins and iso-paraffins (P+I);
(3) The mass fraction of aromatics (A);
(4) Volume-averaged boiling point (VABP), namely the arithmetic average of five distillation temperatures at 10%, 30%, 50%, 70%, and 90% volume on an ASTM D86 standard distillation curve;
(5) The overall slope of an ASTM D86 standard distillation curve (SL), which is the secant slope between the two distillation temperatures at 10% and 90% volume.
(ii) Operation-related:
(1) Feedstock flow rate (FFR);
(2) Dilution steam ratio (DSRATIO), the weight ratio of steam to oil;
(3) Coil outlet temperature (COT);
(4) Crossover section average pressure (CIP), calculated by taking the arithmetic average of several values obtained at the crossover section pressure measurement points.
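As referenced above, a sketch of computing these feedstock features from raw analysis data (the dictionary keys are hypothetical and stand in for whatever the plant's composition and distillation reports provide):

```python
import numpy as np

def feedstock_features(d86: dict, wt: dict) -> dict:
    """Knowledge-based feedstock features from an ASTM D86 curve and a
    composition report. d86 maps vol% -> temperature; wt holds weight percents."""
    vabp = np.mean([d86[v] for v in (10, 30, 50, 70, 90)])  # volume-averaged boiling point
    sl = (d86[90] - d86[10]) / (90 - 10)                    # secant slope of the D86 curve
    return {
        "P/I":  wt["n_paraffins"] / wt["iso_paraffins"],    # paraffin mass ratio
        "P+I":  wt["n_paraffins"] + wt["iso_paraffins"],    # total paraffin content
        "A":    wt["aromatics"],                            # aromatics mass fraction
        "VABP": vabp,
        "SL":   sl,
    }
```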
These variables and features show a significant correlation with the olefin yields. Feedstocks are characterized by specific features utilized by petrochemical analysis engineers. P/I, P+I, and A reflect the difficulty of converting feedstocks to olefins. Previous research has revealed that the yields of light olefins correlate positively with P/I and P+I but negatively with A [39,40]. VABP represents the average weight of the feedstock (average carbon number), whereas SL represents the distribution of carbon numbers.
Most operation-related variables are retained because they are directly related to reactor conditions. The two flow-related variables, FFR and DSRATIO, determine the space velocity and the hydrocarbon partial pressure, respectively. COT is the most influential variable affecting cracking depth, significantly impacting operation optimization. Another variable, CIP, represents the effect of momentum transfer.
Thanks to variable selection and knowledge-based feature engineering, physically meaningful features replace directly measured variables as inputs to clustering and predictive modeling.
The original dataset D0 undergoes data cleansing, and k-means clustering is performed on the 7278 remaining samples. The cluster analysis considers the feedstock-related features, including P/I, P+I, A, VABP, and SL. Fig. 3 depicts the detailed k-means clustering results.
As shown in Fig. 3(a), the samples reach the optimal clustering configuration when the number of clusters K = 2. The 2D visualization of P/I versus P+I shows a border between the two subsets. The sizes of the two clusters are comparable, with 2806 and 4472 samples, respectively. The smaller cluster is designated as D0-C1 and the larger cluster as D0-C2.
A silhouette plot of the k-means clustering results is shown in Fig. 3(b). Most samples have high positive silhouette values, and the average silhouette coefficient reaches 0.49. The detailed k-means clustering results with 2-4 clusters are given in Figs. S3-S5 of the Supplementary Material.
As an illustration of all input features to the clustering, a feature-pair plot is displayed in Fig. 4. The diagonal plots depict the contribution of each single feature to cluster separation. The lower-left and upper-right subplots show scatter plots and kernel density estimation (KDE) plots for each pair of features (symmetrical along the diagonal). Based on the diagonal plots, P+I, A, and SL are the best single-feature indicators in the clustering process because they exhibit the least distribution overlap. Judging from the overlap in the scatter and KDE plots, (P+I, A), (P/I, A), and (P+I, SL) are the best feature pairs for discriminating the specific regional characteristics of the feedstock. The well-separated clusters confirm that knowledge-based feature engineering can effectively generate relevant features.
Fig. 4. Feature-pair plot for k-means clustering. Samples are well separated by all feature pairs.
Fig. 5 displays a time-series plot for analyzing the temporal aspect of industrial practice. The steam cracking process is cyclic, with planned start-ups and shutdowns, and the feedstock type is usually changed during the interval between cycles. As shown in Fig. 5, the cluster labels exhibit evident time continuity, suggesting that the naphtha feedstock supplied in a specific period has similar regional features. In addition, the division lines between clusters lie relatively close to the operation mode-switching locations (e.g., starting, stopping, and furnace inlet switching), supporting a realistic clustering result.
Fig. 5. Time-series plot labeled with the k-means clustering results. Cluster labels are relatively stable over time.
In the main experiment, LARD-MARS uses the smaller cluster D0-C1. 10% of the data are randomly set aside as the test set, leaving the training set with 2244 samples. The Supplementary Material contains the distribution of normalized inputs and outputs (Figs. S11-S13), data sample and variable ranges (Table S1), and dataset examples (Table S2). The model degree (the highest order of variable interaction) is set to 2, as previous studies [41-43] determined that a quadratic model can accurately describe a chemical engineering process such as steam cracking. The prediction target is the yield of ethylene.
To determine the best parameters, a grid search with 5-fold CV is carried out, and R² metrics are averaged over the five test splits. Fig. 6 shows the result, with deeper colors corresponding to higher R². Overall, the CV performance is consistent. A larger number of knots n_k increases both R² and the model complexity. A reasonable choice of n_k is 40, as the accuracy improves only marginally beyond this value.
Fig. 6. Test R² of 5-fold cross-validation on the original dataset. n_k and a_resid are tuned, while other parameters remain unchanged.
In Fig. 6, the longitudinal changes show that R² first grows and subsequently drops as a_resid increases. It can be inferred that a small a_resid is insufficient to remove anomalous responses, whose fluctuations can hardly be identified during the data cleansing step and may degrade fitting performance. Meanwhile, a high value of a_resid may result in excessive deletion of the original response characteristics, and the model may fail to learn the reduced data distribution. A reasonable value for a_resid is 3.0.
The model with n_k = 40 and a_resid = 3.0 is investigated further. Residual analysis is employed to evaluate the improvement in predictive performance from eliminating large-residual samples. Fig. 7 compares the residual plots of model predictions before and after sample removal. As shown in Fig. 7(a), two regions in the original prediction exhibit high negative errors, with yields between 30% and 31% or slightly above 32%. The underprediction of yields may have resulted from unsteady-state processes at startup (data collection) or inaccurate clustering of the feedstock (data processing). Samples with residuals larger than 0.758, corresponding to a_resid = 3.0, are removed. After six iterations of removal, a total of 101 samples (4.5%) are removed from the training set, resulting in a model with evenly distributed residuals, as is apparent from Fig. 7(b). The test R² increases from 0.928 to 0.937. The refitted model (30 terms) is similar to the original (27 terms) in complexity. Overall, the deletion step in LARD-MARS provides an adaptive and effective method for industrial data modeling.
Fig. 7. Residual plots for fitted values on the training set: (a) the original model, in which outlying samples exhibit high negative errors; (b) the model after large-residuals deletion, in which all predictions lie within the boundary.
Fig. 8 compares the actual and predicted ethylene yields using a parity plot. Predictions lie close to the diagonal line and follow a bimodal distribution, which may correspond to two stable working conditions. High GRSq and R² metrics indicate precise predictions.
Fig. 8. Parity plot of yield prediction by LARD-MARS. Actual and predicted yields follow a similar distribution.
The final output model for ethylene yield prediction is given by Eq. (6). The explicit expression ensures a transparent model output containing 1 constant, 9 linear, and 20 quadratic terms. As SL (the distillation curve slope) has been effectively separated out by the clustering, it does not appear in the final expression.
These findings suggest that removing samples with large residuals can improve data consistency and model generalization. Accurate model predictions may provide insights into industrial control and operation. Simple symbolic expressions make for an interpretable and trustworthy model for industrial applications.
Predictions for propylene, ethane, and the propylene/ethylene ratio, as well as model results based on the large cluster, are presented in the Supplementary Material (Equations S4-S6 and Figures S6-S10).
The contour plot in Fig. 9 visualizes the binary relationships between input features. As the selected pair of variables changes, the remaining variables are fixed at their medians. As an example of feedstock properties, P+I and P/I correlate positively with ethylene yields, corresponding to an easier conversion from feedstocks to ethylene. Two typical operational variables, FFR and DSRATIO, are both negatively associated with ethylene yields in the variable space of interest.
Fig. 9. A contour plot of binary interactions between variables. Other variables are held at their median values.
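A plot like Fig. 9 can be reproduced with a fitted model by varying one feature pair over a grid while pinning the rest at their medians (a matplotlib sketch; FEATURES, X_train, and model are assumed to come from the earlier fitting steps):

```python
import numpy as np
import matplotlib.pyplot as plt

i, j = FEATURES.index("P+I"), FEATURES.index("P/I")  # assumed feature name list
base = np.median(X_train, axis=0)                    # medians of all inputs

g1 = np.linspace(X_train[:, i].min(), X_train[:, i].max(), 50)
g2 = np.linspace(X_train[:, j].min(), X_train[:, j].max(), 50)
G1, G2 = np.meshgrid(g1, g2)

grid = np.tile(base, (G1.size, 1))                   # all variables at medians
grid[:, i], grid[:, j] = G1.ravel(), G2.ravel()      # except the selected pair

Z = model.predict(grid).reshape(G1.shape)            # fitted LARD-MARS model
plt.contourf(G1, G2, Z)
plt.xlabel("P+I"); plt.ylabel("P/I")
plt.colorbar(label="predicted ethylene yield")
plt.show()
```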
The relative influence of each input feature can be evaluated by its variable importance, shown in Fig. 10. During the backward pass, the variable importance values are determined using two criteria: "GCV" and "nsubsets". The former computes the total decrease in GCV upon each removal of a particular variable, while the latter counts the number of candidate models that include the variable [36]. Essential variables are less likely to be removed and cause considerable decreases in GCV when removed. As shown in Fig. 10, the two criteria yield the same four most important variables: DSRATIO, CIP, COT, and P/I. The other variables have less impact because of similar data distributions, since the model has been built on a single cluster. The identification of critical process variables can help develop control strategies.
Fig. 10. Variable importance determined by different criteria: (a) GCV, the total decrease in GCV upon each removal of a particular variable; (b) nsubsets, the number of candidate models that include the variable.
In summary, LARD-MARS has successfully captured local patterns of product yields in the steam cracking process. The intelligent modeling framework, including data processing and model implementation, has proved reliable and stable according to the performance metrics and validation results. Researchers and industrial practitioners can gain novel insights from the predictions and visualizations, especially the bivariate correlations.
Two additional datasets, D1 and D2, are used to validate the generalization performance. For each dataset, we select the smaller of the two clusters (C1) generated by k-means clustering. Parameters are tuned using a grid search with 5-fold CV, with n_k ranging from 30 to 60 and a_resid ranging from 2.0 to 3.5. Among the models with comparable cross-validation R², we choose the one with the lowest n_k and the highest a_resid.
Validation statistics are listed in Table 1. The optimal parameters fall within the searched range, and their values can serve as suggestions for future parameter tuning. We observe a low σ(RMSE) within each dataset and small performance variations among datasets, which show that the model generalizes well on unseen samples. Note that the penalty on model complexity causes the GRSq of D2-C1 to decrease, which corresponds to a model that is harder to generalize.
Table 1 Cross-validation statistics on different datasets
LARD-MARS is compared with other prominent modeling tools: an artificial neural network (ANN), decision tree regression (DTR), and our previously proposed rigorous model (EcSOS). Experiments are conducted using the same training set mentioned above and 10 randomly selected instances from the test set for rapid evaluation. Performance measurements are listed in Table 2.
Table 2 Test performance of other modeling methods
The ANN has the lowest RMSE among the four models, but its precision comes at the expense of model interpretability and generalization. A complex network may lead to overfitting [44], and unreliable gradient information at the borders of the training domain and beyond makes it challenging to reuse such models.
DTR is an interpretable example of statistical machine learning models. Analogous to the human decision-making process, regression trees can be easily visualized and understood [45]. Nonetheless, the tree structure causes a decrease in prediction smoothness, continuity, and accuracy, resulting in a higher RMSE.
LARD-MARS exhibits balanced performance compared with the ANN and DTR. Its RMSE is comparable to the ANN's, and explicit formulas support its interpretability. Transparent mechanisms in the model contribute to good generalization around the local variable space of the training dataset.
In addition, the performance of LARD-MARS and EcSOS is compared. The rigorous EcSOS model has not been adjusted to the training set. In this case, LARD-MARS achieves greater precision than EcSOS. Although the performance of EcSOS could be enhanced by tuning its internal parameters based on the training set, the time required and the model construction costs would be remarkably higher.
As a successful trade-off between accuracy and interpretability, the LARD-MARS approach may have potential applications to other steam cracking products, different furnace types, and other refining processes.
A framework for data-driven intelligent modeling has been developed for the steam cracking process, which satisfies the requirements of smart manufacturing in the petrochemical industry and provides reliable predictions. The framework integrates industrial data preparation, knowledge-based feature engineering, feedstock clustering, and the LArge-Residuals-Deletion Multivariate Adaptive Regression Spline (LARD-MARS) modeling algorithm.
The modeling results exhibit precision, stability, and interpretability. In the clustering stage, the high significance of the knowledge-aware features is confirmed, with a silhouette coefficient of 0.49 for two clusters. The main parameters in LARD-MARS are tuned by cross-validation. In the final model for ethylene prediction, the test R² reaches 0.937, which is acceptable for industrial applications. The large-residuals deletion has enhanced the model performance of the classic MARS algorithm.
The entire framework has been preliminarily applied to a steam cracking dataset but is also applicable in other engineering scenarios. The methods for feature extraction, clustering, and prediction are adaptable to the needs of users. The proposed framework is expected to assist petrochemical enterprises in precise process modeling and subsequent control or optimization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The research work reported here was supported by the National Key Research and Development Program of China (2021YFB4000500, 2021YFB4000501, and 2021YFB4000502).
Supplementary Material
Detailed steps and pseudocode of LARD-MARS, detailed results of k-means clustering, prediction results by LARD-MARS, and dataset examples (Word document). Supplementary data to this article can be found online at https://doi.org/10.1016/j.cjche.2023.03.020.