ZHENG Kai-yi, FENG Tao, ZHANG Wen, HUANG Xiao-wei, LI Zhi-hua, ZHANG Di, SHI Ji-yong, Yoshinori Marunaka, ZOU Xiao-bo*
1. Key Laboratory of Modern Agriculture Equipment and Technology, School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, China
2. Department of Molecular Cell Physiology, Graduate School of Medical Science, Kyoto Prefectural University of Medicine, Kyoto 602-8566, Japan
Abstract Selecting the samples of the transfer set is an important step in calibration transfer. Its purpose is to choose standard samples, measured as both primary and secondary spectra, that share the same concentrations. A transfer model between the primary and secondary spectra can then be generated. Finally, the prediction set of the secondary spectra is corrected by the transfer model and estimated by the model built on the primary spectra. Commonly used sample selection methods include Kennard-Stone (KS), SPXY and SPXYE. Based on the features of these methods, a new sample selection method called weighted SPXYE (WSPXYE) is proposed and applied to transfer set selection. WSPXYE first defines the distance between each pair of samples, which is composed of the normalized distances between spectra (dxs), concentrations (dys) and errors (des). The weighted sum of these three distances is the WSPXYE distance: dwspxye=αdxs+βdys+(1-α-β)des. Samples with large values of dwspxye are then selected as the transfer set. WSPXYE generalizes KS, SPXY and SPXYE, which are special cases of WSPXYE with the weights α and β set to 1 and 0 (KS), 0.5 and 0.5 (SPXY), and 0.333 and 0.333 (SPXYE), respectively. Two calibration transfer methods, direct standardization (DS) and canonical correlation analysis combined with informative component extraction (CCA-ICE), were applied to test the transfer sets selected by WSPXYE. The results showed that WSPXYE could choose proper weights to select good transfer samples and achieve low errors in both the validation and prediction sets.
Keywords WSPXYE; Kennard-Stone; SPXY; SPXYE; Sample selection; Calibration transfer
Near-infrared (NIR) spectroscopy has been widely used in pharmaceutical[1], environmental[2] and agricultural[3] research because it offers non-destructive testing, easy operation and fast analysis. As an indirect analysis method, NIR requires a calibration model to be constructed in advance, for example by partial least squares (PLS)[4] or principal component regression (PCR)[5]. Although these linear models perform well in NIR spectral analysis, a reliable model calibrated on one batch of spectra usually cannot be applied to another batch because of baseline drift, wavelength drift and absorbance fluctuations. The problem can be solved by constructing a separate model for each batch of spectra. However, doing so may be impractical, since building many models is time-consuming. Calibration transfer is therefore an alternative solution to this problem.
For a pair of spectra batches in calibration transfer, the spectra used to construct the model are called primary spectra, while the spectra that have no model of their own and rely on the model of the primary spectra are called secondary spectra. In recent years, many calibration transfer methods have been proposed, including DS[6-8], PDS[9-11], CCA[12-13], SEPA[14-15], CTWM[16], TEAM[17], CT-VPdtw[18], MWFFT[19], and others. Among them, canonical correlation analysis combined with informative component extraction (CCA-ICE)[20] has shown good results for calibration transfer.
In addition to the calibration transfer models, the method used to select samples for the transfer set is also important. Sample selection methods such as Kennard-Stone (KS)[21], SPXY[22] and SPXYE[23] have been proposed and widely used in calibration transfer. All of these methods rely on distances between samples in the values of x, y and the calibration errors (e). We conjecture that the distances in x, y and e may differ in importance for sample selection and should therefore be assigned different weights. In this paper, weighted SPXYE (WSPXYE) is proposed to adjust the weights of the x, y and e distances, and it is adopted for sample selection in calibration transfer.
The primary and secondary spectra are denoted by the matrices A and B, respectively. The transfer and calibration sets of spectra A are denoted by At and Ac, while the transfer, validation and prediction sets of spectra B are denoted by Bt, Bv and Bp, respectively. The sample concentrations are denoted by y. At is obtained from Ac by the WSPXYE method, and the samples of spectra B with the same concentrations as those of At are assigned as Bt.
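To make the roles of At, Bt and Bp concrete, the following minimal sketch shows how a transfer set might be used in direct standardization. It assumes the common least-squares form of DS, in which a transfer matrix F satisfies At ≈ Bt F and is estimated with a pseudo-inverse; the function names `ds_transfer_matrix` and `ds_correct` are illustrative and do not come from the paper.

```python
import numpy as np

def ds_transfer_matrix(At, Bt):
    """Estimate F such that Bt @ F approximates At (assumed least-squares DS)."""
    return np.linalg.pinv(Bt) @ At

def ds_correct(B, F):
    """Map secondary spectra B into the domain of the primary spectra."""
    return B @ F
```

The corrected spectra `ds_correct(Bp, F)` would then be passed to the model calibrated on the primary spectra.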
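Similar to SPXYE[23], the distances in x, y and e for the calibration set of primary spectra with n samples are defined as follows, taking the distance in x as the Euclidean distance between spectra, as in KS and SPXY, with x_p(j) the value of spectrum p at the j-th of J wavelength points:
$$d_x(p,q)=\sqrt{\sum_{j=1}^{J}\left[x_p(j)-x_q(j)\right]^2},\quad p,q\in[1,n] \tag{1}$$
$$d_y(p,q)=|y_p-y_q|,\quad p,q\in[1,n] \tag{2}$$
$$d_e(p,q)=|e_p-e_q|,\quad p,q\in[1,n] \tag{3}$$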
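Here, dx(p,q), dy(p,q) and de(p,q) are the distances between samples p and q in x, y and e, respectively. To scale all of these distances to the range between 0 and 1, each is divided by its maximum value over the calibration set, as in SPXY:
$$d_{xt}(p,q)=\frac{d_x(p,q)}{\max_{p,q\in[1,n]}d_x(p,q)} \tag{4}$$
$$d_{yt}(p,q)=\frac{d_y(p,q)}{\max_{p,q\in[1,n]}d_y(p,q)} \tag{5}$$
$$d_{et}(p,q)=\frac{d_e(p,q)}{\max_{p,q\in[1,n]}d_e(p,q)} \tag{6}$$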
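The distance definitions and their scaling can be sketched in a few lines of NumPy. This is an illustration under two assumptions that the extracted text does not confirm: the distance between spectra is Euclidean, and each distance matrix is scaled by its maximum entry. The function name `normalized_distances` is hypothetical.

```python
import numpy as np

def normalized_distances(X, y, e):
    """Return pairwise distance matrices for x, y and e, each scaled to [0, 1]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    e = np.asarray(e, dtype=float)
    # Assumed Euclidean distance between spectra (rows of X).
    diff = X[:, None, :] - X[None, :, :]
    dx = np.sqrt((diff ** 2).sum(axis=2))
    # Absolute differences in concentration and in calibration error.
    dy = np.abs(y[:, None] - y[None, :])
    de = np.abs(e[:, None] - e[None, :])

    def scale(d):
        # Assumed normalization: divide by the largest pairwise distance.
        top = d.max()
        return d / top if top > 0 else d

    return scale(dx), scale(dy), scale(de)
```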
After defining the treated distances in x, y and e, the WSPXYE distance can be written as
$$d_{wspxye}(p,q)=\alpha d_{xt}(p,q)+\beta d_{yt}(p,q)+(1-\alpha-\beta)d_{et}(p,q) \tag{7}$$
Here, α and β are the weights of dxt(p,q) and dyt(p,q), respectively. To balance the three weights, the weight of det(p,q) is set to 1-α-β, so that the three weights sum to one. Moreover, for all three weights to be non-negative, α and β should meet the following conditions:
$$0\le\alpha\le1 \tag{8}$$
$$0\le\beta\le1 \tag{9}$$
$$0\le(1-\alpha-\beta)\le1 \tag{10}$$
Inequality (10) can be simplified as
$$0\le(\alpha+\beta)\le1 \tag{11}$$
Clearly, KS (α=1, β=0), SPXY (α=β=0.5) and SPXYE (α=β=0.333) are all special cases of WSPXYE. Beyond merging KS, SPXY and SPXYE, WSPXYE can use the parameters α and β to adjust the weights of the three distances for sample selection. Thus, WSPXYE is a generalization of KS, SPXY and SPXYE.
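As a small illustration of the weight constraints and the special cases, the sketch below maps a pair (α, β) to the three weights of x, y and e and rejects pairs outside the admissible region. The function name `wspxye_weights` and the tolerance argument are hypothetical, and SPXYE is represented with the exact value 1/3 rather than the rounded 0.333.

```python
def wspxye_weights(alpha, beta, tol=1e-12):
    """Return the weights of x, y and e for a given (alpha, beta)."""
    if not (0 <= alpha <= 1 and 0 <= beta <= 1):
        raise ValueError("alpha and beta must lie in [0, 1]")
    gamma = 1 - alpha - beta
    if gamma < -tol:
        raise ValueError("alpha + beta must not exceed 1")
    # Clamp tiny negative values caused by floating-point rounding.
    return alpha, beta, max(gamma, 0.0)

# (alpha, beta) pairs that reproduce the three existing methods.
SPECIAL_CASES = {"KS": (1.0, 0.0), "SPXY": (0.5, 0.5), "SPXYE": (1 / 3, 1 / 3)}
```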
Similar to the KS and SPXY methods, WSPXYE selects samples from the calibration set of the primary spectra. The procedure by which WSPXYE selects samples from a calibration set of size n can be described as follows:
(1) Define the parameters, including α, β and the number (m, m
(2) Compute the distance (dwspxye) between every pair of samples in Su. Then find the two samples (s1 and s2) with the largest dwspxye, allocate them to Se and remove them from Su.
(3) Compute the dwspxye between each sample in Su and each sample in Se. Choose the sample in Su with the maximum dwspxye, allocate it to Se as s3 and remove it from Su.
(4) Repeat step (3) m-3 times to move the remaining m-3 samples from Su to Se.
(5) Finally, the m selected samples in Se are assigned as At. The samples of spectra B with the same concentrations as those of At are assigned as Bt. Thus, At and Bt can be applied to calibration transfer.
The corn dataset, obtained from http://www.eigenvector.com/data/Corn/index.html, contains three batches of spectra named m5, mp5 and mp6. Each batch contains 80 samples covering 1100~2498 nm with 700 data points. Among the three batches, mp6 was set as the primary spectra, while mp5 and m5 were each set as secondary spectra. The oil values were set as y. The values of y were sorted in ascending order, and the middle one of every five contiguous samples was set aside; the remaining 64 samples were assigned as the calibration set. Of the 16 samples set aside, the first and second of every two contiguous samples were assigned to the validation and prediction sets, respectively. Thus, the validation and prediction sets each contain eight samples.
By searching, the number of latent variables (l) was set to nine, because this gave a small root mean square error of cross-validation (RMSECV) for the calibration set of the primary spectra. In addition to l, the number of samples in the transfer set (m) should also be examined for DS. Thus, the root mean square error of validation (RMSEV) was computed at different values of m.
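Steps (2) to (4) of the selection procedure described above can be sketched as follows. Step (3) does not spell out how the distances between a candidate in Su and the samples already in Se are combined; this sketch assumes the usual Kennard-Stone rule, in which each candidate is scored by its minimum distance to the selected samples and the candidate with the largest score is chosen. The function name `wspxye_select` and its arguments are hypothetical, and the three inputs are the already normalized n×n distance matrices.

```python
import numpy as np

def wspxye_select(dxt, dyt, det, alpha, beta, m):
    """Return the indices of m samples chosen by the weighted distance."""
    d = alpha * dxt + beta * dyt + (1 - alpha - beta) * det
    n = d.shape[0]
    if not 2 <= m <= n:
        raise ValueError("m must satisfy 2 <= m <= n")
    # Step (2): start from the pair with the largest distance,
    # masking the diagonal so a sample is never paired with itself.
    off_diag = d.copy()
    np.fill_diagonal(off_diag, -np.inf)
    p, q = np.unravel_index(np.argmax(off_diag), off_diag.shape)
    selected = [int(p), int(q)]
    remaining = [i for i in range(n) if i not in selected]
    # Steps (3) and (4): assumed max-min rule, as in Kennard-Stone.
    while len(selected) < m:
        min_to_selected = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_to_selected))]
        selected.append(pick)
        remaining.remove(pick)
    return selected
```

With α=1 and β=0 the function reduces to a KS-style selection on the spectral distance alone; the rows of the primary calibration spectra at the returned indices would form At.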
Then, at each value of m, the combination of α and β giving the smallest RMSEV was selected, and the corresponding RMSEV was used for comparison. The results are shown in Fig.1.
Fig.1 The RMSEV (plot a) and RMSEP (plot b) of the KS, SPXY, SPXYE and WSPXYE methods under different m for mp5 to mp6 transferred by DS
In Fig.1, as m increases, the RMSEV decreases at first and remains nearly constant for m>20. Thus, m can be set to 20, so that 20 samples are selected for transfer. In addition to m, the weights α and β should also be examined. To find a value of α with a low error, α was searched between 0 and 1 in steps of 0.01. At each value of α, the RMSEV was computed for different values of β, and the minimum was taken as the RMSEV at that α. The results are shown in Fig.2.
Fig.2 The RMSEV values at different α for mp5 to mp6 transferred by DS
In Fig.2, WSPXYE achieves a small error at α=0.29, and α=0.3 achieves the same error. To find the reason behind this, the RMSEV at different β for α=0.29 and α=0.3 was computed, as shown in Fig.3.
Fig.3 The RMSEV values at different β for mp5 to mp6 transferred by DS at α=0.29 (plot a) and 0.3 (plot b)
In Fig.3, the RMSEV reaches 0.0395 both at α=0.29, β=0.15 and at α=0.3, β=0.15. The reason may be that the two parameter combinations are close and therefore generate the same samples in the transfer set. Since α=0.29 and β=0.15 achieve a small RMSEV, the weights of x, y and e can be set to 0.29, 0.15 and 0.56, respectively. For comparison, the commonly used KS, SPXY and SPXYE methods were also applied to select transfer samples, and the RMSEP of each sample selection method was computed. The results are shown in Table 1.
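The search over α and β described above can be sketched as a simple grid search. The callable `rmsev` is a placeholder for whatever routine selects the transfer set, performs the transfer and returns the validation error for a given (α, β); it is not part of the paper. The grid for β is assumed to use the same step as α and to respect α+β≤1.

```python
def search_weights(rmsev, n_steps=100):
    """Return {alpha: (best_beta, min_rmsev)} over the grid with alpha + beta <= 1."""
    best_per_alpha = {}
    for i in range(n_steps + 1):
        alpha = i / n_steps
        candidates = []
        # beta runs from 0 up to 1 - alpha so that the weight of e stays non-negative.
        for j in range(n_steps - i + 1):
            beta = j / n_steps
            candidates.append((rmsev(alpha, beta), beta))
        err, beta = min(candidates)
        best_per_alpha[alpha] = (beta, err)
    return best_per_alpha
```

The overall best pair is then the α whose stored error is smallest, for example `min(result, key=lambda a: result[a][1])`.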
Table 1 The RMSEV and RMSEP of mp5 to mp6 transferred by DS
In Table 1, WSPXYE achieves smaller errors than the other sample selection methods. To further test the effectiveness of WSPXYE, the RMSEV and RMSEP at different m were also computed, together with the errors of the KS, SPXY and SPXYE methods for comparison. The results are shown in Fig.1. In Fig.1, at each m, both the RMSEV and RMSEP of WSPXYE are lower than those of the KS, SPXY and SPXYE methods. This further suggests that WSPXYE can adjust the weights of x, y and e to select better transfer sets than the other methods.
In addition to DS, CCA-ICE was used as another calibration transfer method to test WSPXYE. The RMSEV and RMSEP of KS, SPXY, SPXYE and WSPXYE at different m are shown in Fig.4. Similar to Fig.1, at each m in Fig.4, WSPXYE generates smaller RMSEV and RMSEP values than the other three methods. Thus, it can be concluded that WSPXYE selects better samples and obtains good transfer results for mp5 to mp6.
For m5 to mp6, DS and CCA-ICE were again applied to test the effect of WSPXYE, and the other sample selection methods, KS, SPXY and SPXYE, were applied for comparison. The corresponding RMSEV and RMSEP values at different transfer set sizes are shown in Fig.5. In plots a and b, at each m, the samples selected by WSPXYE and transferred by DS obtain low errors in both the validation and prediction sets. In plots c and d, except for m=15, WSPXYE still obtains low errors in both the validation and prediction sets. At m=15, the RMSEP of WSPXYE is higher than that of SPXYE but still lower than those of KS and SPXY. In the meantime, the root mean square errors over all samples in the validation and prediction sets combined are 0.0764 (KS), 0.112 (SPXY), 0.0804 (SPXYE) and 0.0647 (WSPXYE), respectively.
Thus, WSPXYE can still obtain low estimation errors for CCA-ICE. Therefore, WSPXYE achieves good results compared with those of the KS, SPXY and SPXYE methods for m5 to mp6.
Fig.4 The RMSEV (plot a) and RMSEP (plot b) of the KS, SPXY, SPXYE and WSPXYE methods under different m for mp5 to mp6 transferred by CCA-ICE
Fig.5 The RMSEV (plot a: DS; plot c: CCA-ICE) and RMSEP (plot b: DS; plot d: CCA-ICE) of m5 to mp6
WSPXYE was proposed to adjust the weights of the distances in x, y and e (errors) when selecting the transfer set for calibration transfer. The KS, SPXY and SPXYE methods are special cases of WSPXYE with the weights of x, y and e set to 1, 0 and 0 (KS); 0.5, 0.5 and 0 (SPXY); and 0.333, 0.333 and 0.333 (SPXYE), respectively. Two calibration transfer methods, CCA-ICE and direct standardization (DS), were applied to test the WSPXYE method. The results showed that WSPXYE could adjust the weights to select proper transfer samples and achieve low errors in both the validation and prediction sets.
2 Datasets description
3 Results and discussion
3.1 The results of mp5 to mp6
3.2 The transfer results of m5 to mp6
4 Conclusion