Pulino Pérez-Rodríguez, Osvl A. Montesinos-López, Aelrdo Montesinos-López,José Cross,d,*
aColegio de Postgraduados, Montecillos, Edo. de México 56230, Mexico bFacultad de Telemática, Universidad de Colima, Colima 28040, Mexico
cDepartamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara,Jalisco 44430, Mexico
dInternational Maize and Wheat Improvement Center (CIMMYT), Km 45 Carretera México-Veracruz, Texcoco, Edo. de México 56237, Mexico
ABSTRACT
Keywords:
Genome-based prediction(GP)aims to predict the breeding values of selection candidates of a testing set for which there is only genotypic information(e.g.,dense molecular markers panels,pedigree information)based on phenotypic and genotypic information from a set of related individuals known as the training population.The concept of genomic enabled prediction based on markers was first introduced almost two decades ago by Meuwissen et al.[1],and since then,many researchers have extended the models to include other sources of information.
Nowadays empirical evidence suggests that genomic prediction models are widely used by the industry on plants and animals and in publicly funded breeding research programs[2–4].Originally,the linear statistical models proposed for GP relied on the assumption that the residuals are distributed as independent and identically distributed random variables with null mean and the same variance,which implies that the response variable is also distributed normally.It is well known that the selection process leads to distributions that are skewed[5],that is,if we select for trait Y and there is another trait of interest(O),then the conditional distribution of Y is nonsymmetric(skew),given that the other trait of interest is greater than a given threshold(o),which can be written as Y∣O>o.In the genomics context and in many other breeding situations,it is usual to find a subset of observations that belong to the response variable that differs significantly from the rest,that is,these observations are considered outliers[6].Outlier observations could be due to several reasons such as missing covariates,ignoring the presence of unknown segregating QTL or because of Genotype×Environment interaction[7].Therefore,if genomic data are analyzed under these circumstances,inferences could potentially lead to incorrect results.
There are several statistical approaches to deal with nonsymmetric distributions,some of which are well known:1)transformations[8]and 2)robust regression approaches.In the case of robust regression,the assumption of normality is relaxed,for example,when including residuals with skew distributions or thick tails[6,7,9–11].Montesinos López et al.[6]proposed a model with errors assumed to follow a Laplace distribution which leads to a median regression model;it provides good prediction even when outliers are present in the data;this model is a special case of a more general quantile regression model.Nascimento et al.[10]argued that asymmetry can be overcome with regularized quantile regression and used the model in the genomic context obtaining promising results.Therefore,quantile regression has the potential to simultaneously deal with asymmetric responses and the presence of outliers.Quantile regression also provides a powerful model that can be used to study the relationship between the predictors and the response variable for a set of quantiles,thus providing a more complete view of the phenomenon under study[10,12–14].This is particularly important in the case of genomic selection,for example,because quantile regression can potentially improve the estimation of marker effects for a given quantile and also opens the theoretical possibility of selecting the quantile function that best represents the relationship between predictors and response,although this is challenging[10].On the other hand,researchers who use quantile regression are typically interested in well-defined quantiles that are fixed[15]in the context of genomic selection;for example,a breeder can be interested in obtaining reliable genomic breeding values for high yielding varieties or those that are less affected by a disease.
In this study we develop a Bayesian regularized quantile regression(BRQR)model that allows simultaneously including dense molecular markers as well as the additive relationship matrix derived from pedigree.We believe that the model proposed in this study has not been published before in the context of genome-based prediction.The proposed model is an extension of the model developed by Li et al.[14]and is a generalization of the model developed by Montesinos-López et al.[6].
The organization of the article is as follows.In the Materials and methods section,we introduce the quantile regression,both from the frequentist and from the Bayesian perspective;then we introduce the BRQR model with markers and pedigree.We use the Bayesian ridge regression model as a reference when comparing the predictive power of the proposed BRQR model.Next we describe a simulation experiment to assess the performance of the proposed model when the distribution of the response is non symmetric and outliers are present.Next,we present an application of BRQR with real data and we studied the predictive ability of the proposed model through random cross-validation.Finally,we describe the results and present a brief discussion of the results.We included as supplementary material the derivation of the full conditional distributions necessary to implement a Gibbs sampler and R codes that implement the proposed algorithms.
Consider the following linear model:
The introduction of penalty balances model goodness of fit and complexity[18].From a Bayesian perspective,regularization is done automatically by selecting appropriate prior distributions for the regression parameters.Yu and Moyeed[13]were the first to introduce the Bayesian representation of quantile regression,where the model is fitted using the Metropolis-Hastings algorithm[19,20],which does not scale well in the context of high dimensional data.
Several authors developed hierarchical representations of the model that makes it possible to easily derive conditional distributions of the parameters of interest and implement a Gibbs sampler[14,15,21–23].Li et al.[14]proposed a Bayesian hierarchical representation of the model and,in a unifying framework,presented a set of regularization approaches for this model.The model we propose allows efficient implementation of a Gibbs sampler[24,25]to sample from the posterior distribution and obtain estimates of the parameters of interest.
Let's assume that we have a matrix of markers X,of dimensions n×p,with xij∈{0,1,2},i=1,…,n,j=1,…,p representing the number of copies of a biallelic marker,for example,SNPs,although it can be used with other type of markers,for example,DArT.Let A be a relationship matrix derived from pedigree,Z a matrix that connects phenotypes with genotypes of dimensions n×q.A BRQR that includes an intercept,markers and relationship matrix derived from pedigree jointly using the hierarchical representation in[14]is as follows:
2.3.1.Sampling model and likelihood function
Assuming a random sample from model(4),the conditional distribution yi|μ,β,,τ,viis normal with meanμ+ξ1viand variancevi.The joint conditional distribution of yiand the unobserved random variable viis p(yi,vi|μ,β,,τ)=p(yi|μ,β,,τ,vi)p(vi|τ);therefore the augmented likelihood is given by:
2.3.2.Prior distributions
where H={dfβ,Sβ,dfa,Sa,a,b}is the set of hyper-parameters.
2.3.3.Posterior distribution
The joint posterior distribution of all unknown quantities can be obtained by applying the Bayes theorem,so by combining Eqs.(5)and(6)we obtain:
2.3.4.BRQR with markers(BRQR M)
This model is a special case of model(4)obtained by removing the linear predictor associated with the relationship matrix derived from pedigree,that is:
2.3.5.BRQR with pedigree(BRQR A)
Similarly,this model is a special case of model(4)obtained by removing the linear predictor associated with the markers,and thus obtaining:
Here we consider a linear regression model with normal errors.This model is widely used in genomic prediction[29,30]and we include it as a reference.The model is given by:
The Gibbs sampler algorithm to fit BRQR models was implemented in a program written in R[31].The routine takes as arguments the vector of phenotypes y,matrices X,Z,A,the number of MCMC iterations,a burn-in period,and the hyper-parameters H,which are initialized with the previously described default values,but can be modified.The output of the routine consists of estimates of posterior means of all model unknowns,and in the case of missing values for the phenotypes,it provides the mean of the predictive distribution obtained through the MCMC algorithm.The routine is included as supplementary material.In the case of the BRR model,it can be fitted in the BGLR library[28].
Here we considered a simulation experiment to assess the performance of the proposed model in the presence of outliers when dealing with skewed errors.In order to simplify simulations,we considered the case where only markers are available.The simulation presented here takes elements from[6,11].We generated semi-synthetic data through simulation,and we considered a set of 1279 DArT markers coded as 0 and 1,for 599 wheat lines analyzed originally by Crossa et al.[30],and after that,by many others.We simulated the data using the following linear model:
2.7.1.Maize
Here we considered a dataset from the Drought Tolerance Maize(DTMA)project of CIMMYT's Global Maize Program(http://www.cimmyt.org).The dataset has been analyzed several times[11,34].The dataset consists of 300 tropical inbred lines genotyped using 1152 SNPs markers.The response variable is gray leaf spot(GLS)caused by the fungus Cercospora zeae-maydis.The disease is measured on a scale from 1 to 5,where 1=no disease,2=low infection,3=moderate infection,4=high infection and 5=totally infected.
The experiment was conducted at three sites:Kakamega(Kenya),San Pedro Lagunillas(Mexico)and Santa Catalina(Colombia).An additive relationship matrix(A)derived from the pedigree of the lines is also available.According to Pérez-Rodríguez et al.[11],the skewness index for GLS evaluated in Kakamega is 0.4785,0.7276 for San Pedro Lagunillas and 0.3111 for Santa Catalina.The values are positive in all cases;therefore the data are skewed to the right,which implies that the right tail is long,relative to the left tail,so that most of the mass of the distribution is concentrated around small values of the GLS disease(Fig.1).
2.7.2.Wheat
The wheat dataset considered here is a subset of the data analyzed by Pérez-Rodríguez et al.[35].The response variable is the days to heading in two locations:Agua Fria and Cd.Obregon in Mexico,where wheat plants were grown under standard and drought agronomic management.The dataset corresponds to elite lines from CIMMYT's Global Wheat Program(http://www.cimmyt.org).The data include 306 lines that were genotyped with 1717 diversity array technology(DArT)markers.An additive relationship matrix(A)derived from pedigree for the lines is also available,but it was not originally included in the prediction models.The skewness index for DTH evaluated in Cd.Obregon is-1.9926,and 1.3641 in the case of Agua Fria.Fig.2 shows violin plots for these data,from which it is clear that the data are skewed.
2.7.3.Random cross-validation
We fitted the BRQR and BRR models including:1)pedigree,2)markers,and 3)pedigree and markers jointly.In the case of BRQR,we selected three quantiles:θ={0.25,0.50,0.75}[10].In order to evaluate the predictive ability of the proposed models,we performed cross-validation analyses.We partitioned the data into training,validation,and testing data.For each environment we generated 50 random partitions with 70% of the observations in training,10% in validation,and 20% in the testing set.The idea is to evaluate the model's predictive ability using observations in the testing set.The BRQR models depend on the unknown parameterθ,which in real applications of genome-based prediction where phenotypes in the testing set are not known,must be set before doing the predictions;therefore,a method to set this parameter is needed.The selection of the appropriate quantile that“best”represents the relationship between phenotypes and predictors is a challenging task[10].Here we propose using the validation set to tune this parameter;this approach is widely used in machine learning[36].BRQR models were fitted with 70% of observations in the training set,and the 10% of observations in the validation set were used to compute the mean squared error,which can be used as a criterion for selecting between models in cross-validation settings.BRR models were fitted using 70%of observations in the training set and 10% of observations in the validation set(80% in total).In all the cases markers were centered and standardized.After fitting the models,we calculated the Pearson's correlation between observed and predicted values in the testing set.Inferences were based on 25,000 MCMC samples obtained after discarding 25,000 samples that were taken as burn-in.
Fig.2–Violin plots for days to heading(DTH)in two locations:Agua Fria and Cd.Obregon,Mexico.The mean is represented by the‘×’symbol and the median by the horizontal line inside the box.
Table 1 presents the results from the simulation experiment for different degrees of skewness and different proportions of outliers.From the column that corresponds to correlations between‘true’marker effects and estimated marker effects,it can be seen that in general the correlations for BQRR are higher than those obtained with the BRR.The result is even clearer with higher skewness and proportion of outliers.A similar pattern is observed in the case of correlations between‘true’signal(Xβ)and estimated signal(X^β).In the case of DIC?,results were mixed,but in general the index favored models with low quantiles(0.25 and 0.50),which are the ones with the highest correlations between‘true’and estimated marker effects and also the highest correlation between‘true’and estimated signals.DIC?also favored BRQR models,and in all cases the DIC?was larger for the BRR model than for the BRQR models.
3.2.1.Maize
Results from the random cross-validation obtained as the average Pearson correlation between the observed and predicted values for fitting the BRQR and the BRR models are given in Table 3.The genomic-enabled prediction accuracy measured by the average Pearson correlation shows better prediction accuracy for model BRQRθ=0.25 using A,M and A+M than for the others at the Kakamega site.However,model BRR gave slightly higher prediction accuracy than those obtained by BRQR for sites Santa Catalina and San Pedro Lagunillas.Table 3 presents the results of the mean squared error for the validation set for the BRQR models;for the Kakamega site,the mean squared error in the validation set favored the BRQR withθ=0.25,which is the one with the highest correlations in the testing set.In the case of Santa Catalina and San Pedro Lagunillas,most of the times the mean squared error favored the models withθ=0.25,which agrees with the correlations in Table 2.Thus the mean squared error in validation can be a useful tool for selecting the appropriate value of the parameterθ.
Table 1–Correlations between true and estimated marker effects,correlations between true and estimated signals and estimated variance components associated with the residuals and DIC*for different degrees of skewness and different proportions of outliers.
3.2.2.Wheat
Table 4 shows the results of the cross-validation analysis for days to heading.In the case of Agua Fria,BRQR models had higher correlations in two of the three scenarios considered.For Cd.Obregon,BRQR models gave the best prediction accuracy in the three scenarios evaluated.Table 5 presents the results of the cross-validation analysis for the mean squared error in the validation set for BRQR models.As expected,low values of the mean squared error in the validation set are associated with high correlations between observed and predicted values in the testing set.
Table 2–Average of Pearson's correlation and standard deviation(in parentheses)between observed and predicted values in the testing dataset from crossvalidation analysis for gray leaf spot.
Table 3–Mean squared error and standard deviation(in parentheses)for the testing dataset from cross-validation analysis for gray leaf spot.
Here we introduced a model that is an alternative for skew data and robust in the presence of outliers.Results of simulation show that the model is able to estimate the marker effects more precisely than the BRR model and therefore the signal from the data,i.e.,the Genomic Estimated Breeding Values,could potentially also be estimated better with this model.The advantage of the model is more evident when the data are more skewed and it works well even in the presence of outliers;similar results were reported by[6,7].The results for the real datasets were mixed,but in crossvalidation,the BRQR model was better in three of the five datasets.The proposed model is a robust alternative to the widely used G-BLUP and A-BLUP models,and its prediction ability is at the same level when no outliers are present.We considered only three quantiles,θ={0.25,0.50,0.75};a similar approach was employed by Nascimento et al.[10].We also developed an algorithmic approach that allows selecting the quantile to be used to predict individuals in the testing set;this approach is based on cross-validation and is widely used in machine learning[36].This approach effectively reduces the training data size in the training set,so there is still room for research on generating new algorithms that allows setting this parameter more efficiently.Although we fitted the BRRQ models with smaller sample sizes in comparison with BRR models,the prediction accuracy of the proposed model was similar and,in some cases,better than the accuracy of the competing models.Nascimento et al.[10]compared the accuracy of the regularized quantile regression model against Bayesian LASSO with simulated data that had a heritability of 0.25 and reported gains in terms of prediction accuracy(Pearson's correlation between true and predicted breeding values)that ranged from 55% to 86%.Nascimento et al.[12]compared the predictive ability of Regularized QR models against several Bayesian linear models using flowering time in common bean and reported gains also in terms of prediction accuracy(Pearson's correlation between observedand predicted phenotypes)and reported gains that ranged from 12%to 25%.In our case,the gains in prediction accuracy(when present)are much smaller.
Table 4–Average of Pearson's correlation and standard deviation(in parentheses)between observed and predicted values in the testing dataset from crossvalidation analysis for days to heading.
Table 5–Mean squared error and standard deviation(in parentheses)for the testing dataset from cross-validation analysis for days to heading.
Nowadays,GBLUP and ABLUP models are widely used in GS,and there are many software packages that can fit these models.On the other hand,only a few software packages in the public domain or that were developed commercially are able to fit robust regression models.In the case of BRQR,to our knowledge,BayesQR R package[15]contains the most recent implementations of algorithms to fit this family of models;unfortunately,the BayesQR was not designed to deal with high dimensional data and is not able to fit the model with markers and a relationship matrix derived from pedigree jointly,although for moderate size datasets,other software packages can be used,for example,JAGS[37]or Stan[38].
The computations required to fit BRQR are more complex than those required for fitting,for example,BRR or equivalently G-BLUP.Indeed,our implementation of the Gibbs algorithm without any special optimization or fine tuning to fit BRQR is much slower than the standard Ridge Regression in the BGLR package[28].For example,when completing 1000 cycles of the Gibbs sampler and fitting the BRQR M using the wheat dataset(described previously)and implementing the Gibbs sampler,it took~35 s on a laptop computer with an intel i7 processor@2.8 GHz and 16 Gb of RAM memory.On the other hand,when fitting the BRR in the BGLR package,it took only~1.5 s.
Here we studied how to include markers and a relationship matrix derived from pedigree jointly.In the case of markers,several empirical studies have shown that Reproducing Kernel Hilbert Spaces Methods with Kernel averaging(RKHS)[39],when evaluated in cross-validation settings,exhibit better predictive ability that GBLUP,so a natural step will be to extend BRQR to include Reproducing Kernels.Another topic of interest would be to extend the BRQR model to include multi-traits and multi-environments,and especially,how to include genotype×environment interaction in the BRQR.
Author contributions
Paulino Pérez-Rodríguez,had the initial idea,ran the analyses,and wrote the first version of the article.Osval A.Montesinos-López,Abelardo Montesinos-López,and JoséCrossa read and wrote several other versions of the article,improved the writing,and checked the tables and references.
Declaration of competing interest
The authors declare no conflicts of interest.
Acknowledgments
The maize and wheat data set used in this study comes from the Drought Tolerance Maize for Africa Project and from CIMMYT's Global Wheat Program.We are thankful to everyone who generated the data used in this article.
Appendix A.Supplementary materials
Supplementary materials for this article can be found online at https://doi.org/10.1016/j.cj.2020.04.009.