A multivariate partial least squares approach to joint association analysis for multiple correlated traits

2016-04-05 08:41:03 Yang Xu, Wenming Hu, Zefeng Yang, Chenwu Xu

The Crop Journal, 2016, Issue 1

Yang Xu1, Wenming Hu1, Zefeng Yang, Chenwu Xu*

Jiangsu Key Laboratory of Crop Genetics and Physiology, Co-Innovation Center for Modern Production Technology of Grain Crops, Key Laboratory of Plant Functional Genomics of the Ministry of Education, Yangzhou University, Yangzhou 225009, China

A R T I C L E I N F O

Article history:

Received 29 July 2015

Received in revised form 8 November 2015

Accepted 27 November 2015

Available online 3 December 2015

Keywords:

Association analysis

Multiple correlated traits

Supersaturated model

Multilocus

Multivariate partial least squares

A B S T R A C T

Many complex traits are highly correlated rather than independent. By taking the correlation structure of multiple traits into account, joint association analyses can achieve both higher statistical power and more accurate estimation. To develop a statistical approach to joint association analysis that includes allele detection and genetic effect estimation, we combined multivariate partial least squares regression with variable selection strategies and selected the optimal model using the Bayesian Information Criterion (BIC). We then performed extensive simulations under varying heritabilities and sample sizes to compare the performance of our method with that of single-trait multilocus methods. Joint association analysis has measurable advantages over single-trait methods, as it exhibits superior gene detection power, especially for pleiotropic genes. Sample size, heritability, polymorphic information content (PIC), and magnitude of gene effects influence the statistical power, accuracy, and precision of effect estimation by the joint association analysis.

© 2015 Crop Science Society of China and Institute of Crop Science, CAAS. Production and hosting by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

* Corresponding author. Tel./fax: +86 514 87979358.

E-mail address: qtls@yzu.edu.cn (C. Xu).

Peer review under responsibility of Crop Science Society of China and Institute of Crop Science, CAAS.

1 These authors contributed equally to this work.

http://dx.doi.org/10.1016/j.cj.2015.11.001

2214-5141/© 2015 Crop Science Society of China and Institute of Crop Science, CAAS.

1. Introduction

The recent development of next-generation sequencing and high-throughput genotyping has stimulated interest in identifying quantitative trait loci (QTL) for complex traits in plants using association mapping [1]. Association mapping, also known as linkage disequilibrium mapping, is an approach to identify the relationship between target traits and molecular markers or candidate genes in natural populations based on linkage disequilibrium [2]. Association analyses achieve higher resolution than conventional mapping methods. The application of association mapping to plants has advanced, though it started relatively late [3,4]. Association analysis has not only corroborated previous findings, but also identified previously unknown loci, greatly assisting molecular breeding [5–8].

Most existing association methods can be classified as either single-locus or multilocus approaches. Single-locus methods detect associations between each locus and each trait based on individual test statistics. They are currently popular in association mapping, but they have several shortcomings [9,10]. First, they ignore linkage disequilibrium information contained among multiple loci. Second, these methods typically have low power, owing to the multiple testing procedures that are required to control the false discovery rate. In contrast, multilocus association studies can overcome these problems [11–13]. In multilocus association analyses, target trait values and allelic variation within candidate genomic regions can be treated as response and independent variables, respectively. In this case, the number of independent variables greatly exceeds the number of observations, producing a supersaturated model. Many statistical methods have been developed to deal with variable selection under supersaturated models. Stepwise regression is the most common subset selection method [14], but it has severe difficulty in handling collinearity [15]; Hoerl and Kennard [16] proposed ridge regression to address this collinearity problem. Compared with ordinary least squares (OLS), ridge regression gives smaller mean squared errors by imposing a penalty on the regression coefficients. The least absolute shrinkage and selection operator (LASSO) is an improvement over ridge regression [17]. The only difference between ridge regression and LASSO is that the former uses the sum of the squared values of the coefficients as a penalty term, whereas the latter uses the sum of their absolute values. This small adjustment in LASSO shrinks some coefficients to zero. However, LASSO becomes unsuitable when the number of independent variables is much larger than the number of observations, as the number of selected variables is bounded by the sample size [18]. Elastic net, a hybrid of ridge regression and LASSO, overcomes this limitation of LASSO by removing the bound on the number of selected variables [19].
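The three penalties compared above can be written side by side. The notation here is an assumption introduced only for illustration (a single response vector y, design matrix X, coefficient vector β with m entries, and tuning parameters λ, λ1, λ2 ≥ 0); it is not taken from the cited papers.

```latex
\begin{aligned}
\hat{\beta}^{\mathrm{ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{m} \beta_j^{2},\\
\hat{\beta}^{\mathrm{LASSO}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{m} \lvert \beta_j \rvert,\\
\hat{\beta}^{\mathrm{EN}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda_1 \sum_{j=1}^{m} \lvert \beta_j \rvert + \lambda_2 \sum_{j=1}^{m} \beta_j^{2}.
\end{aligned}
```

Setting λ1 = 0 in the elastic net recovers ridge regression and setting λ2 = 0 recovers LASSO, which is the sense in which it is a hybrid of the two.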

Besides these shrinkage methods, factor analysis-based methods, such as principal component regression (PCR) and partial least squares (PLS), have received attention as multilocus association mapping techniques [9,10,20,21]. Both PCR and PLS are methods that reduce dimensionality by transforming the original data into a new set of linearly uncorrelated variables called components, but the two methods construct these components in different ways. PCR constructs them by maximizing the explained variance of the independent variables without considering the correlations among independent variables and responses [22], whereas PLS takes such correlations into account [23]. Accordingly, PLS is able to fit response variables with fewer components. The general algorithm for computing PLS components is the nonlinear iterative partial least squares (NIPALS) algorithm [24]. Alternatively, either the kernel algorithm [25] or the SIMPLS algorithm [26] can be used. These algorithms differ slightly in numerical accuracy. Bayesian-based variable selection methods are also available [27], including Gibbs variable selection [28], adaptive shrinkage, and stochastic search variable selection (SSVS) [29]. SSVS, a Markov chain Monte Carlo sampling-based method, assumes that each independent variable follows a mixture of two normal distributions, and has been extensively used in gene mapping [29].

However, the above methods are still implemented at the single-trait level. In biological research, many datasets contain observations of multiple correlated traits. Unlike single-trait association analysis, which fails to extract additional information from correlated traits, joint association analysis makes explicit use of the correlation structure among such traits [30]. It is thus likely that multiple trait analyses will achieve greater statistical power for gene detection and more accurate estimates of genetic effects. For molecular breeding, it is essential to accurately identify genes and precisely estimate genetic effects. There have been a few studies focusing on variable selection under multivariate supersaturated models. Graph-guided fused LASSO has been developed for joint analysis of multiple traits [31]. Additionally, multivariate PLS-based methods have been introduced. For instance, Yin, Zhang and Liu [32] proposed a two-stage variable selection model that uses a modified Akaike information criterion to screen the variables; similarly, Andries, Heyden and Buydens [33] used four terms (the mean and norm of the regression coefficient and their significance levels) to determine the optimal model. However, these multivariate PLS methods have not been used for joint association mapping.

The objective of this study was to develop a statistical approach for multivariate association mapping that detects elite alleles and estimates genetic effects. We combined multivariate PLS with stepwise regression to reduce the variable dimension and then selected the optimal model using the Bayesian information criterion (BIC). We applied our method to simulated data and compared its performance with those of LASSO [17] and PLS-based MLAS (multilocus association studies) [9]. We chose these two methods as representatives of single-trait approaches because of their popularity in recent association mapping studies [34–37].

2. Methods

2.1. Model construction

Table 1–Encoding of the a_i independent variables for the a_i + 1 alleles at candidate gene i.

Allele      Independent variable
            x_1    x_2    …    x_{a_i}
1            1      0     …     0
2            0      1     …     0
…            …      …     …     …
a_i          0      0     …     1
a_i + 1     −1     −1     …    −1

Each locus or candidate gene is treated as an independent variable. The encoding method assumes that the number of loci or genes is g and that the ith locus or gene has $a_i + 1$ ($a_i \ge 0$) alleles, so that the degrees of freedom are $a_i$. Thus, the g loci or candidate genes have $\sum_{i=1}^{g} a_i$ independent variables in total. The full-rank code for the $a_i + 1$ alleles at each candidate gene i is listed in Table 1. For example, if the ith gene has two alleles, it has one independent variable and the codes for the two alleles are 1 and −1, respectively.
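The coding in Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `encode_allele` and `design_row` are hypothetical, alleles are indexed from 1, and each individual is assumed to carry a single allele per locus.

```python
def encode_allele(allele, n_alleles):
    """Return the full-rank code for one allele at a locus.

    `allele` is 1-based; a locus with n_alleles = a_i + 1 alleles yields
    a_i = n_alleles - 1 independent variables, as in Table 1.
    """
    if not 1 <= allele <= n_alleles:
        raise ValueError("allele index out of range")
    n_vars = n_alleles - 1
    if allele == n_alleles:
        # The last allele is coded -1 on every variable.
        return [-1] * n_vars
    # Allele d (d <= a_i) is coded 1 on variable d and 0 elsewhere.
    return [1 if j == allele - 1 else 0 for j in range(n_vars)]


def design_row(genotype, alleles_per_locus):
    """Concatenate the codes of all loci into one row of the design matrix X."""
    row = []
    for allele, n_alleles in zip(genotype, alleles_per_locus):
        row.extend(encode_allele(allele, n_alleles))
    return row


# One individual at two loci: allele 2 of 2 at the first, allele 2 of 3 at the second.
print(design_row([2, 2], [2, 3]))  # [-1, 0, 1]
```

With two alleles per locus, as in the simulations, each locus contributes exactly one column coded 1 or −1.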

The phenotypes are described by the linear model

$$Y = 1_n U^{T} + XB + E,$$

where Y is an n×q matrix of q phenotypes for n individuals; $1_n$ is an n-vector of ones; U is a q-vector of the grand means of the q phenotypes; X is an n×m design matrix of genotypes for m loci, with each element assigned according to the variable code; B is an m×q matrix of regression coefficients; and E is an n×q matrix of error terms that are assumed to follow a multivariate Gaussian distribution with mean zero and a q×q variance–covariance matrix.

2.2. Statistical procedure

2.2.1. Parameter estimation based on multivariate PLS

The general underlying model of multivariate PLS is $X = TP^{T} + E$ and $Y = TQ^{T} + F$, where T (n×a) is an X-score matrix that consists of linear combinations of the predictor variables, P (m×a) and Q (q×a) are the respective loading matrices of X and Y, and E and F are error terms. The goal of PLS is to maximize the covariance between T and Y. To establish the model, a weight matrix, W (m×a), is produced for X such that T = XW. According to the inner relation

$$Y = TQ^{T} + F = XWQ^{T} + F,$$

the estimates of the regression coefficients are $B = WQ^{T}$. Detailed algorithms for computing PLS regressions can be obtained from the literature [26]. The regression coefficients can then be used to identify relevant variables.
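To make the relations T = XW and B = WQ^T concrete, the following pure-Python sketch extracts a single PLS component for several responses at once. It is an illustration under stated assumptions, not the algorithm of [26]: the weight vector is found by power iteration on X^T Y (a choice made here for brevity), only one component is computed (no deflation), and the function names are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract the column means from a row-major matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def first_pls_component(X, Y, n_iter=200):
    """One-component multivariate PLS; returns (w, t, q, B) with B = w q^T."""
    X, Y = center_columns(X), center_columns(Y)
    m, k = len(X[0]), len(Y[0])
    # M = X^T Y (m x k): cross-products between predictors and responses.
    M = [[sum(x[j] * y[c] for x, y in zip(X, Y)) for c in range(k)]
         for j in range(m)]
    # Power iteration for the dominant left singular vector of M, the unit
    # weight vector w that maximizes the covariance between t = Xw and Y.
    # Assumes M is not all zeros (X and Y are not completely uncorrelated).
    w = [math.sqrt(sum(v * v for v in row)) for row in M]
    for _ in range(n_iter):
        u = [sum(M[j][c] * w[j] for j in range(m)) for c in range(k)]  # M^T w
        w = [sum(M[j][c] * u[c] for c in range(k)) for j in range(m)]  # M u
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    t = [sum(xj * wj for xj, wj in zip(x, w)) for x in X]               # scores
    tt = sum(v * v for v in t)
    q = [sum(ti * y[c] for ti, y in zip(t, Y)) / tt for c in range(k)]  # Y-loadings
    B = [[wj * qc for qc in q] for wj in w]                             # B = w q^T
    return w, t, q, B


# Two predictors, two correlated responses, five individuals (toy data).
X = [[1, -1], [1, 1], [-1, 1], [-1, -1], [1, 1]]
Y = [[2.1, 1.0], [2.9, 1.6], [-1.2, -0.4], [-2.8, -1.5], [3.1, 1.4]]
w, t, q, B = first_pls_component(X, Y)
print(B)
```

Each row of B corresponds to one coded variable and each column to one trait, so the same component informs the coefficients of every response.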

2.2.2. Variable importance in projection (VIP)

VIP is also a useful variable selection method [23,38]. The VIP score of the jth variable is computed over all latent factors as

$$\mathrm{VIP}_j = \sqrt{\frac{m \sum_{h=1}^{c} SS_h(Y)\, w_{hj}^{2}}{\sum_{h=1}^{c} SS_h(Y)}},$$

where m is the number of variables; c is the number of PLS components; $w_{hj}$ is the PLS weight of the jth (j = 1, 2, …, m) variable for the hth (h = 1, 2, …, c) component; and $SS_h(Y)$ is the proportion of Y explained by the hth component.
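A small sketch of the VIP computation follows. It assumes the commonly used form VIP_j = sqrt(m Σ_h SS_h(Y) w_hj² / Σ_h SS_h(Y)) with each component's weight vector rescaled to unit length; the function name `vip_scores` and the toy numbers are illustrative only.

```python
import math


def vip_scores(weights, ss_y):
    """Variable importance in projection for each of the m variables.

    weights[h][j] is the PLS weight of variable j on component h (each
    weight vector is rescaled to unit length inside the loop);
    ss_y[h] is the proportion of Y explained by component h.
    """
    m = len(weights[0])
    total = sum(ss_y)
    scores = []
    for j in range(m):
        acc = 0.0
        for w_h, ss_h in zip(weights, ss_y):
            norm_sq = sum(v * v for v in w_h)
            acc += ss_h * w_h[j] ** 2 / norm_sq
        scores.append(math.sqrt(m * acc / total))
    return scores


# Three variables, two components explaining 50% and 25% of Y.
scores = vip_scores([[0.6, 0.8, 0.0], [0.0, 0.6, 0.8]], [0.5, 0.25])
print([round(s, 3) for s in scores])
```

Under this normalization the squared VIP scores average to one, which is why a threshold near 1 is often used to flag influential variables.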

2.2.3. Dimension reduction

First, the absolute values of the regression coefficients for each phenotype are calculated, and the independent variables with the largest absolute coefficients are chosen (the number chosen depends on whether n is even or odd) to obtain the subset $S_{kb}$ (k = 1, 2, …, q). Second, the VIP scores are sorted in descending order and the independent variables with the largest VIP scores are chosen in the same way, yielding a new subset $S_{\mathrm{VIP}}$. Finally, these subsets are integrated to form a new set of candidate independent variables, $S_k = S_{kb} \cup S_{\mathrm{VIP}}$. The number of independent variables in $S_k$ is less than n − 1. After this dimension reduction, the supersaturated model is successfully translated to an ordinary linear model.
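The dimension-reduction step amounts to taking the union of two top-ranked index sets. In the sketch below, the cut-off `k` is a hypothetical parameter standing in for the sample-size-dependent number of retained variables, and the function names are illustrative.

```python
def top_k_indices(values, k):
    """Indices of the k largest values (ties broken by lower index)."""
    order = sorted(range(len(values)), key=lambda j: (-values[j], j))
    return set(order[:k])


def candidate_set(coef_k, vip, k):
    """Union of the top-k variables by |coefficient| for one trait and by VIP."""
    s_kb = top_k_indices([abs(b) for b in coef_k], k)   # ranked by |b|
    s_vip = top_k_indices(vip, k)                       # ranked by VIP
    return s_kb | s_vip


coef_trait1 = [0.1, -2.0, 0.5, 0.0, 1.2]   # PLS coefficients for one trait
vip = [1.5, 0.2, 1.1, 0.3, 0.9]            # VIP scores of the same variables
print(sorted(candidate_set(coef_trait1, vip, k=2)))  # [0, 1, 2, 4]
```

Because each of the two subsets holds at most k variables, the union holds at most 2k, which is how the candidate set can be kept below the number of observations.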

2.2.4. Variable selection and effect estimation

Stepwise regression is a common method for variable selection. We use bidirectional elimination with BIC as the selection criterion [39]. BIC is defined as $\mathrm{BIC} = -2L + f\ln(n)$, where L is the maximized log likelihood of the model, f is the number of parameters in the model, and n is the number of observations. We choose the model with the smallest BIC value and then compute the effects of the independent variables and the statistical significance of these effects using OLS.
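The role of the BIC penalty in stepwise selection can be illustrated for a linear model with Gaussian errors, where the maximized log likelihood depends on the data only through the residual sum of squares (RSS). This is a sketch under that Gaussian assumption; the function names and the RSS values are made up for illustration.

```python
import math


def gaussian_log_likelihood(rss, n):
    """Maximized log likelihood of a linear model with Gaussian errors."""
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)


def bic(rss, n, f):
    """BIC = -2 L + f ln(n); the model with the smallest value is preferred."""
    return -2 * gaussian_log_likelihood(rss, n) + f * math.log(n)


# Adding a variable is accepted only if the drop in RSS outweighs ln(n).
n = 100
print(bic(50.0, n, 3) < bic(80.0, n, 2))   # True: large drop in RSS
print(bic(79.5, n, 3) < bic(80.0, n, 2))   # False: negligible drop in RSS
```

In a bidirectional search, each candidate addition or removal would be scored this way and the move giving the lowest BIC kept, stopping when no move lowers it.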

2.3. Simulation

2.3.1. Simulation 1

To investigate the properties of the proposed approach to joint association mapping under varying scenarios and compare it with single-trait methods, we performed extensive simulation experiments. Without loss of generality, we restrict our discussion to two traits.

We assumed a total of 10,000 loci uniformly distributed throughout a genome, each with two alleles. Ten loci, labeled Q1–Q10, were randomly selected to have genetic effects. The first four, Q1–Q4, affected only trait 1 and the last four, Q7–Q10, affected only trait 2. We assigned two pleiotropic loci, Q5 and Q6, affecting both traits. We assigned different levels of polymorphism information content (PIC) to these loci; PIC for a locus is defined as

$$\mathrm{PIC} = 1 - \sum_{d=1}^{l} p_d^{2} - \sum_{d=1}^{l-1} \sum_{e=d+1}^{l} 2 p_d^{2} p_e^{2},$$

where $p_d$ and $p_e$ are the population frequencies of the dth and eth alleles among a total of l alleles [40]. The gene effect sizes and PIC values of the ten loci are listed in Table 2. The loci controlling only trait 1 (Q1–Q4) differ only in PIC value, and the loci controlling only trait 2 (Q7–Q10) differ only in effect value. We were thus able to compare the detection power of loci with different PIC and effect values.
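As a worked example of the PIC definition, the following sketch computes PIC from a list of allele frequencies using the standard form PIC = 1 − Σ p_d² − Σ_{d<e} 2 p_d² p_e². The function name `pic` is illustrative.

```python
def pic(freqs):
    """Polymorphism information content from population allele frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    cross = sum(2 * freqs[d] ** 2 * freqs[e] ** 2
                for d in range(len(freqs))
                for e in range(d + 1, len(freqs)))
    return 1.0 - homozygosity - cross


print(pic([0.5, 0.5]))  # 0.375
```

For a locus with two alleles the maximum is 0.375, reached at equal frequencies, which is consistent with the largest PIC value (0.37) listed for the simulated loci.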

Three levels of heritability ($h^2$), 30%, 50%, and 70%, were simulated, each under three different sample sizes (n), 100, 300, and 500. For each of the nine scenarios, 200 replicates were performed. The grand means for the two traits were both assumed to be 10. The residual correlation coefficient between traits 1 and 2 was set to 0.5. The residual variances of the traits were calculated according to the heritability and genetic variances. The residual covariance was determined by the corresponding residual correlation coefficient and the residual variances. The following criteria were used to assess the performance of the methods:

(1) statistical power of gene detection, denoted by the proportion of significant replicates in 200 replicates;

(2) accuracy and precision of the estimated effects, determined by the means and standard deviations of the effect estimates.

For comparison, each simulated data set was analyzed using the joint association method, LASSO, and PLS-based MLAS.
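The residual variances and covariance used in the simulation follow from simple arithmetic, sketched below. The sketch assumes heritability is defined as h² = Vg / (Vg + Ve), where Vg is the genetic variance and Ve the residual variance; the function names and numerical inputs are illustrative.

```python
import math


def residual_variance(genetic_variance, h2):
    """Solve h2 = Vg / (Vg + Ve) for the residual variance Ve."""
    if not 0 < h2 < 1:
        raise ValueError("heritability must lie strictly between 0 and 1")
    return genetic_variance * (1 - h2) / h2


def residual_covariance(ve1, ve2, r):
    """Covariance implied by residual correlation r and two residual variances."""
    return r * math.sqrt(ve1 * ve2)


ve1 = residual_variance(4.0, 0.5)   # Ve equals Vg when h2 = 50%
ve2 = residual_variance(9.0, 0.5)
print(ve1, ve2, residual_covariance(ve1, ve2, 0.5))
```

Lower heritability inflates Ve for a fixed Vg, which is why the same loci become harder to detect at h² = 30% than at 70%.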

2.3.2. Simulation 2

The main purpose of this simulation was to compare the performance of the joint association method under different levels of residual correlation coefficients. We again defined a total of 10,000 loci and assigned genetic effects to 10 randomly selected loci labeled G1–G10 under a fixed sample size of 300. However, the effects of the 10 loci were sampled from a normal distribution with mean 0 and variance 1. Again, three levels of heritability ($h^2$), 30%, 50%, and 70%, were simulated. The residual correlation coefficients between traits 1 and 2 were set at 0.2, 0.5, and 0.8.

All data simulations and analyses were implemented in the R statistical computing environment [41].

Table 2–Effect values and PIC values of genes in simulations.

                         Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8     Q9     Q10
Effect value  Trait 1    2      2      2      2      2      2      0      0      0      0
              Trait 2    0      0      0      0      1.5    −1.5   2.0    −2.0   2.5    −2.5
PIC value                0.07   0.22   0.3    0.35   0.37   0.37   0.37   0.37   0.37   0.37

3. Results

3.1. Statistical power for gene detection using the joint association analysis

Fig. 1 shows that statistical power is substantially enhanced by increasing sample sizes and heritabilities. In the scenario in which n = 100 and $h^2$ = 30%, the average detection power of the 10 loci is only 15.3%. However, when n = 500 and $h^2$ = 70%, the average detection power reaches 99.7%. Taking as an example the single locus Q2, when n = 100 and $h^2$ = 30%, the detection power of Q2 is only 5.5%, whereas it increases to 100% when n = 500 and $h^2$ = 70% (Table 3). Table 3 also shows that statistical power is correlated with PIC values. Loci with high PIC values are more easily detected. However, loci with low PIC values are difficult to detect even when the sample size is moderately large. For example, in the scenario in which n = 300 and $h^2$ = 30%, we still fail to detect Q1, but the statistical power to detect Q4 rises to 93%.

The magnitude of gene effects strongly influences gene detection, but the sign of the gene effects makes no obvious difference (Table 3). As the magnitude of gene effects grows, the detection power increases. For instance, in the scenario in which n = 300 and $h^2$ = 30%, the powers to detect Q7 and Q8 are 73.5% and 68%, respectively, whereas the powers to detect Q9 and Q10 are 95% and 95.5%, respectively.

3.2. Accuracy and precision of estimated effects using joint association analysis

The means and standard deviations of the estimated parameters under nine different combinations of simulation parameters are presented in Table 4. As expected, larger sample sizes and higher heritabilities tend to produce more accurate and precise estimates. In the scenarios with larger sample sizes and higher heritabilities, the estimated effect values are almost equal to the true values. When n = 500 and $h^2$ = 70%, the deviations of the estimated from the true values for Q1–Q10 are less than 0.03. Genes with higher PIC values are more likely to be estimated accurately (Table 4). When n = 100 and $h^2$ = 30%, the deviation of the estimated from the true value for Q2 is 1.67, while this deviation for Q4 is 0.47. When n = 500 and $h^2$ = 50%, the deviation is 0.06 for Q2 and 0.01 for Q4. PIC values also affect the precision of parameter estimates. For the scenario in which n = 100 and $h^2$ = 70%, the standard deviations of gene effects are 0.49, 0.46, 0.40, and 0.30 for Q1, Q2, Q3, and Q4, respectively, whereas in the scenario with n = 500 and $h^2$ = 70%, these standard deviations are 0.17, 0.11, 0.09, and 0.08. As the magnitude of gene effects increases, the estimated effect values tend to be more accurate (Table 4). For instance, when n = 100 and $h^2$ = 30%, the deviations of the estimated from the true values for Q7 and Q8 are 0.60 and 0.95, respectively, whereas these deviations for Q9 and Q10 are 0.38 and 0.50, respectively.

Fig. 1–Statistical power of joint association analysis affected by sample size and heritability.

3.3. Comparison of the joint association analysis with single-trait methods

The statistical power of the joint analysis, LASSO, and PLS-based MLAS is summarized in Table 3, from which it is clear that the joint analysis offers substantial advantages for detecting pleiotropic genes. Taking Q5 and Q6 as examples, when n = 100 and $h^2$ = 50%, the detection power of the joint analysis reaches 84.5% and 78.5%, respectively, whereas the power of PLS-based MLAS is 32% and 68% for trait 1 and only 4% and 1% for trait 2. The power of LASSO is 37.5% and 54.5% for trait 1 and 1.5% for both loci for trait 2. In addition, in most scenarios, even for loci that control only one trait, the joint analysis yields higher power than the other two methods. For example, when n = 100 and $h^2$ = 70%, the power of the joint method to detect Q7 is 60%, while those of PLS-based MLAS and LASSO are 46% and 36.5%, respectively. Similarly, the power to detect Q10 under these conditions is 96% using the joint analysis but 85.5% and 93.5% using PLS-based MLAS and LASSO, respectively. However, the joint method shows no advantage in detecting Q1. It fails to detect Q1 when the sample size is 100, and even when the sample size reaches 500, the power is still lower than that of the single-trait methods. Thus, the joint method may be unsuitable for genes with very low PIC values. We may also compare the false discovery rates for the joint analysis, PLS-based MLAS, and LASSO. Fig. 2 shows that the joint analysis yields lower false discovery rates than PLS-based MLAS and LASSO except for the scenarios in which n = 300 and $h^2$ = 70% or n = 500 and $h^2$ = 50% (Fig. 2-a). In the scenarios with large sample sizes and high heritabilities, the estimated false discovery rates of the joint analysis approach zero.

Table 3–Comparison of statistical powers under different scenarios using PLS-based MLAS, LASSO, and joint association analysis.

Sample size  Heritability (%)  Method       Statistical power (%)
                                            Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8     Q9     Q10
100          30                S1 (PLS)     5.5    6.5    2.5    6      12     16.5
                               S2 (PLS)                                 1.5    0      6.5    12.5   20     21
                               S1 (LASSO)   0      5.5    5.5    1      6.5    17.5
                               S2 (LASSO)                               0.5    0      5      4.5    14     17.5
                               J12          0      5.5    9      9.5    33.5   30.5   7      13.5   23.5   21.5
             50                S1 (PLS)     8.5    7.5    10.5   45.5   32     68
                               S2 (PLS)                                 4      1      22     31     69     61
                               S1 (LASSO)   4.5    6.5    12     62     37.5   54.5
                               S2 (LASSO)                               1.5    1.5    5      14.5   28     85.5
                               J12          0      8.5    16     64.5   84.5   78.5   24     34.5   65     87.5
             70                S1 (PLS)     9.5    13.5   50     82     61.5   94.5
                               S2 (PLS)                                 6.5    0      46     51     94.5   85.5
                               S1 (LASSO)   1      8.5    64     82     11.5   93.5
                               S2 (LASSO)                               8.5    0      36.5   87.5   89.5   93.5
                               J12          10     17.5   72     74     93     99.5   60     71.5   92     96
300          30                S1 (PLS)     15     40     74     86     80     85
                               S2 (PLS)                                 39     82     99     20     47     77
                               S1 (LASSO)   2      55     64     80.5   91     70.5
                               S2 (LASSO)                               9.5    30.5   55.5   61     85     87
                               J12          0      57     82     93     91     94.5   73.5   68     95     95.5
             50                S1 (PLS)     35     81     100    100    98     98
                               S2 (PLS)                                 26     59     86     52     94     100
                               S1 (LASSO)   30     93.5   99.5   98     97     95.5
                               S2 (LASSO)                               50.5   40.5   98     94     100    100
                               J12          12.5   99     99     100    100    100    99.5   100    100    100
             70                S1 (PLS)     58     99     100    100    100    99
                               S2 (PLS)                                 74     99     100    100    100    100
                               S1 (LASSO)   79.5   99     100    100    100    100
                               S2 (LASSO)                               91.5   88     100    100    100    100
                               J12          28     99.5   100    100    100    100    100    100    100    100
500          30                S1 (PLS)     19     58     93     100    99     95
                               S2 (PLS)                                 42     75     97     97     100    100
                               S1 (LASSO)   18.5   57.5   91     96.5   97.5   96.5
                               S2 (LASSO)                               47.5   90     98.5   87.5   98     100
                               J12          5      64     98     100    100    100    96     98.5   100    100
             50                S1 (PLS)     58     99     100    100    100    98
                               S2 (PLS)                                 90     100    100    100    100    100
                               S1 (LASSO)   60     97     99.5   100    100    98.5
                               S2 (LASSO)                               86.5   100    99.5   98.5   100    100
                               J12          17     99.5   100    100    100    100    100    100    100    100
             70                S1 (PLS)     98     100    100    100    100    97
                               S2 (PLS)                                 96     99     99     98     97     94
                               S1 (LASSO)   97.5   100    100    100    100    99
                               S2 (LASSO)                               83     100    99.5   98     100    100
                               J12          97     100    100    100    100    100    100    100    100    100

S1 and S2 indicate the single-trait analysis for trait 1 and trait 2, respectively. J12 denotes the joint association analysis for two traits.

Table 4–Accuracy and precision of estimated effects using joint association analysis. Means and standard deviations of the estimated gene effects are reported for each sample size (100, 300, and 500) and heritability (30%, 50%, and 70%), with the true values in parentheses: Q1 (2), Q2 (2), Q3 (2), Q4 (2), Q5 (2 for trait 1, 1.5 for trait 2), Q6 (2 for trait 1, −1.5 for trait 2), Q7 (2), Q8 (−2), Q9 (2.5), Q10 (−2.5). "–" denotes genes not detected.

3.4. Statistical power of the joint association analysis under different residual correlation coefficients

The performance of the joint analysis under different residual correlation coefficients and heritabilities is shown in Fig. 3. The joint analysis has similar power to detect most loci under different residual correlation coefficients. The power to detect G1, G2, G4, G5, G6, and G8 is almost 100% when $h^2$ = 50% or 70%. However, G3 can hardly be detected even when $h^2$ = 70%. For G9, when $h^2$ = 30%, the joint analysis has the highest power when the residual correlation coefficient equals 0.5. However, when $h^2$ = 50%, the power to detect G9 is highest in the scenario in which the residual correlation coefficient equals 0.8. These findings indicate that our method is not sensitive to the residual correlation coefficient.

4. Discussion

We have described a joint association analysis approach to detecting associated genes and estimating their genetic effects, using multivariate PLS and stepwise regression. The results suggest that gene detection power as well as accuracy and precision of parameter estimation increase with sample size, heritability, PIC value, and magnitude of gene effects. By accounting for information from correlated traits, the joint association method offers several advantages over single-trait methods [30]. First, the joint analysis shows substantial advantages in detecting pleiotropic genes. Second, in most simulation scenarios, the joint association analysis outperforms single-trait analysis with respect to statistical power of gene detection. Similar results have been previously reported [30,42,43]. Jiang and Zeng [30] described a multiple trait analysis approach to mapping QTL. Xiao et al. [43] used joint segregation analysis to detect genes of major effect. Our method adopts variable selection strategies and is computationally considerably simpler than the single-trait methods. However, there is no advantage to using joint analysis to analyze candidate genes with low PIC, such as Q1. This lack of power may be due to the loss of variance during component extraction using PLS regression.

To select the optimal model in simulation data, we preset the maximum number of steps (in the stepwise regression) according to the number of true effects, but in real data, the number of causal genes is unknown a priori. For this reason, it is necessary to establish a critical value. Bonferroni correction is commonly used to correct for multiple tests of significance, but this method is highly conservative and accordingly may miss real genes [44]. Holm [45] initially proposed an easily implemented sequentially rejective multiple test procedure, and Shaffer [46] then improved the power of Holm's procedure at the cost of greater complexity. Holland and Copenhaver [47] showed that these procedures can achieve more power on the assumption of positive orthant dependence of the statistics. In analysis of real data, the above methods could be used to set appropriate threshold values. BIC, as a variable selection criterion, can also be adjusted depending on different objectives. If the objective is to perform marker-assisted selection, a heavier penalty can be imposed on the model so that genes with small effects will be eliminated. However, if the objective is genome-wide prediction with a large number of markers, the penalty of the BIC can be appropriately reduced. In this study, we selected the variables with the largest regression coefficients and VIP scores to reduce the variable dimension. Effect sizes of most genomic loci follow an inverse chi-square distribution; specifically, the design matrix is a so-called sparse matrix in which small minorities of loci have large effect values and the remaining loci have slight or zero effects. Thus, we can further reduce the total number of elements included in $S_{kb}$ and $S_{\mathrm{VIP}}$. It is also acceptable to choose the largest n/3 or even fewer elements to form a new set.
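The difference between the Bonferroni correction and Holm's sequentially rejective procedure can be seen in a short sketch. The p-values are invented for illustration and the function names are hypothetical; the sketch covers only these two procedures, not the refinements of Shaffer or of Holland and Copenhaver.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m, with m the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the ordered p-values with
    alpha/m, alpha/(m-1), ..., alpha and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # this and all larger p-values are not rejected
        reject[i] = True
    return reject


p = [0.001, 0.015, 0.02, 0.04]
print(bonferroni(p))  # [True, False, False, False]
print(holm(p))        # [True, True, True, True]
```

Both procedures control the family-wise error rate at alpha, but Holm's relaxes the threshold after each rejection and so never rejects fewer hypotheses than Bonferroni.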

This study is limited in that the proposed method applies only to natural populations without apparent kinship; accordingly, our next objective is to modify the model to consider the effects of kinship and population structure. We intend to adopt mixed models and treat kinship as a covariate to expand the applicable scope of this method. Additionally, we have formulated our analysis with additive effects, but it is also desirable to consider interaction effects, using the method of Zeng et al. [48] for reference.

Fig. 2–Comparison of false discovery rates of PLS-based MLAS, LASSO, and the joint association analysis under different

5. Conclusion

We have demonstrated a joint association approach to identifying genes and estimating their genetic effects. Compared to LASSO and PLS-based MLAS, the joint analysis considerably improves the ability to detect pleiotropic genes. In most scenarios, the joint analysis yields higher power and lower false discovery rates than the single-trait methods. Sample size, heritability, PIC, and magnitude of gene effects contribute strongly to the statistical power, accuracy, and precision of effect estimation. However, single-trait methods are superior to the joint analysis for detecting genes with low PIC values, perhaps owing to the loss of variance during component extraction using PLS regression.

Acknowledgments

This work was supported by grants from the National Program on the Development of Basic Research (2011CB100100), the Priority Academic Program Development of Jiangsu Higher Education Institutions, the National Natural Science Foundations (31391632, 31200943, 31171187, and 91535103), the National High-tech R&D Program (863 Program) (2014AA10A601-5), the Natural Science Foundations of Jiangsu Province (BK20150010), the Natural Science Foundation of the Jiangsu Higher Education Institutions (14KJA210005), and the Innovative Research Team of Universities in Jiangsu Province (KYLX_1352).

Fig. 3–Comparison of joint association analysis under different residual correlation coefficients. Heritability equals 30% (a), 50% (b), and 70% (c).

R E F E R E N C E S

[1] C.S. Zhu, M. Gore, E.S. Buckler, J.M. Yu, Status and prospects of association mapping in plants, Plant Genome 1 (2008) 5–20.

[2] S.A. Flint-Garcia, J.M. Thornsberry, E.S. Buckler IV, Structure of linkage disequilibrium in plants, Annu. Rev. Plant Biol. 54 (2003) 357–374.

[3] P.K. Gupta, S. Rustgi, P.L. Kulwal, Linkage disequilibrium and association studies in higher plants: present status and future prospects, Plant Mol. Biol. 57 (2005) 461–485.

[4] J.M. Thornsberry, M.M. Goodman, J. Doebley, S. Kresovich, D. Nielsen, E.S. Buckler, Dwarf8 polymorphisms associate with variation in flowering time, Nat. Genet. 28 (2001) 286–289.

[5] G.Q. Jia, X.H. Huang, H. Zhi, Y. Zhao, Q. Zhao, W.J. Li, Y. Chai, L.F. Yang, K.Y. Liu, H.Y. Lu, C.R. Zhu, Y.Q. Lu, C.C. Zhou, D.L. Fan, Q.J. Weng, Y.L. Guo, T. Huang, L. Zhang, T.T. Lu, Q. Feng, H.F. Hao, H.K. Liu, P. Lu, N. Zhang, Y.H. Li, E.H. Guo, S.J. Wang, S.Y. Wang, J.R. Liu, W.F. Zhang, G.Q. Chen, B.J. Zhang, W. Li, Y.F. Wang, H.Q. Li, B.H. Zhao, J.Y. Li, X.M. Diao, B. Han, A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica), Nat. Genet. 45 (2013) 957–961.

[6] X.H. Huang, X.H. Wei, T. Sang, Q. Zhao, Q. Feng, Y. Zhao, C.Y. Li, C.R. Zhu, T.T. Lu, Z.W. Zhang, M. Li, D.L. Fan, Y.L. Guo, A.H. Wang, L. Wang, L.W. Deng, W.J. Li, Y.Q. Lu, Q.J. Weng, K.Y. Liu, T. Huang, T.Y. Zhou, Y.F. Jing, W. Li, Z. Lin, E.S. Buckler, Q. Qian, Q.F. Zhang, J.Y. Li, B. Han, Genome-wide association studies of 14 agronomic traits in rice landraces, Nat. Genet. 42 (2010) 961–967.

[7] J.B. Yan, M. Warburton, J. Crouch, Association mapping for enhancing maize (Zea mays L.) genetic improvement, Crop Sci. 51 (2011) 433–449.

[8] H. Li, Z.Y. Peng, X.H. Yang, W.D. Wang, J.J. Fu, J.H. Wang, Y.J. Han, Y.C. Chai, T.T. Guo, N. Yang, J. Liu, M.L. Warburton, Y.B. Cheng, X.M. Hao, P. Zhang, J.Y. Zhao, Y.J. Liu, G.Y. Wang, J.S. Li, J.B. Yan, Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels, Nat. Genet. 45 (2013) 43–50.

[9] F. Zhang, X. Guo, H.W. Deng, Multilocus association testing of quantitative traits based on partial least-squares analysis, PLoS ONE 6 (2011), e16739, http://dx.doi.org/10.1371/journal.pone.0016739.

[10] K. Wang, D. Abbott, A principal components regression approach to multilocus genetic association studies, Genet. Epidemiol. 32 (2008) 108–118.

[11] J. Akey, L. Jin, M.M. Xiong, Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur. J. Hum. Genet. 9 (2001) 291–300.

[12] W.J. Gauderman, C. Murcray, F. Gilliland, D.V. Conti, Testing association between disease and multiple SNPs in a candidate gene, Genet. Epidemiol. 31 (2007) 383–395.

[13] P. Marttinen, J. Corander, Efficient Bayesian approach for multilocus association mapping including gene–gene interactions, BMC Bioinf. 11 (2010) 443.

[14] R.R. Hocking, The analysis and selection of variables in linear regression, Biometrics 32 (1976) 1–49.

[15] I.G. Chong, C.H. Jun, Performance of some variable selection methods when multicollinearity is present, Chemom. Intell. Lab. Syst. 78 (2005) 103–112.

[16] A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.

[17] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B Methodol. 58 (1996) 267–288.

[18] H. Zou, T. Hastie, Regression shrinkage and selection via the elastic net, with applications to microarrays, J. R. Stat. Soc. Ser. B Methodol. 67 (2005) 301–320.

[19] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Methodol. 67 (2005) 301–320.

[20] Y.F. Shen, J. Zhu, Power analysis of principal components regression in genetic association studies, J. Zhejiang Univ. Sci. B 10 (2009) 721–730.

[21] G. Palermo, P. Piraino, H.D. Zucht, Performance of PLS regression coefficients in selecting variables for each response of a multivariate PLS for omics-type data, Adv. Appl. Bioinforma. Chem. 2 (2009) 57–70.

[22] H. Abdi, L.J. Williams, Principal component analysis, WIREs Comput. Stat. 2 (2010) 433–459.

[23] T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in partial least squares regression, Chemom. Intell. Lab. Syst. 118 (2012) 62–69.

[24] P. Geladi, B.R. Kowalski, Partial least-squares regression: a tutorial, Anal. Chim. Acta 185 (1986) 1–17.

[25] A. Höskuldsson, PLS regression methods, J. Chemom. 2 (1988) 211–228.

[26] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemom. Intell. Lab. Syst. 18 (1993) 251–263.

[27] R.B. O'Hara, M.J. Sillanpää, A review of Bayesian variable selection methods: what, how and which, Bayesian Anal. 4 (2009) 85–117.

[28] E.I. George, R.E. McCulloch, Variable selection via Gibbs sampling, J. Am. Stat. Assoc. 88 (1993) 881–889.

[29] N. Yi, V. George, D.B. Allison, Stochastic search variable selection for identifying multiple quantitative trait loci, Genetics 164 (2003) 1129–1138.

[30] C.J. Jiang, Z.B. Zeng, Multiple trait analysis of genetic mapping for quantitative trait loci, Genetics 140 (1995) 1111–1127.

[31] S. Kim, E.P. Xing, Statistical estimation of correlated genome associations to a quantitative trait network, PLoS Genet. 5 (2009), e1000587.

[32] Y.H. Yin, Q.Z. Zhang, M.Q. Liu, A two-stage variable selection strategy for supersaturated designs with multiple responses, Front. Math. China 8 (2013) 717–730.

[33] J.P. Andries, Y.V. Heyden, L.M. Buydens, Predictive-property-ranked variable reduction with final complexity adapted models in partial least squares modeling for multiple responses, Anal. Chem. 85 (2013) 5444–5453.

[34] S. Xu, Genetic mapping and genomic selection using recombination breakpoint data, Genetics 195 (2013) 1103–1115.

[35] T.T. Wu, Y.F. Chen, T. Hastie, E. Sobel, K. Lange, Genome-wide association analysis by lasso penalized logistic regression, Bioinformatics 25 (2009) 714–721.

[36] Y. Guan, M. Stephens, Bayesian variable selection regression for genome-wide association studies and other large-scale problems, Ann. Appl. Stat. 5 (2011) 1780–1815.

[37] A. Ciampi, L. Yang, A. Labbe, C. Mérette, PLS regression and hybrid methods in genomics association studies, in: H. Abdi, W.W. Chin, V.E. Vinzi, G. Russolillo, L. Trinchera (Eds.), New perspectives in partial least squares and related methods, Springer Proc. Math. Stat., 56, Springer, New York 2013, pp. 107–116.

[38] S. Wold, Exponentially weighted moving principal components analysis and projections to latent structures, Chemom. Intell. Lab. Syst. 23 (1994) 149–161.

[39] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (1978) 461–464.

[40] D. Botstein, R.L. White, M. Skolnick, R.W. Davis, Construction of a genetic linkage map in man using restriction fragment length polymorphisms, Am. J. Hum. Genet. 32 (1980) 314–331.

[41] The R Core Team, R: a language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2015.

[42] P.F. O'Reilly, C.J. Hoggart, Y. Pomyen, F.C. Calboli, P. Elliott, M.R. Jarvelin, L.J. Coin, MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS, PLoS ONE 7 (2012), e34861.

[43] J. Xiao, X. Wang, Z. Hu, Z. Tang, C. Xu, Multivariate segregation analysis for quantitative traits in line crosses, Heredity 98 (2007) 427–435.

[44] S.R. Narum, Beyond Bonferroni: less conservative analyses for conservation genetics, Conserv. Genet. 7 (2006) 783–787.

[45] S. Holm, A simple sequentially rejective multiple test procedure, Scand. J. Stat. 6 (1979) 65–70.

[46] J.P. Shaffer, Modified sequentially rejective multiple test procedures, J. Am. Stat. Assoc. 81 (1986) 826–831.

[47] B.S. Holland, M.D. Copenhaver, An improved sequentially rejective Bonferroni test procedure, Biometrics 43 (1987) 417–423.

[48] Z.B. Zeng, C.H. Kao, C.J. Basten, Estimating the genetic architecture of quantitative traits, Genet. Res. 74 (1999) 279–289.
