Charles Hawkins,Long-Xi Yu*
United States Department of Agriculture-Agricultural Research Service,Plant Germplasm Introduction and Testing Research,Prosser,WA 99350,USA
Keywords:Alfalfa Genomics GWAS GBS Breeding
ABSTRACT Alfalfa(M.sativa L.)is a highly valuable forage crop,providing>58 Mt of hay,silage,and pasture each year in the United States.As alfalfa is an outcrossing autotetraploid crop,however,breeding for enhanced agronomic traits is challenging and progress has historically not been rapid.Methods that make use of genotypic information and statistical models to generate a genomic estimated breeding value(GEBV)for each plant at a young age hold a great deal of promise to accelerate breeding gains.An emerging genomic breeding pipeline employs SNP chips or genotyping-by-sequencing(GBS)to identify SNP markers in a training population,followed by the use of a statistical model to find associations between the discovered SNPs and traits of interest,followed by genomic selection(GS),a breeding program utilizing the trained model to predict breeding values and making selections based on the estimated breeding value(EBV).Much work has been done in recent years in all of these areas,to generate marker sets and discover SNPs associated with desirable traits,and the application of these technologies in alfalfa breeding programs is under way.However,GBS/GWAS/GS is still a new breeding paradigm,and work is ongoing to evaluate different models,software,and methods for use in such programs.In this review,we look at the progress of alfalfa genomics over the past halfdecade,and review work comparing models and methods relevant to this new type of breeding strategy.
Alfalfa(Medicago sativa L.)is a legume forage crop grown primarily for hay,silage,and pasture.In 2016,the United States produced 58,263,000 t of alfalfa[1].Cultivated alfalfa is an outcrossing autotetraploid(2n=4×=32)with a basic chromosomenumberofeightandagenomesizeof 800–1000 Mb[2].Alfalfa plants are highly heterozygous and exhibit severe inbreeding depression,precluding the development of inbred lines[3].It is a considerable challenge to genotype individualswith such acomplicated genome.However,recent advances in next-generation sequencing have provided a new strategy to generate cost-effective,high-density,genome-wide single nucleotide polymorphism(SNP)sets[4].In conjunction with genome-wide association studies(GWAS)and/or genomic selection(GS)[5],more powerful platforms can be developed to improve gains in alfalfa breeding.Given that alfalfa cultivars are genetically broad-based synthetic populations,they provide an ideal system in which genotyping by sequencing(GBS),association mapping(AM),and GS can be applied.Several reports have been published in recent years describing the generation of marker sets in alfalfa and performing association and QTL mapping,and others have compared different computational methods for use in genomic selection.In this article,we review work in both these areas.
Alfalfa breeding programs have focused on forage yield and quality,resistance to biotic and abiotic stressors,and fall dormancy.Conventional breeding is typically based on simple phenotypic selection,in which each individual plant must be phenotyped,and on pedigree-based methods such as BLUP[6]which attempt to predict individual breeding values based on pedigree.Progress using conventional selection has been slow,and modern programs are turning to breeding techniques based on genotyping,including marker-assisted selection(MAS)and genomic selection which offer the promise of more rapid breeding cycles and fewer necessary phenotypic evaluations.In any such program,a training population of alfalfa must be genotyped for a set of markers,and then the same population must be phenotyped for the trait(s)of interest.A statistical model is used to associate the markers with the traits.This model is later applied to the breeding population to predict the breeding values of individuals based on their genotype,and plants are selected for breeding based on the genomic estimated breeding value(GEBV).Because it relies only on a plant's genotype,which can be determined when the plant is young,a GEBV allows for shorter selection cycles than are required in phenotypic selection.The model does need to be kept updated with further phenotypic testing as the population changes,but this can be done concurrently with the selection cycle.
Genomic selection(also referred to as genome-wide selection,GWS)[5,7,8]is a breeding technique wherein selection of individuals for breeding is made based on genotyping many polymorphic sites(markers)using a statistical model that has linked these markers to one or more desirable traits(Fig.1).The model must first be trained using a training population that is genotyped at many loci,with the goal of obtaining as much coverage of the genome as possible,and phenotyped.The statistical model is used to discover associations between alleles at the genotyped loci and the phenotype(s)being measured,such as yield or resistance to biotic or abiotic stressors.GS associates traits directly with genotypes and does notrelyon kinship.UnlikeconventionalMAS,a technology that identifies a small set of the markers with the most significant associations with the trait and selects breeding candidates with the goal of driving the desirable alleles to fixation,selection in GS is based on a model incorporating all loci identified as polymorphic.Compared with MAS,GS is better at identifying larger numbers of smaller-effect loci and has produced promising results in many crop species[5].
Effectiveness of GS and MAS is potentially confounded by genotype×environment(G×E or GE)interactions,such that the effect of a given genotype on performance measures is different in different environments and phenotype therefore cannot be expressed as a linear combination of genotypic and environmental effects.This means that varieties bred in one environment may not perform as well(relative to the unimproved stock)when grown in a different environment[9,10].Conducting trials under multiple conditions is the only way to uncover these effects,though it is not necessary to test all genotypes in all environments;Burgue?o et al.[10]presented a set of factor-analytic multienvironment models incorporating pedigree information,markers,both,or neither,and evaluated them based on correlation between predicted andactualphenotypicmeasuresin same-andcross-environment studies using wheat data.The wheat used in the study was Triticum aestivum L.(common wheat),a selfpollinatinghexaploid (2n=6x=42).Lyetal.[9]tested differentcross-validation schemes in cassava,a selfpollinating diploid(2n=36),and found that genetic relatedness and environmental homogeneity between the training and validation sets can result in overestimation of prediction accuracies.
Fig.1–GBS pipeline.Illustration of a standard pipeline for genotyping-by-sequencing(GBS).Beginning with a population of plants(A),each plant has its genomic DNA extracted(B),generally into multi-well plates.The DNA is digested with one or two restriction enzymes selected for good genomic coverage(C),such as ApeKI or EcoT22I.The digested DNA is ligated to a set of barcode adapters,with a unique adapter for each sample(D).The barcoded fragments are then pooled(E),one pool per plate,and the pools are sequenced with a high-throughput platform such as HiSeq(F).Sequencing coverage of 15×is recommended for calling presence–absence genotypes,or 60× for calling full tetraploid dosage(G).A variant-calling pipeline is used to call variants;a simplified illustration of the FreeBayes pipeline is given here;the blue boxes contain the names of the software used to perform each step(H).The variants are then filtered to remove false-positive calls caused by sequencing errors(I).
More than one genotyping technology may be used with genomic selection.Gene chip arrays were a popular option for early efforts,but as the costs of high-throughput sequencing have come down,sequencing-based genotyping approaches have become more viable.At present,it is not cost-effective to sequence whole genomes to a depth sufficient for genomic selection,so reduced-representation approaches are used instead.These approaches attempt to sequence only a fraction of the genome,but at fixed points with broad coverage,with the goal that any variant loci discovered will,if not causal themselves,at least be in linkage disequilibrium(LD)with loci that are.Currently,genotyping-by-sequencing(GBS)is the most popular reduced-representation approach for genomic selection in crop species.This approach uses selected restriction enzymes such as methylation-sensitive enzymes to cut the genome at fixed points,and sequences the ends of the restriction fragments for genotyping.
Most GBS follows the approach described by Elshire et al.[4].In this type of GBS,genomic DNA is extracted from each individual in the population under study.The DNA is digested with one or two restriction enzymes selected for cut sites that show good coverage of the genome.ApeKI and EcoT22I are commonly-used enzymes,with an average of 1 cut site per 1.5 kb and 1 cut site per 2 kb,respectively,in the M.truncatula Gaertn.(barrel medic)genome assembly Mt4.0v1[11].Each digested sample is ligated to a unique barcode adapter,to enable later demultiplexing of samples in silico.The samples are combined,amplified by PCR,and sequenced using a highthroughput next-generation sequencing platform,most commonly Illumina's HiSeq platform using a sequencing primer binding site common to the adapters.Single-ended,100-bp sequencing is most common,giving an expected overall coverage of~1/7 of the genome.One of a variety of pipelines is then used to call SNPs and multiple nucleotide polymorphisms(MNPs),identifying polymorphic sites and calling genotypes.A coverage of 48–60× is considered sufficient to call gene dosage at a site[12],but only 15×is required to call presence–absence[13].The data may be used in a genomewide association study with phenotypic data taken from the sampled individuals to identify variant loci linked to traits of interest.Because genome coverage is incomplete,most of the polymorphisms identified are unlikely to be causal,but are likely instead merely to be in LD with loci that are causal.This is sufficient for the application of molecular breeding techniques such as MAS and GS.As a consequence of the use of linkage to estimate breeding values,however,recombination between the causal and marker loci will occur over the course of a breeding program,reducing the effectiveness of the predictive models or discovered associations over time.The models or associations should be updated with further phenotyping over the course of selection in order to mitigate this problem.
Much work has been done in recent years in alfalfa genomics.Numerous studies have generated extensive sets of markers,including SNPs,MNPs,small indels,and SSR markers,based in large part on methods using high-throughput sequencing of the alfalfa genome and transcriptome.Many more have used such marker sets to identify sets of markers associated with desirable agronomic traits such as yield and resistance to biotic and abiotic stressors.Here we review work done since 2012,as Li and Brumner[14]have published a good review of progress up to that point.
Because of the importance of markers for both MAS and GS,numerous studies have generated sets of markers for alfalfa.Some studies have used transcriptome sequencing or other EST-based methods to produce sets of markers.Others have used genomic PCR-based methods such as SRAP[15],a system using semi-randomized primers designed to target exons;AFLP[16],where primer extensions are used to selectively amplify restriction fragments;and SCoT,[17]where primers are designed to target the common sequence upstream of plant start codons.
Li et al.[18]used RNAseq to sequence the transcriptomes of alfalfa stem tissue from a diverse set of plants from both elite and unimproved populations.Stems were used because the study was focused on the use of alfalfa as a biofuel feedstock.The study identified 871,384 SNPs and 31,760 indels,and used principal components analysis(PCA)to probe the relationships among the different varieties.The elitepopulationswerefoundnottohavesignificantly diminished diversity compared with the unimproved wild plants.The same group later produced a SNP array based on this set of loci[19].Liu et al.[20]studied transcriptomes from 15 tissue types from the “Golden Queen”alfalfa cultivar,including germinated seeds,cotyledons,leaves,stems,inflorescencesandpods.Thestudyfound40,433unigenes(transcripts assembled from RNA sequence data)and identified 1694 SSR markers(2–6 nucleotides,≥5 repeats).Wang et al.[21]sequencedRNAfromtheshootsof36alfalfa accessions,along with one each of M.sativa ssp.falcata and M.truncatula.The study found 54,278 unigenes,among which 4493 SSR markers were discovered,and mapped the markers to the M.truncatula genome.Zhou et al.[22]scanned the sequences of 40,433 previously identified unigenes for SSRs in silico.The study discovered 750 SSRs and analyzed ten alfalfa accessions for polymorphisms.Of the 512 primer pairs that produced usable results,204 showed polymorphisms.
GBS-based marker sets and associated linkage maps can offer better coverage than transcriptome-based sets,as they are not limited to transcribed regions.Although wholegenome sequencing(WGS)would offer more still,it is not yet sufficiently economical for large-scale marker discovery.Li et al.[23]constructed a GBS-based linkage map in alfalfa using a biparental F1population of 384 progeny,based on 3591 SNP markers discovered across the eight alfalfa linkage groups.The reference M.truncatula genome was not used in SNP calling or construction of the map,but the SNPs were mapped back to it after they were called,revealing a high degree of synteny between the alfalfa linkage map and the M.truncatula physical map.Liu et al.[24]focused specifically on EST-SSR markers in known transcription factors,TF genederived microsatellite markers(TFGMs)in M.truncatula.The study found 203 SSRs among 176 transcription factors and designed 185 primer pairs to test for them.Of these,121 were found to be transferable to alfalfa,and the study tested 44 alfalfa accessions for 35 of the SSRs(randomly selected from the 121),and found 27 of them to be polymorphic.Han et al.[25]developed an Illumina iSelect Infinium array of 9277 SNP markers identified through sequencing.The study used this array to genotype a biparental cross and was able to integrate 2746 of the SNP markers into an existing linkage map.Annicchiarico et al.[26]performed GBS on 11 landraces from northern Italy,using three bulks of 100 plants for each landrace(3300 plants in total),and identified 2902 SNP markers that were polymorphic among the landraces.These,along with 62 EST-SSR markers and 11 morphophysiological traits,were used to explore the distinctiveness and distinguishability of the landraces.
A complete genome sequence for alfalfa has not yet been completed;hence the reliance on the related M.truncatula genome as a reference for work in alfalfa.Work to produce a genomic sequence for M.sativa is,however,under way in the form of the Cultivated Alfalfa at the Diploid Level(CADL)project[27].
Studies to identify markers associated with phenotypic traits can provide a pool of significant markers for use with MAS.These studies may use previously-identified markers or generate their own sets.There are two approaches to marker-trait association:QTL mapping and association mapping(Also known as LD mapping or GWAS).QTL mapping is performed with a biparental mapping population,resulting in longer areas of linkage.Association mapping is performed with arandomly intermatedpopulation orapanel of germplasm from different sources.
Though association mapping is the newer technique,QTL mapping continues to find useful application,particularly for SSR,RFLP,and other marker types not based on large-scale genome sequencing.Khu et al.[28]identified QTL associated with aluminum tolerance using a biparental cross between a susceptible and a resistant parent.The study utilized 1024 SSR markers(755 previously known markers plus 269 markers newly developed from unigene sequences)genotyped via PCR,and phenotyped for petiole-derived callus growth and for total root length ratio(TRLR)of stem cuttings grown in media.Of the 257 markers found to be polymorphic between the parent plants,20 showed a significant association with callus growth under aluminum exposure and 41 with TRLR.A linkage map was also constructed using 275 of the markers.Zhangetal.[29]genotyped352individualsfromtwo biparental crosses for 172 previously-identified SNP markers via high-resolution melting(HRM),and phenotyped them for susceptibility to Verticillium wilt(VW).Twenty-one markers showed significant association with VW resistance.McCord et al.[30]mapped QTL associated with yield,lodging resistance,and spring vigor using a hybrid of lodging-resistant and-susceptible plants backcrossed to the susceptible parent line,obtaining a mapping population of 128 plants.Two hundred and thirty-six markers(58 SRAP,142 AFLP,36 SSR)were used to make the linkage map.The study found three significant QTL for lodging resistance,seven for forage yield(two major,meaning each explains>10%of variation,and five minor),and four for spring vigor.Li et al.[31]discovered QTL associated with fall dormancy via QTL mapping in an F1population from a cross between fall-dormant and nondormant alfalfa varieties.Two hundred individuals were phenotyped for fall dormancy and winter hardness and genotyped for 601 alleles at 249 previously-identified markers(78 RFLP,123 SSR,48 SNP)and the two were tested for association via a generalized linear model.The study found 71 QTL associated with fall dormancy and related traits,and built an updated linkage map.Ray et al.[32]genotyped additional progeny from a previous pair of backcrosses between droughtsusceptible and drought tolerant plants[33],phenotyping a total of 253 plants for yield under drought.Genotyping was performed using existing SSR markers and novel SNPs identified using a candidate-gene approach.A total of 600 markers were used(43 candidate-gene SNPs and 557 SSR markers),and a linkage map was constructed.Fourteen SNPs and 9%of the SSR markers showed significant association with yield under drought.
A number of association mapping studies have also been conducted in recent years. ?akiro?lu et al.[34]scored 89 known SSR markers in 374 individuals from 120 accessions usingPCRandphenotypedtheindividualsforseveral agronomic traits related to the cell wall and biomass yield.A mixed linear model was used to test marker-trait association,and three significant markers were found.The study also examined LD across the genome and found LD varying between 500 bp and 1 kb in different regions.Dubé et al.[35]performed SRAP amplification on a selection of genotypes from 619 alfalfa cultivars,searching for markers associated with two biofuel-relevant traits,cell wall degradability and stem lignification.Forty-two markers were identified with significant associations with these traits.Zhang et al.[36]genotyped apanelof198accessionsusingGBSand phenotyped for drought resistance index,and performed GWAS using a mixed linear model.The study identified 26,163 SNP and MNP markers and found 19 of these to be significantly associated with drought tolerance.Yu et al.[37]used GBS to genotype 198 plants from the same panel of drought-tolerant accessions and phenotyped them for germination rate under salt stress.The study identified 10,327 SNPs,of which 4653 passed filtering for use in the GWAS.Of them,23 showed significant association with salt tolerance during germination.Liu and Yu[38]continued with the same panel of accessions,assessing them for yield under salt stress,and identified 42 SNPs significantly associated with salt tolerance.Yu et al.[39]used similar methodology to look for markers associated with resistance to Verticillium wilt(Verticillium alfalfae).A panel of 179 breeding lines was phenotyped for VW resistance and genotyped with GBS.The study found 169,008 markers,of which 19,801 passed filtering.Of these,10 were significantly associated with VW resistance.Yu et al.[40]genotyped a full-sib F1population of 188 individuals,from a crossbetween two parents resistant and susceptible to Verticillium wilt,using GBS and variants called using three different pipelines.The plants were phenotyped for VW resistance and a mixed linear model was used to discover associations.The study compared the results of three GBS pipelines(see “Software Pipelines for Variant Calling”below).The three marker sets generated by the three pipelines were 10,403,24,176,and 14,415 SNPs in size,and within these marker sets,21,22,and 13,respectively,were found to have significant association with VW resistance. ?akiro?lu and Brumner[41]discovered 15,154 SNP markers in a population of 362 plants from 120accessionsofdiploid alfalfausingGBS,and phenotyped them for 23 traits related to yield and nutritive value.Testingusingamixedlinearmodelrevealed65SNPswith significant association with one or more of these traits.The study searched for genes underlying these loci and found five plausible candidates involved in growth,development and cell wall biogenesis.Biazzi et al.[42]used GBS in a population of 154 plants from a cross between three cultivars and identified 11,450 SNPs.The study phenotyped the plants for nutritive quality and used them to discover associations with 8494 of the SNPsusing a mixedlinearmodel.Eighty-threeSNPswerefound to be significantly associated with the nutritive-quality traits.Jia et al.[43]conducted association mapping for protein and mineral content on a panel of 336 individuals from 75 accessions.The plants were genotyped for 85 SSR markers and phenotyped for protein content and several minerals.Eightyfive marker-trait associations were found.Yu[44]conducted association mapping for yield under drought stress on a panel of 200 alfalfa accessions from the National Plant Germplasm System.Using GBS,a set of 10,327 SNP markers was obtained.Theplantsweregrownunderdroughtconditionsand phenotyped for yield.The study found 28 SNP markers to be significantly associated with yield under drought(FDR of 0.05).
Genomic selection is beginning to be used in breeding programsfor agronomic traitsin alfalfa.Li et al.[45]genotyped an alfalfa population developed from a mixture of commercial cultivars using GBS,identifying 175,000 SNPs,and phenotyped it for dry yield.Associations between the two were tested using a mixed linear model.The population was subjected to two cycles of genomic selection,and prediction accuracies of 0.43 to 0.66 were obtained.Annicchiarico et al.[46]performed GBS on a total of 278 individuals from two populations(Po Valley and Mediterranean,124 and 154 plants,respectively),and identified 68,972 and 77,610 SNP markers,respectively.The study took biomass yield data from the populations and used a cross-validation scheme to evaluate the accuracy of several different prediction models with different marker imputation methods under different levels of simulated missing genotypes and different filtering parameters(see “Marker Imputation and Association Models”below for more discussion of this aspect of the study).Accuracies of up to 0.32 and 0.35 were obtained,respectively,for the two populations.These scores are not high,a result the authors attributed to a prevalence of non-additive genetic effects known to influence biomass yield,but they argued that at the low narrow-sense heritability observed for the trait(0.22),GS can still be expected to outperform conventional phenotypic selection.Additionally,previous work has suggested that even at accuracies as low as 0.2–0.3,GS may still outperform MAS in terms of gain per unit time[47],and more so at low heritabilities,as was the case here.
The pipeline employing genotyping-by-sequencing followed by genome-wide association or genomic selection,is useful as a generalstructureformolecular-breedingprojects.In implementing such a scheme,however,numerous choices of software,statistical methods,and details of the breeding program are available.For most of these,no one option has yet emerged as the best in all circumstances,and work has been ongoing to compare existing techniques and develop new ones.Here we will examine competing techniques and software,and recent studies that have sought to compare them,in several areas:the choice of software pipeline used to call SNP and other variant sites based on sequencing reads,the choice of algorithm for imputing missing marker genotypes,the choice of statistical model for marker–trait association,and approaches to managing inbreeding and short-vs long-term gain in genomic selection(Fig.2).
Popular pipelines for calling variants are TASSEL GBS v1[48],UNEAK,and haplotype-based pipelines such as Freebayes[49].The former two use the TASSEL software package.Other variant callers include GATK,Stacks,and mpileup from Samtools.All of the callers except UNEAK require that reads be aligned to a reference sequence,a procedure usually performed with the Burrows-Wheeler Aligner(BWA)[50].Little to no discernible advantage has been found in calling alfalfa SNPs with reference to the M.truncatula genome,however[23],and the reference-free UNEAK pipeline is very popularforGBS/GWAS/GSstudiesin thisspecies(e.g.[23,41,42,45,46]).More recently,Yu et al.[40]compared the UNEAK pipeline,the reference-based TASSEL-GBSv1 pipeline,and the Freebayes pipeline in calling variants in a population of 188 F1progeny from a biparental cross.The total number of called variants were 14,415 for UNEAK,24,167 for TASSELGBSv1,and 10,403 for Freebayes.Each marker set was used independently for GWAS with the same set of genotyping data and comparable results were found.
In the process of calling variant sites,it is necessary to distinguish true variants from apparent variants arising from sequencing errors.This may be accomplished by choosing the requirements to call a variant site and by filtering after the fact.Variants are typically filtered by q-score,by read depth,by minor allele frequency,by fraction of missing calls,and,depending on the population type and calling method,based on allele frequency relative to the parents and on LD with nearby markers.
Fig.2–Genomic selection pipeline.Genomic selection is a molecular-breeding technique in which genotyping and phenotypic evaluations of a training population are used to train a statistical model that then makes predictions of the breeding value of individuals in the breeding population based on their genotypes.These genomic estimated breeding values(GEBVs)are used to select families to be intermated and advanced to the next generation.
Phred quality score,or q-score,is a quality measure that estimates the probability that a base was called incorrectly,given on a negative log scale(Q=-log10P(incorrect))so that a higher q-score indicates a more confident base call.Sequencing technology such as Illumina assigns a q-score to each called base in a read,and variant calling software uses the qscore from the reads used for calling a given variant site to synthesize an overall q-score for the site estimating the probability that the call is incorrect.A typical filtering value is 20,for a probability of 1%that the call of a variant at this site is due to sequencing errors.
To accurately call genotypes at a variant site it is necessary that the depth of coverage at that site be sufficient.In the case of a tetraploid organism like alfalfa,a read depth of 15 is sufficient to differentiate a homozygote from a heterozygote,but a depth of 50 or 60 is needed to differentiate among the heterozygous dosages.In filtering for this distinction,genotypes with read depth below the threshold are marked as unknown.
Filtering by minor-allele frequency is also common.Alleles occurring at a low frequency in the population are less likely to be informative,and a common cutoff is 5%;loci for which either allele comprises less than this fraction of the samples are excluded from subsequent analysis.
Loci for which too many samples are of unknown genotype are not useful,and including them in analysis will add more noise than signal.A typical cutoff is 50%[46].
In the case of biparental populations,it is possible to filter loci based on the plausibility of the observed proportions of the different genotypes among the offspring.Based on the parental genotypes,the expected proportions of offspring genotypes are calculated and loci whose deviation from these ratios exceeds a significance threshold are discarded.A typical cutoff is P<0.001.This type of filtering is common in linkage-map construction and requires that loci with missing parental genotypes be discarded as well.
If a calling method that maps reads to a reference genome is used,then loci may be filtered based on LD with nearby loci.
Regardless of pipeline,GBS results leave a large number of plant genotypes uncalled.It is therefore necessary to impute the missing genotypes,and there are several methods for doing so.Beforehand,a cutoff is established and any locus with more than that number of missing genotypes is dropped;cutoffs in the range of 30%–50%have been found to be optimal for final accuracy of breeding value prediction[46].
Common methods for imputing missing genotypes include mean-value imputation(MNI),k-nearest neighbor imputation(kNNI),singular value decomposition imputation (SVDI),random forest imputation(RFI),expectation maximization imputation(EMI),localized haplotype clustering imputation(LHCI),and Fast inbred line library imputation(FILLIN)(Table 1).Several recent efforts have been made to compare the accuracy of these methods,evaluating them based on their ability to correctly call simulated missing data or the accuracy ofgenomicselection based on the resulting datasets[46,51,52].Rutkoski et al.[51]compared MNI,kNNI,SVDI,EMI,and RFI in GBS data from wheat(Triticum aestivum L.,2n=6x=42,self-pollinating),barley(Hordeum vulgare L.,2n=2x=14,self-pollinating),and maize(Zea mays L.,2n=2x=20,selfpollinating).RFI and kNNI were found to be the most accurate methods.Annicchiarico et al.[46]compared MNI,SVDI,RFI,and LHCI,on alfalfa GBS data,finding RFI to be the most accurate.Nazzicari et al.[52]compared kNNI,RFI,SVDI,MNI,LHCI,and FILLIN on GBS data from alfalfa and rice(O.sativa,2n=2x=24,self-pollinating).The study found that,in alfalfa,kNNI and RFI performed the best.
Table 1–Marker imputation techniques.
RFI and kNNI come out as the generally preferable methods in most of these studies,with RFI having a slight edge in accuracy but a higher computational cost than kNNI.Both also share the advantage of not requiring that the reads be mapped to a reference genome,as is the case for the haplotype-based methods,and thus may be used with variant data from the UNEAK pipeline.
Once a final set of variant markers has been obtained,a statisticalmodelmustbeusedtoattempttoidentify associations between these markers and the phenotype(s)that have been measured.There are many models to choose from(Table 2),and several recent studies have compared popular models.Heslot et al.[53]compared several on the basis of accuracy,as defined by the correlation between observed phenotypic values and the values predicted by the models.The study compared Bayesian and non-Bayesian ridge regression and lasso models,weighted Bayesian shrinkage,BayesCπ,empirical Bayes,RKHS,random forest regression(RF),support vector regression(SVR),and a hidden-layer artificial neural network trained to predict breeding values.Linear combinations of multiple models were also tested.The study ultimately recommended the Bayesian lasso and the weighted Bayesian shrinkage regression models for use in genomicselection,andcharacterizedrandomforestas promising but not sufficiently studied to recommend for use at the time.Annicchiarico et al.[46]also included a comparison of models,comparing non-Bayesian ridge regression,Bayesian lasso,BayesA,BayesB,and support vector regression by comparing predicted with actual breeding values based on phenotypes of half-sib progeny of plants from two different populations.SVRperformedwellhere,showinghigher accuracy than the other methods for the Po Valley dataset,with non-Bayesian ridge in second place.For the Mediterranean dataset,there was little consistent difference among the methods in terms of GEBV accuracy.Shikha et al.[54]tested non-Bayesian ridge,lasso,RKHS,random forest,elastic net,BayesA and BayesB on an inbred maize panel phenotyped forproductivity underdroughtstressatdifferentsitesin Telangana,India.The study found BayesB to give the most accurate predictions of phenotype in their maize population.
Table 2–Marker-trait association models.
From the varied results of these studies,it seems clear that the different models will give different relative performance in different populations and under different conditions;there does not appear to be any one model that is best in all circumstances.More research is needed to tease out the factors that affect the relative performance of the models,and to develop guidelines for the selection of a model given the population under study.Studies that systematically vary such factors as population size,type,and diversity,and test the prediction accuracy of the various models,will be useful in establishing these guidelines.Simulation studies may be useful here,as many populations can be simulated with parameters that are difficult to vary systematically in a real population,including the parameters of the trait under study such as heritability,distribution of marker effects,and prevalence of non-additive effects.Comparison with selected real-world datasets where genotyping and phenotyping has been applied to a real population can help verify that any relationships uncovered in simulation are likely to hold in real breeding programs.
In an individual genomic selection program,it is useful to perform cross-validation on the statistical model in order to ensure its accuracy and applicability to the population under study.Cross-validation is a technique wherein sub-samples of the training population(genotypes+phenotypes)are selected and each used to train the model[55].The model is then evaluated based on its ability to predict the phenotypes of the remainder of the population based on its genotypes.In each iteration,the subset used to train the model is called the training set and the remainder is called the validation set.The final measure is the correlation between the predicted and actual phenotypic measurements for all the selected subsets.It is possible,and given the lack of consensus on a single best model,desirable,to testmultiplemodelswith crossvalidation before selecting one for use in generating the GEBVs to be used for selection.The marker imputation methods are also amenable to testing by cross-validation.
There are several methods for selecting the subsets.In Monte Carlo cross-validation,the subsets are selected purely at random and are free to overlap,generally with 50%–80%used for the training set each time.In k-fold cross-validation,the data is partitioned into k subsets at random,and each such subset is used,in turn,as the validation set.Commonly,k is 5 or 10.In leave-one-out cross-validation,each individual,in turn,is used as the validation set.This is equivalent to kfold where k is equal to the population size.
Genomics in alfalfa continues to progress,with a large number of studies in the past five years producing large sets of markers,including SNPs and SSRs.The set of available markers associated with traits of agronomic interest continues to grow.Genomic selection has been shown to be a highly effective method for selection with estimated breeding values,and compares favorably with other EBV methods.There are,at present,many choices of model and method,and selection of the most effective choices depends on many factors of both the population and the breeding program,including population size and diversity,distribution of marker effects,and time horizon of the selection program.Many studies have compared methods and models in particular instances,but more systematic testing is needed to tease out the ways in which population parameters and choice of model together affect gains,and to produce useful heuristics for choosing among them given the selection program parameters and what is known of the population.
At present,some recommendations can be made for an alfalfa genomic selection breeding program.kNNI for marker imputation receives recommendations in two recent comparison studies(the third did not test it).RFI also performed well in all three studies,but its computational cost is much higher than that of kNNI.There is less consensus on association models,with different comparison studies arriving at different recommendations.Bayesian lasso,wBSR,SVR,and BayesB have all received recommendations,SVR from a study of alfalfa.As to pipeline,the TASSEL GBS,UNEAK,and Freebayes pipelines are comparable for variant calling in tetraploid alfalfa.A good approach is to test multiple models on the training set using cross-validation and choose the one with the highest accuracy.
Acknowledgments
This work was supported by the United States Department of Agriculture NIFA_AFRP(2015-70005-24071)and the Agricultural Research Service base fund.