Luyan Zhang, Lei Meng, Jiankang Wang*
The National Key Facility for Crop Gene Resources and Genetic Improvement, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, China
Keywords: Pure lines; Four-way cross; Eight-way cross; Recombination frequency estimation; Integrated software

ABSTRACT

Pure lines derived from multiple parents provide abundant variation for genetic study. However, efficient genetic analysis methods and user-friendly software are still lacking. In this study, we developed linkage analysis methods and integrated analysis software for pure-line populations derived from four-way and eight-way crosses. First, polymorphic markers are classified into different categories according to the number of identifiable alleles in the inbred parents. Expected genotypic probabilities are then derived for each pair of complete markers, and based on them a maximum likelihood estimate (MLE) of recombination frequency is calculated. An EM algorithm is proposed for calculating recombination frequencies in scenarios in which at least one marker is incomplete. A linkage map can thus be constructed using the estimated recombination frequencies. We describe a software package called GAPL for recombination frequency estimation and linkage map construction in multi-parental pure-line populations. Both simulation studies and results from a reported four-way cross recombinant inbred line population demonstrate that the proposed method and software can build more accurate linkage maps in shorter times than other published software packages. The GAPL software is freely available from www.isbreeding.net and can also be used for QTL mapping in multi-parental populations.
Multi-parent Advanced Generation Inter Cross (MAGIC) populations are becoming increasingly common in genetic studies. Compared with conventional bi-parental populations, multi-parental populations harbor increased allelic and phenotypic diversity, leading to denser recombination events and higher mapping accuracy [1-3]. Compared with natural populations, kinship among progenies from multi-parental crosses is clear, so that there is no uncertainty about population structure [4]. Pure-line populations can be planted repeatedly in multiple years and locations to increase the accuracy of phenotyping and the detection power for quantitative trait loci (QTL), and to perform QTL-by-environment interaction analysis [5-8]. These advantages have accelerated the development of multi-parental pure-line populations during the last decade.
Multi-parental pure-line designs were first proposed in mice [9], and have since been applied in plant species. For example, in Arabidopsis thaliana, Kover et al. [1] described the first set of MAGIC lines and developed analytical methods to fine-map QTL. Klasen et al. [10] evaluated the statistical power of QTL detection in multi-parental recombinant inbred line (RIL) populations. In rice, Bandillo et al. [11] developed four multi-parental pure-line populations and used genome-wide association mapping for QTL identification. Ponce et al. [12] developed an eight-parent RIL population and used association mapping to detect QTL associated with cooking and eating quality in indica rice. In wheat, Huang et al. [13] constructed a linkage map in a four-parent RIL population. Würschum et al. [14] performed association mapping in a six-parent doubled-haploid (DH) triticale population. In soybean, Shivakumar et al. [15] developed an eight-parent MAGIC population by employing two-way, four-way, and eight-way intercross hybridization. In barley, Sannemann et al. [16] incorporated multi-locus QTL analysis and cross validation for flowering time in the first eight-parent DH population.
Linkage-map construction is a crucial step of genetic analysis, providing basic chromosomal information for map-based gene cloning and marker-assisted breeding [17]. Linkage-analysis methodology has been less investigated in multi-parental pure-line populations than in bi-parental populations [18]. The number of alleles and marker types at each locus in multi-parental populations is much larger than that in bi-parental populations. This property complicates methods for recombination frequency estimation and linkage map construction. To date, most studies aimed at gene detection in multi-parental pure-line populations have been based on association mapping, where no linkage map is needed. To fully exploit the potential of these populations in genetic studies, efficient and accurate methods for linkage analysis are needed. Some R packages provide functions for linkage map construction in multi-parental RIL populations, including R/qtl [19], R/happy [20], and R/mpMap [21]. However, these packages lack user-friendly interfaces, and the efficiency of the linkage analysis methods in these tools has not been investigated systematically. In addition, these packages are not adapted to multi-parental DH populations.
In this study, we focused on pure-line populations derived from four-way and eight-way crosses. Our objectives were: (1) to estimate recombination frequencies between markers with various numbers of identifiable alleles and then construct a linkage map, (2) to develop an integrated software package for genetic analysis, and (3) to demonstrate the advantages of the proposed method and software by simulation studies and analysis of a reported four-way cross wheat RIL population.
Four inbred parental lines, parents A, B, C, and D, are needed to make a four-way cross. Two kinds of pure lines can be derived from a four-way cross. One, DH lines, is produced by embryo rescue and pollen culture technology, and the other, RILs, is produced by repeated selfing and single-seed descent from the four-way cross F2 (Fig. 1). Some markers may have four identifiable alleles in the four parental lines, but others may have fewer. According to the number of identifiable alleles in the four parental lines, 14 marker categories may be defined for any polymorphic marker: ABCD, AACD, ABCC, ABAD, ABCA, ABBD, ABCB, AACC, ABAB, ABBA, ABBB, ABAA, AACA, and AAAD [8]. Markers belonging to category ABCD carry complete information, with the four parents carrying four identifiable alleles, denoted by A, B, C, and D. The corresponding genotypes are denoted as AA, BB, CC, and DD, following the Mendelian ratio of 1:1:1:1 in DHs or RILs when no distortion occurs. Markers belonging to the other 13 categories carry incomplete information, such that the four alleles in the parents cannot be distinguished unambiguously. For example, for a marker belonging to category AACD, the alleles in parents A and B are the same, but differ from the alleles in parents C and D. In derived DH or RIL populations, only three homozygous genotypes can be observed at such a marker locus: AA, CC, and DD, following the Mendelian ratio of 2:1:1 when no distortion has occurred.
Eight inbred parental lines, parents A, B, C, D, E, F, G, and H, are needed to make an eight-way cross. Similarly, DH lines and RILs can be produced from the eight-way cross (Fig. 1). For pure lines from an eight-way cross, a total of 4139 marker categories may be defined. Markers belonging to category ABCDEFGH represent the ideal situation, in which the parents carry eight identifiable alleles, denoted by A, B, C, D, E, F, G, and H. Their corresponding genotypes are denoted as AA, BB, CC, DD, EE, FF, GG, and HH. Markers belonging to the remaining categories are called incomplete loci. For example, marker categories may be AACDEFGH, ABCDEFGG, AAADEFGH, and so on. Missing genotypes in both populations are coded as XX.
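The category counts above can be checked by enumeration. The following is a minimal sketch, assuming that a category is simply a grouping of the parents by shared allele, with each allele labelled by the first parent that carries it, and that the monomorphic category (all parents identical) is excluded. The function names `marker_categories` and `mendelian_ratio` are hypothetical and are not part of GAPL.

```python
PARENTS = "ABCDEFGH"


def marker_categories(n_parents):
    """List the polymorphic marker categories for n_parents inbred parents.

    Each parent's allele is labelled by the first parent that carries it,
    so 'AACD' means that parent B shares the allele of parent A.
    """
    categories = []

    def extend(prefix):
        i = len(prefix)
        if i == n_parents:
            if len(set(prefix)) > 1:           # drop the monomorphic category
                categories.append("".join(prefix))
            return
        for label in sorted(set(prefix)):      # share an earlier parent's allele
            extend(prefix + [label])
        extend(prefix + [PARENTS[i]])          # or carry a new allele

    extend([])
    return categories


def mendelian_ratio(category):
    """Expected ratio of identifiable genotypes without distortion."""
    return {a + a: category.count(a) for a in sorted(set(category))}


print(len(marker_categories(4)))   # 14
print(len(marker_categories(8)))   # 4139
print(mendelian_ratio("AACD"))     # {'AA': 2, 'CC': 1, 'DD': 1}
```

Under this assumption the counts equal the number of set partitions of the parents minus one, which gives 14 for four parents and 4139 for eight parents.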
For clarity, pure-line populations are denoted as 4PDH and 4PRIL when derived from four-way crosses, and 8PDH and 8PRIL when derived from eight-way crosses.
For 4PDH and 4PRIL, assume that marker loci 1 and 2 are linked, each falling into one of the 14 categories previously described. Let A1, B1, C1, and D1 denote the four alleles at locus 1 and A2, B2, C2, and D2 the four alleles at locus 2. The one-meiosis recombination frequency between the two loci is denoted as r. Based on the 14 marker categories, 105 scenarios (the unordered pairs of categories, 14 × 15/2) may be considered to estimate r. The ideal scenario is represented by the case of two complete markers. Table 1 shows the theoretical probabilities of the 16 identifiable genotypes. The likelihood function (L) and natural logarithm of the likelihood (ln L) are given in Eq. (1) for 4PDH and Eq. (2) for 4PRIL, respectively.
Here n1, n2, …, and n16 are the sample sizes of the 16 genotypes and C is a constant independent of the unknown recombination frequency. For example, n1 to n4 are the sample sizes of genotypes A1A1A2A2, A1A1B2B2, A1A1C2C2, and A1A1D2D2, respectively, and n13 to n16 are the sample sizes of genotypes D1D1A2A2, D1D1B2B2, D1D1C2C2, and D1D1D2D2, respectively.
Solving the likelihood equation obtained by setting d ln L/dr = 0 yields the maximum likelihood estimate (MLE) of recombination frequency, given in Eq. (3) for 4PDH and Eq. (4) for 4PRIL.
Table 1 - Probabilities of 16 pairwise marker types in pure-line populations from four-way crosses when both markers belong to category ABCD. A1, B1, C1, and D1 are the four alleles at one marker locus. A2, B2, C2, and D2 are the four alleles at the other locus. r is the one-meiosis recombination frequency.
In Eqs. (3) and (4), n is the total sample size, i.e., n = n1 + … + n16.
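The estimation in the ideal scenario can be illustrated with a short sketch. It does not copy Table 1 or Eqs. (3) and (4); instead it assumes the cross (A×B)×(C×D), no interference, and no distortion, under which the 16 genotype probabilities take the forms coded in `expected_probs` below (our derivation, consistent with the ratios quoted in the text for the EM step). Under those assumed probabilities the log-likelihood is linear in ln r, ln(1 − r) and, for RILs, ln(1 + 2r), and the MLE has the closed forms in `mle_4pdh` and `mle_4pril`. All function names are hypothetical.

```python
def expected_probs(r, population):
    """Assumed 4 x 4 two-locus probabilities (rows: locus 1 = A, B, C, D)."""
    partner = {0: 1, 1: 0, 2: 3, 3: 2}        # mate within A×B or C×D
    probs = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if population == "DH":
                if i == j:
                    p = (1 - r) ** 2 / 4       # same parent at both loci
                elif partner[i] == j:
                    p = r * (1 - r) / 4        # parents from the same single cross
                else:
                    p = r / 8                  # parents from different single crosses
            else:                              # "RIL", repeated selfing
                p = ((1 - r) if i == j else r) / (4 * (1 + 2 * r))
            probs[i][j] = p
    return probs


def count_classes(counts):
    """Pool the 16 counts into same-parent, within-cross and between-cross classes."""
    same = sum(counts[i][i] for i in range(4))
    within = counts[0][1] + counts[1][0] + counts[2][3] + counts[3][2]
    total = sum(map(sum, counts))
    return same, within, total - same - within, total


def mle_4pdh(counts):
    """Closed-form MLE of r for DH lines under the assumed probabilities."""
    same, within, between, _ = count_classes(counts)
    return (within + between) / (2 * same + 2 * within + between)


def mle_4pril(counts):
    """Closed-form MLE of r for RILs under the assumed probabilities."""
    same, _, _, total = count_classes(counts)
    return (total - same) / (total + 2 * same)


# Expected counts for 200 lines at r = 0.1 return the true value in both cases.
for pop, mle in (("DH", mle_4pdh), ("RIL", mle_4pril)):
    counts = [[200 * p for p in row] for row in expected_probs(0.1, pop)]
    print(pop, round(mle(counts), 6))
```

Only three pooled counts matter for DH lines and one for RILs, which is why the estimators can be written in terms of a few sums of n1 to n16 and the total sample size n.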
Table 2 shows the theoretical probabilities of the 64 identifiable genotypes in the ideal scenario for 8PDH and 8PRIL. Let n1, n2, …, and n64 be the sample sizes of the 64 genotypes. For example, n1 to n8 are the sample sizes of genotypes A1A1A2A2, A1A1B2B2, …, A1A1H2H2; n57 to n64 are the sample sizes of genotypes H1H1A2A2, H1H1B2B2, …, H1H1H2H2. The MLE of recombination frequency is given in Eq. (5) for 8PDH; for 8PRIL, the estimate r̂ is the solution of Eq. (6).
The EM algorithm [22] is used for estimating recombination frequency in the other scenarios. The initial value of r is set at 0.25. In the E step, the sample sizes ni are calculated or updated using the probabilities of genotypes in the ideal scenario. Consider, for example, 4PDH and 4PRIL, and the scenario in which one marker category is ABCD and the other AACD. Locus 1 has four identifiable genotypes: A1A1, B1B1, C1C1, and D1D1, and locus 2 has three identifiable genotypes: A2A2+B2B2 (i.e., A2A2 or B2B2), C2C2, and D2D2. Table 3 shows the theoretical probabilities of the 12 identifiable genotypes, with sample sizes denoted by N1, N2, …, N12, respectively.
Eight of the 12 genotypes in Table 3 are exactly the same as in the ideal scenario in Table 1. The other four genotypes are combinations of two genotypes in Table 1. For example, N1 is the sample size of the genotype having A1A1 at locus 1 and A2A2+B2B2 at locus 2. Given A1A1 at locus 1, the ratio of the probabilities of genotypes A2A2 and B2B2 is (1 − r):r in both 4PDH and 4PRIL. So N1 is divided into n1 (corresponding to genotype A2A2 at locus 2) and n2 (corresponding to genotype B2B2 at locus 2) in the ratio (1 − r):r. Similarly, N4 is divided into n5 (corresponding to genotype A2A2 at locus 2) and n6 (corresponding to genotype B2B2 at locus 2) in the ratio r:(1 − r); N7 is divided into n9 (corresponding to genotype A2A2 at locus 2) and n10 (corresponding to genotype B2B2 at locus 2) in the ratio 1:1; and N10 is divided into n13 (corresponding to genotype A2A2 at locus 2) and n14 (corresponding to genotype B2B2 at locus 2) in the ratio 1:1. Thus, n1 = (1 − r)N1, n2 = rN1, n5 = rN4, n6 = (1 − r)N4, n9 = n10 = N7/2, and n13 = n14 = N10/2, while the remaining eight sample sizes are observed directly.
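The E-step allocation described above, combined with an update of r from the completed 16-genotype table, can be sketched as follows for the ABCD × AACD scenario. The splitting ratios, the starting value 0.25 and the stopping precision 1 × 10⁻⁶ come from the text; the closed-form update in `m_step` is an assumption derived from the cross (A×B)×(C×D) without interference, not a transcription of Eqs. (3) and (4). The function names and the example counts are hypothetical.

```python
def e_step(N, r):
    """Split the merged A2A2+B2B2 column of the 4 x 3 table N into a 4 x 4 table.

    Rows of N: locus 1 = A, B, C, D; columns: A2A2+B2B2, C2C2, D2D2.
    """
    n = [[0.0, 0.0, N[i][1], N[i][2]] for i in range(4)]   # C2C2, D2D2 observed
    n[0][0], n[0][1] = (1 - r) * N[0][0], r * N[0][0]      # N1  -> n1, n2
    n[1][0], n[1][1] = r * N[1][0], (1 - r) * N[1][0]      # N4  -> n5, n6
    n[2][0] = n[2][1] = N[2][0] / 2                        # N7  -> n9, n10
    n[3][0] = n[3][1] = N[3][0] / 2                        # N10 -> n13, n14
    return n


def m_step(n, population):
    """Update r from completed counts (assumed closed-form MLE)."""
    same = sum(n[i][i] for i in range(4))
    within = n[0][1] + n[1][0] + n[2][3] + n[3][2]
    total = sum(map(sum, n))
    if population == "DH":
        between = total - same - within
        return (within + between) / (2 * same + 2 * within + between)
    return (total - same) / (total + 2 * same)             # "RIL"


def em_abcd_aacd(N, population="DH", r0=0.25, tol=1e-6, max_iter=1000):
    """Iterate E and M steps until r changes by less than tol."""
    r = r0
    for _ in range(max_iter):
        r_new = m_step(e_step(N, r), population)
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r


# Illustrative counts for 200 lines (made up for this sketch).
N = [[45, 3, 2], [44, 2, 3], [5, 41, 4], [6, 5, 40]]
print(em_abcd_aacd(N, "DH"), em_abcd_aacd(N, "RIL"))
```

Other incomplete categories would need their own allocation rules in the E step, but the update of r from the completed table stays the same.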
In the M step, r is updated using Eqs. (3) and (4) and used to recalculate the probabilities of the 16 genotypes in Table 1. The EM iterations continue until the difference in r between two consecutive iterations falls below a predefined precision criterion, by default 1 × 10⁻⁶. The MLE of recombination frequency is thus obtained and then used for linkage map construction. A combination of a nearest-neighbor algorithm and a two-opt algorithm for the Traveling Salesman Problem (TSP, [23]) is used for marker ordering, as in our previous studies [17,24,25]. The nearest-neighbor algorithm is used to determine an initial solution and the two-opt algorithm is then used to improve the solution.
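The two-stage ordering strategy can be illustrated with a generic sketch. It treats the markers as points of an open path, uses a pairwise distance matrix (for example estimated recombination frequencies or map distances), builds an initial order by nearest neighbor, and improves it by segment reversals (two-opt). The details of the GAPL implementation, such as the starting marker, the distance measure, and the stopping rule, are not given in the text, so everything below is an assumption for illustration.

```python
def path_length(order, dist):
    """Sum of distances between consecutive markers along the order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def nearest_neighbor(dist, start=0):
    """Initial order: repeatedly append the closest unvisited marker."""
    unvisited = set(range(len(dist))) - {start}
    order = [start]
    while unvisited:
        nxt = min(unvisited, key=lambda m: dist[order[-1]][m])
        order.append(nxt)
        unvisited.remove(nxt)
    return order


def two_opt(order, dist):
    """Reverse segments while doing so shortens the path."""
    best = list(order)
    best_len = path_length(best, dist)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = path_length(cand, dist)
                if cand_len < best_len - 1e-12:
                    best, best_len, improved = cand, cand_len, True
    return best


# Five markers at made-up positions (cM), listed in shuffled order.
positions = [3.0, 0.0, 4.5, 1.0, 2.2]
dist = [[abs(a - b) for b in positions] for a in positions]
initial = nearest_neighbor(dist)
print(initial, path_length(initial, dist))
final = two_opt(initial, dist)
print(final, path_length(final, dist))
```

In this toy case the nearest-neighbor order starts in the middle of the chromosome and ends with one long jump; a single segment reversal in the two-opt stage removes the jump.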
The proposed methods for estimating recombination frequencies and building linkage maps have been implemented in a software package named "Genetic Analysis of Multi-parental Pure-line Populations", or GAPL, which is freely available from www.isbreeding.net. GAPL is an integrated software package combining linkage analysis, map construction, and QTL mapping for pure-line populations from four-way and eight-way crosses. Core modules for recombination frequency estimation, linkage-map construction, and QTL mapping algorithms were written in FORTRAN 90/95. The user interface for the software was written in C#. GAPL runs on Windows XP/Vista/7/8/10, with Microsoft .NET Framework 2.0 (x86) or higher versions. GAPL is project-based software: all operations and files can be stored in projects, as with package QTL IciMapping for bi-parental populations [24] and package GACD for clonal F1 and four-way crosses [25].
Table 2 - Probabilities of 64 pairwise marker types in pure-line populations from eight-way crosses when both markers belong to category ABCDEFGH. A1, B1, C1, D1, E1, F1, G1, and H1 denote the eight alleles at one marker locus. A2, B2, C2, D2, E2, F2, G2, and H2 denote the eight alleles at the other locus. r is the one-meiosis recombination frequency.
To investigate the efficiency of our methods, two chromosomes were simulated. Twenty markers were unevenly distributed on chromosome I. The minimum and maximum recombination frequencies between two neighboring markers were 0.005 and 0.101, equivalent to 0.5025 cM and 11.2823 cM in mapping distance under the Haldane mapping function. One thousand populations each of 4PDH, 4PRIL, 8PDH, and 8PRIL, each consisting of 200 pure lines, were simulated with the genetics and breeding simulation tool QuLine [26,27]. No missing marker data points were simulated. For 4PDH and 4PRIL populations, seven markers were randomly chosen and assigned to category ABCD and the other 13 were randomly assigned to the other 13 categories. In the end, markers 1, 5, 8, 12, 17, 18, and 20 belonged to category ABCD and the other markers belonged to incomplete categories. In the 8PDH and 8PRIL populations, all markers belonged to category ABCDEFGH for simplicity.
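The conversion between recombination frequency and map distance quoted above follows the Haldane mapping function, d = −50 ln(1 − 2r) in cM. A minimal sketch, with hypothetical function names, reproduces the two distances given in the text:

```python
import math


def haldane_cm(r):
    """Map distance in cM for a recombination frequency r (0 <= r < 0.5)."""
    return -50.0 * math.log(1.0 - 2.0 * r)


def haldane_r(d_cm):
    """Recombination frequency for a map distance in cM."""
    return 0.5 * (1.0 - math.exp(-d_cm / 50.0))


print(round(haldane_cm(0.005), 4))   # 0.5025
print(round(haldane_cm(0.101), 4))   # 11.2823
print(haldane_r(1.0))                # r for markers spaced 1 cM apart
```

The inverse function is what a simulation needs when markers are placed at fixed map distances, as for the 1 cM spacing used on chromosome II.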
The first simulated population from each population type was selected for the demonstration of linkage-map construction. These populations are denoted as Pop1 to Pop4 for 4PDH, 4PRIL, 8PDH, and 8PRIL, respectively. To investigate the effect of distortion on map construction, populations with distorted markers were generated. Pop1 to Pop4 were used as a start, with no markers showing distortion. The steps for generating distorted populations were as follows. Based on the genotypes at marker 12, fifty AA or BB individuals were randomly deleted in Pop1 and Pop2 and fifty AA, BB, CC, or DD individuals were randomly deleted in Pop3 and Pop4. The distorted populations had only 150 individuals and are denoted as Pop5 to Pop8, respectively. For example, in Pop5, the percentages of genotypes AA, BB, CC, and DD at marker 12 were 16.00%, 12.67%, 36.00%, and 35.33%, and the P-value of the χ² test for distortion was 4.29 × 10⁻⁶. Markers 5, 6, 8, 9, 11, 13, 14, and 15 also showed segregation distortion at the significance level of 0.05, owing to their linkage with marker 12.
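The distortion test for marker 12 in Pop5 can be retraced with a short sketch. The genotype counts 24, 19, 54, and 53 are not stated in the text; they are inferred here from the reported percentages of 150 lines and should be read as an assumption. The test is a goodness-of-fit χ² test against the 1:1:1:1 ratio with 3 degrees of freedom, for which the upper-tail probability has a closed form, so only the standard library is needed. Function names are hypothetical.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 d.f."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def distortion_test(counts):
    """Chi-square test of four genotype counts against a 1:1:1:1 ratio."""
    if len(counts) != 4:
        raise ValueError("the closed-form P-value here assumes four classes")
    expected = sum(counts) / 4
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, chi2_sf_3df(stat)


# Counts of AA, BB, CC, DD inferred from 16.00%, 12.67%, 36.00%, 35.33% of 150.
stat, p_value = distortion_test([24, 19, 54, 53])
print(stat, p_value)   # p_value is close to the reported 4.29e-06
```

For markers with fewer identifiable genotypes the expected ratio and the degrees of freedom change, for example 2:1:1 with 2 degrees of freedom for category AACD.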
Two hundred markers were evenly distributed on chromosome II. The distance between any two adjacent markers was set at 1 cM and the Haldane mapping function was used to convert mapping distance to recombination frequency. One bi-parental RIL population with 200 lines was simulated with QTL IciMapping [24] and denoted as Pop9. Pop9 can be regarded as a special case of 4PRIL by considering the third parent to be the same as the first one and the fourth parent to be the same as the second one. The category of all markers was ABAB. Before linkage map construction, the marker order was shuffled.
Recombination frequency estimation and linkage map construction in Pop1 to Pop9 were performed with GAPL. The Haldane mapping function was used to convert recombination frequency (r) to map distance (d) in cM. For comparison, the R/qtl and R/mpMap packages were also used for linkage map construction in Pop9. The best order in R/qtl was determined by the function "orderMarkers", where the initial order was established by a greedy algorithm and then refined by rippling. A window size of 3 was used for both ordering and rippling. The countxo method (comparing orders by counting the number of obligate crossovers) and the likelihood method were used in rippling. The best order in R/mpMap was determined by the function "mporder" with two-point ordering selected. Multipoint ordering in R/mpMap was not used for comparison, as its algorithm is based on R/qtl. The other parameters were set at their default values.
Fig. 2 - User interface of the integrated software package GAPL. A, interface of functionality SNP; B, interface of functionality BIN; C, interface of functionality PLM; D, interface of functionality PLQ.
The real dataset used in this study was derived from four Australian wheat cultivars (Yitpi, Baxter, Chara, and Westonia) [13]. A total of 1063 pure lines were generated by single-seed descent, and genotyped with SNPs, DArT markers, and microsatellites. Verbyla et al. [28] used the R/mpMap package as well as manual intervention to build the linkage map. The full genome was 5787.73 cM in length with 3230 markers distributed across the 21 wheat chromosomes plus three additional linkage groups. Marker intervals were predominantly shorter than 5 cM and the average marker interval was 1.79 cM. For comparison, the linkage map was rebuilt with GAPL. The numbers of groups and of markers in each group were the same as those in Verbyla et al.
Four functionalities have been implemented in the integrated software package GAPL version 1.2, i.e., (1) SNP, SNP genotypic data conversion; (2) BIN, binning of redundant markers; (3) PLM, map construction in multi-parental pure-line populations; and (4) PLQ, QTL detection in multi-parental pure-line populations. The four functionalities can act as a pipeline. The input file for one functionality can be found in the outputs of the previous functionality. Several examples are provided in an example folder in the software in three formats, i.e., pure text, Microsoft Excel 2003, and Excel 2007. Missing genotypes are allowed in any functionality; they are not used in recombination frequency estimation but may be imputed for QTL mapping using the linkage information.
The SNP functionality helps convert SNP data of DNA bases (i.e., A, T, G, or C) into a format that can be recognized in GAPL (A, B, C, D, E, F, G, or H) (Fig. 2-A). SNPs that are non-polymorphic in parents or progenies, or missing in one or more parents, are deleted in this functionality. The output of the functionality can be directly used as input to the next functionality of redundant-marker removal or linkage-map construction.
The BIN functionality in GAPL (Fig. 2-B) is similar to those of QTL IciMapping [24] and GACD [25]. Users can use BIN to remove redundancy and perform quality control of markers, for example by deleting markers with high missing rates or severe segregation distortion. The output of the functionality can be used as input to the next functionality of linkage map construction.
The PLM functionality was designed for linkage analysis and map construction in multi-parental pure-line populations (Fig. 2-C). Three steps are involved in map construction: grouping, ordering, and rippling. Algorithms used in the three steps are the same as those in QTL IciMapping [24] and GACD [25]. Users can build linkage maps by clicking buttons for grouping, ordering, and rippling in turn. They can also modify the constructed map at any step using the interface. PLM generates several files, including summary information of linkage maps, LOD scores, recombination frequencies and genetic distances between markers, and an input file for the next functionality, QTL mapping.
The PLQ functionality was developed for QTL mapping in multi-parental pure-line populations (Fig. 2-D). Three mapping methods are available in PLQ: single-marker analysis (SMA, [29]), interval mapping (IM, [30]), and inclusive composite interval mapping (ICIM, [8]). For each method, several parameters should be determined before mapping, for example the LOD threshold and the scan step size. A LOD threshold can also be determined by permutation testing. Both plain text files and figures are available to display mapping results including QTL positions, LOD scores, and effects at all scan positions.
Table 4 shows the average recombination frequency between neighboring markers estimated from 1000 simulated populations. The recombination frequency estimate was unbiased irrespective of population type. For example, the true recombination frequency between markers 1 and 2 was 0.090, and the estimated values were 0.090, 0.091, 0.091, and 0.090 in 4PDH, 4PRIL, 8PDH, and 8PRIL, respectively. The corresponding standard errors were 0.017, 0.014, 0.014, and 0.014, indicating high estimation accuracy.
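The kind of check summarized in Table 4 can be imitated on a small scale. The sketch below is not QuLine and does not reproduce the simulated chromosome; it simulates two complete (ABCD) markers in DH lines from the cross (A×B)×(C×D) by following the three meioses explicitly, without interference, and then applies a closed-form estimator derived under the same assumptions. Because both markers are complete here, the spread of the estimates is not directly comparable with the 4PDH value reported for markers 1 and 2, where marker 2 is incomplete. All names are hypothetical.

```python
import random
import statistics


def simulate_4pdh(r, n_lines, rng):
    """Counts of the 16 two-locus genotypes among n_lines DH lines."""
    counts = [[0] * 4 for _ in range(4)]

    def gamete(hap1, hap2):
        # A haplotype is (allele at locus 1, allele at locus 2).
        first, second = (hap1, hap2) if rng.random() < 0.5 else (hap2, hap1)
        if rng.random() < r:                   # recombinant gamete
            return first[0], second[1]
        return first

    for _ in range(n_lines):
        ab = gamete((0, 0), (1, 1))            # gamete of the A×B hybrid
        cd = gamete((2, 2), (3, 3))            # gamete of the C×D hybrid
        g1, g2 = gamete(ab, cd)                # gamete of the four-way F1, doubled
        counts[g1][g2] += 1
    return counts


def mle_4pdh(counts):
    """Closed-form MLE of r under the assumed DH probabilities."""
    same = sum(counts[i][i] for i in range(4))
    within = counts[0][1] + counts[1][0] + counts[2][3] + counts[3][2]
    between = sum(map(sum, counts)) - same - within
    return (within + between) / (2 * same + 2 * within + between)


rng = random.Random(2019)
estimates = [mle_4pdh(simulate_4pdh(0.09, 200, rng)) for _ in range(1000)]
print(statistics.mean(estimates), statistics.stdev(estimates))
```

The mean of the estimates should lie close to the simulated value of 0.09, which is the sense in which the estimator is described as unbiased.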
Table 4 - Means and standard errors of estimated recombination frequencies between neighboring markers on simulated chromosome I in simulated DH and RIL populations derived from four-way and eight-way crosses.
Fig. 3 - Linkage maps in Pop9 constructed with different software packages. A, GAPL: the correct order was achieved; B, R/qtl with the countxo method: the constructed map was broken by the three largest intervals; C, R/qtl with the likelihood method: the constructed map was broken by the two largest intervals; D, R/mpMap: markers are in incorrect order, as indicated in rectangular boxes.
In Pop1 to Pop4, no marker was distorted, and marker orders were the same as predefined, i.e., from marker 1 to marker 20. The estimated map lengths were 99.53, 99.34, 100.39, and 99.79 cM, respectively, close to the true value of 100.03 cM. In Pop5 to Pop8, some markers were distorted, but the correct orders were still achieved. The estimated map lengths were 97.84, 99.97, 101.65, and 100.30 cM, also close to the true value. It can be concluded that segregation distortion had little effect on recombination frequency estimation and linkage map construction.
Fig. 3-A shows the linkage map constructed by GAPL in Pop9. The marker order was the same as predefined, i.e., from marker 1 to marker 200. The length of the map was estimated at 200.58 cM, close to the true length of 199.0 cM. All markers were approximately evenly distributed. There were five intervals with length 0 cM: between markers 26 and 27, 82 and 83, 88 and 89, 135 and 136, and 195 and 196, as no crossovers were observed in those intervals in Pop9. The maximum length of a marker interval was 2.70 cM, observed both between markers 103 and 104 and between markers 108 and 109. Differences in lengths of marker intervals were caused by random recombination events in a limited-size population. Grouping, ordering, and rippling required 34 s on a personal computer, a Lenovo W510 (Windows 10, Intel Core i7 Q720 CPU @ 1.60 GHz).
With R/qtl, if the countxo method was used in rippling, the marker order was 141-…-172-58-…-1-200-…-173-59-…-140 (Fig. 3-B). Here "x-…-y" represents the continuous run of markers between markers x and y. For example, "141-…-172" is the order 141, 142 to 171, 172, and "58-…-1" is the order 58, 57 to 2, 1. The map length was estimated at 298.65 cM, 99.65 cM longer than the true length. Markers were not evenly distributed, and three long gaps were observed on the chromosome (Fig. 3-B). The maximum length of a marker interval was 34.66 cM, between markers 172 and 58, 1 and 200, and 173 and 59. Linkage map construction required 131 s on the same personal computer, much slower than with GAPL.
With R/qtl, if the likelihood method was used in rippling, the marker order was 1-…-68-137-…-200-69-…-136 (Fig. 3-C). The map length was estimated at 265.83 cM, 66.83 cM longer than the true length. Two long gaps were observed on the constructed map, each with length 34.66 cM, between markers 68 and 137 and between markers 200 and 69. Map construction required 5 h 21 min 36 s on the same computer, much slower than with GAPL and with the countxo method in R/qtl.
Fig. 4 - Linkage map constructed by GAPL using real data from a four-way cross RIL wheat population. Marker grouping is the same as reported in Verbyla et al. [28].
With R/mpMap, the marker order was also not the same as the predefined order. Markers 18 and 19, 164 and 165, and 198 and 199 were reversed (Fig. 3-D). The map length was estimated at 220.97 cM, 21.97 cM longer than the true length. The maximum length of a marker interval was 2.56 cM, between markers 96 and 97, 103 and 104, and 164 and 166. Linkage map construction required 48 s on the same computer, slower than with GAPL but faster than with R/qtl.
The map newly built by the GAPL software was 3807.10 cM in length: 1457.21 cM for the A genome, 1277.86 cM for the B genome, 1019.90 cM for the D genome, and 52.13 cM for the three additional groups (Fig. 4). This length is 1980.63 cM less than that reported by Verbyla et al. [28], and closer to those of other published linkage maps of wheat. The average marker interval was 1.01 cM. Each chromosome or group produced by GAPL was shorter than that produced by Verbyla et al.
In this study, we developed linkage-analysis methods for multi-parental pure-line populations. The unbiasedness and efficiency of our methods in recombination frequency estimation were confirmed by simulation study and by analysis of a reported four-way cross RIL population. Our methods have been implemented in the software package GAPL, available for linkage map construction and QTL mapping in multi-parental pure-line populations. GAPL built better maps in much shorter time than R/qtl and R/mpMap. It is the first software package that is freely available for DH populations derived from four-way and eight-way crosses.
Missing marker genotypes contribute no information in linkage analysis, but they can be imputed from the constructed linkage map and then used for the next step of QTL mapping. An accurate linkage map leads to better imputation and consequently more reliable QTL identification and genomic analysis. Some algorithms have been proposed (for example, [31,32]) for detecting and removing genotyping errors to improve the accuracy of linkage analysis. However, these algorithms were developed only for bi-parental populations and may not be suitable for multi-parental populations. The efficiency of algorithms for missing-genotype imputation and genotyping error correction in multi-parental populations needs further study. Once tested, suitable algorithms will be implemented in the next version of GAPL.
GAPL can be directly applied to pure-line populations from fewer than eight parents. For example, a top-cross represented by (A×B)×C is equivalent to the four-way cross (A×B)×(C×D), where parent C is the same as D. A cross between two top-crosses, [(A×B)×C]×[(E×F)×G], is equivalent to the eight-way cross [(A×B)×(C×D)]×[(E×F)×(G×H)], where parent C is the same as D and G is the same as H. The linkage analysis methods and the GAPL software described in this study cannot be directly used for populations derived from more than eight parents. Extension to such populations may not be easy, owing to the difficulty of tracing the allelic origins of progeny to the parents and of deriving the theoretical probabilities of marker types. In future, when such populations become common in genetic studies, the algorithms we have described may need to be expanded to more complex multi-parental populations. The current version of GAPL is 32-bit and the upper limit of marker number is around 20,000. If more markers are involved, stack or memory overflow may occur. Another version of GAPL, for example a 64-bit version, a command-line version, or a Linux version, might allow the upper limit of marker number to be increased.
GAPL is freely available at http://www.isbreeding.net.
This work was supported by the National Key Research and Development Program of China (2016YFD0101804), the National Natural Science Foundation of China (31671280), and HarvestPlus (part of the CGIAR Research Program on Agriculture for Nutrition and Health, http://www.harvestplus.org/).
Conflicts of interest
None.