Vasu Arora, Neera Kapoor, Samar Fatma, Sarika Jaiswal, Mir Asif Iquebal, Anil Rai, Dinesh Kumar*
a Center for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
b School of Sciences, Indira Gandhi National Open University, New Delhi 110068, India
Keywords: Musaceae; Musa; Banana; Microsatellite; Short tandem repeat (STR); Primers
ABSTRACT The genus Musa, one of three genera in the monocotyledonous family Musaceae, includes bananas and plantains. Bananas are rich in vitamins C and B6, minerals, and dietary fiber and are a rich source of food energy, with carbohydrates accounting for 22%–32% of fruit weight. Molecular markers are valuable for crop improvement and population genetics studies. The availability of whole-genome sequences and in silico approaches has revolutionized bulk marker discovery. We describe an online genomic resource, BanSatDB (http://webtom.cabgrid.res.in/bansatdb/), holding the largest number (>341,000) of putative STR markers reported so far from the genus Musa, represented by three species: M. acuminata (110,000), M. balbisiana (107,000), and M. itinerans (124,000), from 11 chromosomes of each species. BanSatDB has also been populated with 580 validated STR markers from the published literature. It is based on a three-tier architecture using MySQL, PHP, and Apache. The markers can be retrieved using multiple search parameters including chromosome number(s), microsatellite type (simple or compound), repeat nucleotides (1–6), copy number, microsatellite length, pattern of repeat motif, and chromosome location. These markers can be used for Distinctness, Uniformity and Stability (DUS) tests for variety identification and for marker-assisted selection (MAS) in variety improvement and management. These STRs have also proved helpful in classifying Musa germplasm, distinguishing individual accessions, and developing a standardized procedure for genotyping. The markers can also be used in gene discovery and QTL mapping. The database represents a source of markers for developing and implementing new approaches to molecular breeding, which are required to enhance banana productivity.
The genus Musa is one of three genera of the monocotyledonous family Musaceae. It contains >70 species, represented by bananas and plantains. Banana is the fourth most important food in the world after rice, wheat, and maize in terms of production [1]. Banana crops are grown in >135 countries, accounting for >130 Mt of annual world production, one fourth of which is produced in India. More than 1000 varieties of banana are known worldwide, categorized into 50 groups. Only limited variety differentiation of banana by SSR and AFLP markers has been reported [2,3]. Banana productivity can be increased enormously by genetic improvement [4]. Intellectual property (IP) protection is required for banana varieties carrying commercially valuable epitope genes (edible vaccines) [5]. In such cases, STR-based profiling can offer rapid and economical protection of IP-containing banana varieties or lines. Markers are also required for genetic improvement, including testing of ploidy level. With the availability of whole-genome sequences, marker discovery can be performed in silico. Such genomic resources can be made available online for convenient use by the global community.
Since the whole-genome sequences of all three major species, M. balbisiana (550 Mb) [6], M. acuminata (523 Mb) [7], and M. itinerans (615 Mb) [8], have become available, these genomes can be mined for microsatellite repeats to be used as DNA markers. Such markers are useful in developing a standardized protocol to classify Musa germplasm and to distinguish individual accessions by genotyping. They can be used in studies of genetic diversity, germplasm characterization, and selection [9,10]. The wild relative of banana, M. itinerans, can be a valuable source of genomic resources for disease resistance and can be used in breeding programs [8].
STR markers flanking targeted genes can be used in an introgression program if there is an allele size difference between the donor and recipient parental populations. In segregating populations in variety improvement programs, such markers offer the advantages of rapid and cost-effective genotyping [11]. With the advent of next-generation sequencing (NGS) technology, in silico STR discovery has drastically reduced the cost and time of genotyping [12]. Limited numbers of markers have been developed to classify Musa germplasm, distinguish individual accessions, estimate genetic diversity, develop linkage maps, and perform MAS in Musa [9,10]. Flanking regions of STRs are usually highly conserved and thus suitable for designing locus-specific primers [3,13]. An in silico approach can be used for whole-genome-based mining along with automated primer design at a desired STR locus on a specific chromosome carrying a gene of interest. Although limited numbers of validated markers have been reported, they have not been compiled into a web genomic resource with an option to mine STRs from all available genomes of Musa species.
The four existing Web genomic resources for the genus Musa are Banana Genome Hub (BGH) (http://bananagenome-hub.southgreen.fr/home1) [14], Musa Marker Database (MMDb) (http://www.agrogene.ac.cn:8088/mumdb/index.html) [15], TropGENE Database (TropGeneDB) (http://tropgenedb.cirad.fr/tropgene/JSP/index.jsp) [16], and Musa Germplasm Information System (MGIS) (https://www.cropdiversity.org/mgis/gigwa) [17]. Although BGH and MGIS have SNP markers, they lack STR markers. The latest version of TropGeneDB (2016) contains 47,607 STRs mined from NGS sequences along with a set of 25 STRs from a study of Witherup et al. [18]. Among these STR records, only 625 include a chromosome number or linkage group. TropGeneDB does not permit a user to select a target chromosome with a specific STR location in order to use the STR as a marker closely linked to a gene targeted in a molecular breeding program. Because it is a static database, it offers no option for mining STRs along with primer design for genotyping. MMDb is limited to <175 thousand STRs from only two Musa genomes, M. acuminata and M. balbisiana, with none from M. itinerans. Although all three Musa species were covered by Rotchanapreeda et al. [19], that study was confined to 24 universal STR primer pairs based on STR discovery from a genomic library of one species (M. balbisiana).
The present study was aimed at genome-wide mining of microsatellite repeats and development of a user-friendly database containing microsatellites from M. balbisiana, M. acuminata, and M. itinerans along with all previously validated markers. A further aim was to identify primers for desirable loci on specific chromosomes for rapid genotyping.
For short tandem repeat (STR) mining, nucleotide sequences of the 11 chromosomes of M. acuminata DH Pahang v2 (473 Mb), the draft sequence of M. balbisiana (403 Mb), and the draft sequence assembly of M. itinerans (463 Mb) were downloaded from BGH. The MIcroSAtellite identification tool (MISA) (http://pgrc.ipk-gatersleben.de/misa/) [20] and custom Perl scripts were used to mine microsatellite markers from each of the three genomes. To overcome the computational limitations imposed by large sequence sizes, scripts were written to fragment the genomic sequences, perform marker mining, and analyze the output. Experimentally validated markers were collected from the literature, curated manually, and deposited in a database (Fig. 1).
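The repeat-detection step can be illustrated with a minimal Python sketch. It is not the MISA code or the authors' Perl scripts: it finds only perfect simple repeats with regular expressions, and the minimum copy numbers in MIN_REPEATS are assumed values chosen for illustration, since the parameter settings used for mining are not stated here. The function name find_simple_strs is hypothetical.

```python
import re

# Minimum copy number per motif length (1 = mono ... 6 = hexa).
# ASSUMPTION: illustrative thresholds only; the settings used for
# BanSatDB are not given in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_simple_strs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, copies) tuples for perfect simple
    repeats; coordinates are 1-based and inclusive."""
    seq = seq.upper()
    hits = []
    for unit_len, min_copies in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_copies - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "CACA"), which belong to a shorter motif class.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            copies = (match.end() - match.start()) // unit_len
            hits.append((match.start() + 1, match.end(), motif, copies))
    return sorted(hits)


if __name__ == "__main__":
    demo = "GGG" + "CA" * 8 + "TTG" + "A" * 12 + "CC"
    for hit in find_simple_strs(demo):
        print(hit)
```

Compound repeats, overlapping motifs on the opposite strand, and the fragmentation of large sequences are deliberately left out of this sketch.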
BanSatDB is an online relational database of microsatellites mined from banana genomes. It has a three-tier architecture consisting of client, middle, and database tiers (Fig. 2). It was developed with MySQL and includes tables for predicted STRs and experimentally validated markers. The database is linked to a user-friendly Web interface designed with the open-source scripting language PHP. The Primer3 [21] standalone tool is integrated for designing forward and reverse primers for microsatellite sequences.
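How a database tier of this kind might answer a marker query can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, column names, and the three demo rows are all hypothetical; they are not taken from the BanSatDB schema or data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of predicted STRs (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE predicted_str (
            id INTEGER PRIMARY KEY,
            species TEXT,
            chromosome TEXT,
            kind TEXT,          -- 'simple' or 'compound'
            motif TEXT,
            copies INTEGER,
            start_bp INTEGER,
            end_bp INTEGER,
            gc_percent REAL
        )
        """
    )
    # Invented rows for demonstration only.
    rows = [
        ("M. acuminata", "chr04", "simple", "CA", 12, 1500, 1523, 50.0),
        ("M. acuminata", "chr04", "simple", "A", 15, 9000, 9014, 0.0),
        ("M. balbisiana", "chr08", "simple", "CAG", 6, 300, 317, 66.7),
    ]
    conn.executemany(
        "INSERT INTO predicted_str "
        "(species, chromosome, kind, motif, copies, start_bp, end_bp, gc_percent) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
    return conn


def search(conn, species, chromosome, motif_len, min_copies):
    """Return (motif, copies, start, end) for loci matching the filters."""
    sql = (
        "SELECT motif, copies, start_bp, end_bp FROM predicted_str "
        "WHERE species = ? AND chromosome = ? "
        "AND LENGTH(motif) = ? AND copies >= ? "
        "ORDER BY start_bp"
    )
    return conn.execute(sql, (species, chromosome, motif_len, min_copies)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(search(db, "M. acuminata", "chr04", 2, 10))
```

Parameterized queries are used so that user-supplied search values are never concatenated into the SQL string, a choice that would apply equally to a PHP and MySQL middle tier.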
Fig.1–Workflow of BanSatDB.
The BanSatDB web server comprises the pages Home, About, Search, Analysis, Tutorial, and Team. The “Home” page describes the database and its potential use in molecular breeding. The “About” page describes the banana plant, its sequencing information, and the application of microsatellite markers. The “Analysis” page presents a genome-wide analysis of the data, shown graphically and in tabular form, covering different attributes of STR loci. “Search” is the main page, with two submenus: “Predicted STR” and “Experimentally Validated STR”. The “Predicted STR” search page provides the user with two search criteria: chromosome and whole genome. A microsatellite search is further categorized by motif type (mono, di, tri, tetra, penta, or hexa), repeat type (by typing the repeat motif), and repeat kind (simple or compound). For more customized criteria such as chromosome location, range of GC%, and copy number, an “Advance search” option is available, which also offers primer design for genotyping. “Experimentally Validated STR” allows a user to retrieve polymorphic or monomorphic loci. It also includes a more customized search option by motif type (mono, di, tri, tetra, penta, or hexa), repeat type, and repeat kind (simple or compound).
Fig.2–Three-tier architecture of BanSatDB.
Fig.3–Distribution of microsatellite types in Musa genomes.
Table 1–Motif-type distribution of microsatellites in Musa genomes.
Custom Perl scripts were written to study the transferability of STR markers across the three species of the genus Musa. MISA and Primer3 were used to generate markers and primers, respectively, for all three genomes.
Because the Musa genome is 523 Mb in size with 11 chromosomes in the haploid set, the expected average chromosome size is around 50 Mb [7]. Cultivars of Musa have been developed by crossbreeding between the two species M. acuminata and M. balbisiana, harboring the A and B genomes, respectively [22]. Both are triploid, but apparently may also occur as diploids [7]. The recently sequenced genome of M. itinerans, a close relative of the banana progenitors, can be used to test cross-species conservation by an in silico approach [8]. The genomes of these cultivars have a complex genetic structure owing to whole-genome duplication and paralogous clustering in the course of their evolution [6,23]. Using MISA, totals of 111,930, 107,228, and 124,020 microsatellite markers were identified in the M. acuminata, M. balbisiana, and M. itinerans genome sequences, respectively.
Fig.4–Frequencies of STRs based on their sizes in Musa genomes.
Table 2–Chromosome-wise abundance of STRs in the M.acuminata genome.
BanSatDB was populated with all of the microsatellites mined from the three genomes. Simple STRs (86%) were more abundant than compound STRs (14%) in all of the genomes (Fig. 3).
In the M. acuminata and M. itinerans genomes, among simple types, the “mono” repeat type (for example, (A)n) was most prevalent at 49.11% and 46.08%, respectively, followed by “di” repeats (for example, (CA)n) at 33.81% and 36.11%. In contrast, in M. balbisiana, the proportions of “mono” and “di” repeats were 40.15% and 41.96%, respectively. Although the dinucleotide repeat type is observed abundantly in eukaryotes [24], mononucleotide repeats were the most abundant type in the M. acuminata and M. itinerans genomes. Because the stringency of the MISA parameters for detecting “mono” repeats was not relaxed, the higher abundance of mononucleotide repeats might be due to sequencing errors inherent in the NGS technology used [25].
“Tri”-type repeats (for example, (CAG)n) constituted 15% of total markers in all three genomes, followed by “tetra” (1%–2%) and “penta” and “hexa” types (<0.5%) (Table 1).
For >60% of the STR markers predicted in all three genomes, the marker length was <10 bp, followed by 11–13 bp for approximately 21%–22% of total STR markers. Markers in the length range 14–25 bp accounted for 6%–7% of the total and those longer than 25 bp for 2%–3% (Fig. 4).
The chromosome-wise abundance of STRs and their motif frequencies in M. acuminata were determined. Total repeat content was proportional to chromosome length, as expected for ubiquitously distributed STR markers [26]. Chromosome 8 harbored the most markers and chromosome 2 the fewest. Chromosome 4 showed the highest marker density (320.33 per Mb) and chromosome 2 the lowest (265.25 per Mb). The average densities were 237 and 266 markers per Mb for M. acuminata and M. balbisiana, respectively, exceeding the density in Arabidopsis (157 markers per Mb) [27]. Densities reported for other crops include cucumber (367 markers per Mb) [28], rice (370–490 markers per Mb) [29], poplar (485 markers per Mb) [30], and grape (487 markers per Mb) [31]. The STR markers are evenly distributed, with homogeneity in terms of distance (Table 2).
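The average densities quoted above are consistent with dividing the genome-wide marker counts by the downloaded assembly sizes. The following Python sketch reproduces that arithmetic under this assumption; the M. itinerans value is derived here in the same way and is not a figure reported in the text.

```python
def markers_per_mb(n_markers, assembly_mb):
    """Average marker density in markers per megabase."""
    return n_markers / assembly_mb


def average_densities():
    # Marker counts from the MISA mining and assembly sizes (Mb) of the
    # sequences downloaded from BGH, both as given in the text.
    counts = {"M. acuminata": 111_930, "M. balbisiana": 107_228, "M. itinerans": 124_020}
    assembly_mb = {"M. acuminata": 473, "M. balbisiana": 403, "M. itinerans": 463}
    return {
        species: round(markers_per_mb(counts[species], assembly_mb[species]))
        for species in counts
    }


if __name__ == "__main__":
    for species, density in average_densities().items():
        print(f"{species}: {density} markers per Mb")
```

With these inputs the sketch gives 237 and 266 markers per Mb for M. acuminata and M. balbisiana, matching the reported averages, and 268 markers per Mb for M. itinerans.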
Fig.5–GC content of STRs in Musa genomes.
Fig.6–The flow of a database search in BanSatDB.
The largest proportion of STRs (61%–66%) had a GC content in the 0–25% range, followed by the 25–50% range (24%–30%). The lowest proportions were found in the 50–75% range (4%–5%) in the M. acuminata and M. itinerans genomes and in the 75–100% range (0.04%) in the M. balbisiana genome. Fig. 5 shows the distribution of microsatellite length in relation to GC proportion. No correlation was found between the size of an STR locus and its GC content.
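The GC classification can be sketched in Python as below. The treatment of bin boundaries is an assumption (lower bound inclusive, with 100% placed in the top bin), because the text does not say how values falling exactly on 25, 50, or 75 were assigned; the function names are hypothetical.

```python
def gc_percent(seq):
    """Percentage of G and C bases in an STR sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    gc = sum(1 for base in seq if base in ("G", "C"))
    return 100.0 * gc / len(seq)


def gc_bin(pct):
    """Assign a GC percentage to one of the four ranges used in the text.

    ASSUMPTION: lower bounds are inclusive and 100% falls in '75-100'.
    """
    if pct < 25:
        return "0-25"
    if pct < 50:
        return "25-50"
    if pct < 75:
        return "50-75"
    return "75-100"


if __name__ == "__main__":
    for repeat in ("A" * 12, "CA" * 8, "CAG" * 5, "GC" * 6):
        pct = gc_percent(repeat)
        print(repeat, round(pct, 1), gc_bin(pct))
```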
As the number of predicted STRs is large, geneticists and breeders may wish to select STR markers from specific genomic regions, which motivates the development of a database that enables such selection. Many such STR databases have been developed, including those for pigeon pea [32], chickpea [33], tomato [34], rice [35], maize, sorghum, soybean [36], and cotton [37]. Users can mine SSR markers on a desired chromosome and search for experimentally verified SSRs, along with primer generation, in BanSatDB (Fig. 6).
The extent of conservation of STR loci was determined by a computational approach identifying loci common to the three genomes. This analysis revealed that of a total of 207,983 STR markers obtained from the three genomes, only 5979 (2.87%) were conserved among all three. Pairwise comparison revealed that 13,649 STRs are conserved between the M. acuminata and M. balbisiana genomes and 17,020 between the M. acuminata and M. itinerans genomes (Fig. 7).
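The reported proportion can be checked, and the idea of a shared-locus comparison illustrated, with a short Python sketch. The set-intersection approach and the locus keys used in the demo are assumptions for illustration; the criterion by which the custom Perl scripts judged a locus to be conserved is not described in the text.

```python
def percent_conserved(n_conserved, n_total):
    """Conserved loci as a percentage of all loci, rounded to 2 decimals."""
    return round(100.0 * n_conserved / n_total, 2)


def shared_loci(*genomes):
    """Return the locus keys present in every genome.

    Each genome is an iterable of hashable keys. ASSUMPTION: a key
    uniquely identifies a comparable locus across species (for example,
    a motif combined with its flanking sequence); this is hypothetical.
    """
    key_sets = [set(genome) for genome in genomes]
    return set.intersection(*key_sets)


if __name__ == "__main__":
    # Counts taken from the text.
    print(percent_conserved(5979, 207_983))

    # Invented toy keys, for demonstration only.
    acuminata = {"CA|locus1", "GA|locus2", "A|locus3"}
    balbisiana = {"CA|locus1", "A|locus3"}
    itinerans = {"CA|locus1", "CAG|locus4"}
    print(shared_loci(acuminata, balbisiana, itinerans))
    print(shared_loci(acuminata, balbisiana))
```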
The results suggest that these transferable STR markers could be used for identification of markers associated with specific traits in other Musa species. EST-STR markers were found to be 100% transferable among the eight Musa genomes in one study [38]. Similar tests have been made among species of sugarcane [39], grape [40], sunflower [41], and pine [42]. Cross-taxon transferability of STR markers has also been observed among Musa and related species [15].
BanSatDB supports a microsatellite search in any of the three sequenced genomes, M. acuminata, M. balbisiana, and M. itinerans, using multiple parameters including microsatellite type (simple or compound), repeat type (mono- to hexanucleotide), copy number, length of marker, pattern of repeat motif, and chromosome location of a marker. Chromosome-wise microsatellites can be retrieved from any of the three Musa genomes. Once a marker is located, all of its information (chromosome, contig, or scaffold number, repeat kind, motif, repeat type, length of repeat, size of repeat, start of repeat, end of repeat, STR sequence, and GC content) is displayed on the result page. BanSatDB also supports a customized search on ranges of GC content, location (base-pair position), and copy number.
Fig.7–Transferability of STR across the three Musa genomes.
Users can view neighboring markers simply by clicking on a chromosome, contig, or scaffold number. Users are also given the option to replace degenerate bases with any alternative base (A, T, G, or C) and to specify the length of the flanking sequence. This feature was added because degenerate bases present in genome assemblies make primer design difficult without such an option.
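The two options described above can be sketched in Python as follows. This is an illustration rather than the BanSatDB implementation: the function names are hypothetical, the set of degenerate symbols is the standard IUPAC nucleotide ambiguity codes, and coordinates are assumed to be 1-based and inclusive.

```python
# IUPAC ambiguity codes other than A, C, G, T.
DEGENERATE = set("RYSWKMBDHVN")


def replace_degenerate(seq, substitute="A"):
    """Replace every degenerate base with one user-chosen base."""
    if len(substitute) != 1 or substitute not in "ACGT":
        raise ValueError("substitute must be one of A, C, G, T")
    return "".join(substitute if base in DEGENERATE else base for base in seq.upper())


def with_flanks(seq, start, end, flank):
    """Return (left flank, repeat, right flank) for a repeat at start..end.

    ASSUMPTION: start and end are 1-based inclusive positions; flanks are
    truncated at the sequence ends.
    """
    left = seq[max(start - 1 - flank, 0):start - 1]
    repeat = seq[start - 1:end]
    right = seq[end:end + flank]
    return left, repeat, right


if __name__ == "__main__":
    print(replace_degenerate("ACNGTRY", "G"))
    print(with_flanks("TTGGCACACACAGGTT", 5, 12, 3))
```

The cleaned flanking sequences would then be the input passed to a primer design tool such as Primer3.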
BanSatDB has been developed as a dynamic database that allows users to select STR loci having high probabilities of polymorphism, saving the time and cost of wet-lab-based polymorphism discovery. Given that STR polymorphism depends on the type, length, and structure of repeats (simple, compound, or interrupted) [43], flexible provisions have been made for STR mining. It is well known that the longer the STR repeat array, the higher is the probability of finding a large number of alleles or a high degree of polymorphism in a given gene pool. Simple repeats mutate faster than compound or interrupted repeats. Given that STR alleles are generated by slippage in DNA replication, a longer array at an STR locus mutates faster. For dinucleotide repeats, the threshold for such slippage events starts at STR loci containing >8 repeat units [44]. For example, (CA)20 will mutate faster than (CA)10. Users may select a larger repeat array that is closely linked to, or lies in the flanking regions of, their gene of interest for genotyping. The option to specify amplicon size further enables multiplexed genotyping with fluorescent dyes, supporting economical and accurate allele sizing during generation of STR allelic data [45]. This genomic resource can be valuable for Distinctness, Uniformity and Stability (DUS) tests for variety identification and for MAS in varietal improvement and germplasm management [34].
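The selection logic described above, preferring longer dinucleotide arrays above the slippage threshold, can be sketched in Python. The locus names and records below are invented for illustration, and ranking purely by copy number is a simplification of the considerations in the text.

```python
def rank_dinucleotide_candidates(loci, min_copies=9):
    """Keep dinucleotide loci with more than 8 repeat units, longest first.

    loci is an iterable of (name, motif, copies) tuples. The default
    min_copies=9 encodes the ">8 repeat units" threshold cited in the text.
    """
    kept = [
        locus
        for locus in loci
        if len(locus[1]) == 2 and locus[2] >= min_copies
    ]
    return sorted(kept, key=lambda locus: locus[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical loci, for demonstration only.
    loci = [
        ("locB", "CA", 10),
        ("locA", "CA", 20),
        ("locC", "GA", 8),
        ("locD", "CAG", 12),
    ]
    for locus in rank_dinucleotide_candidates(loci):
        print(locus)
```

In this toy input the (CA)20 locus is ranked ahead of the (CA)10 locus, the (GA)8 locus falls below the threshold, and the trinucleotide locus is excluded because the cited threshold applies to dinucleotide repeats.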
Validated STR markers previously reported in the literature can be retrieved from BanSatDB along with their forward and reverse primers. This retrieval can be further customized by selecting polymorphic or monomorphic loci along with the type, repeat, and kind of motif. The result page shows tabulated information on the markers with motif details, polymorphism, forward and reverse primer pairs, and references and links to the publications in which the markers were reported.
Acknowledgments
The authors thank the Indian Council of Agricultural Research (ICAR), Ministry of Agriculture and Farmers' Welfare, Government of India for providing all facilities. The authors also thank the Director, ICAR-IASRI, New Delhi for the use of the ASHOKA computational facility, where all analyses were performed using public-domain data. The English editing help of Prof. James Nelson, Kansas State University, USA is acknowledged with sincere thanks.