Cho Hung, Zhuo Chen, Chengzhi Ling*
a State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing 100101, China
b University of Chinese Academy of Sciences, Beijing 100049, China
Keywords: Oryza; Pan-genome; De novo assembly; Genetic diversity; Wild rice
ABSTRACT The wild rice species in the genus Oryza harbor a large amount of genetic diversity that has been untapped for rice improvement. Pan-genomics has revolutionized genomic research in plants. However, rice pan-genomic studies so far have been limited mostly to cultivated accessions, with only a few close wild relatives. Advances in sequencing technologies have permitted the assembly of high-quality rice genome sequences at low cost, making it possible to construct genus-level pan-genomes across all species. In this review, we summarize progress in current research on genetic and genomic resources in Oryza, and in sequencing and computational technologies used for rice genome and pan-genome construction. For future work, we discuss the approaches and challenges in the construction of, and data access to, Oryza pan-genomes based on representative high-quality genome assemblies. The Oryza pan-genomes will provide a basis for the exploration and use of the extensive genetic diversity present in both cultivated and wild rice populations.
The genus Oryza includes two cultivated rice species, Asian and African rice, and 25 extant wild rice species [1]. Asian rice (O. sativa) is the main source of food for more than one third of the world population [2]. In the last century, rice production was greatly increased twice, through dwarfing and through hybrid breeding [3]. However, the yield growth has been accounted for partly by excessive use of chemical fertilizers, pesticides, and water resources, leading to ecological and environmental problems [4]. Furthermore, population growth, reduction in available arable land, and other factors can lead to severe food shortages [5]. Fortunately, recent advances in genomic technologies have brought about continual improvements in rice yield, quality, and disease resistance, particularly with the advent of efficient and precise molecular breeding [6-8]. However, breeding selection tends to reduce genetic diversity in Asian rice, putting at risk the further improvement or maintenance of rice production [8-10], especially in the presence of environmental changes associated with global warming.
The long evolutionary history and world-wide distribution of O. sativa and other Oryza species have provided us with a rich source of germplasm and gene pools for rice improvement [11,12]. Wild genetic resources have contributed up to 30% of the increase in crop yield in the last century [13]. The exploration of these resources relies on research in functional and comparative genomics, with high-quality reference genomes as a foundation [14,15]. Since the International Rice Genome Sequencing Project [16] was proposed in 1998, and especially in the past few years, with improvements in DNA sequencing technology, many rice genomes have been sequenced and assembled [17-27]. High-quality genome sequences have permitted the identification of genome-wide genetic variation, facilitating precise and accurate gene mapping and association analysis [14,28]. However, recent studies have shown [29] that a single or a few reference genomes are not sufficient to cover the extensive genetic diversity in a population. The integration of multiple rice genome sequences to construct a pan-genome, which represents the full genetic information of a population rather than a specific individual, will provide a new foundation for the exploitation of genetic resources for rice improvement [30,31].
Pan-genome initially referred to the collection of all genes of a given clade [32]. The pan-genome includes core genes present in all strains and dispensable genes present only in a subset of strains [33]. When the gene is used as the basic unit of a pan-genome, the contribution of intergenic sequences to the genome function is ignored [34]. Therefore, for eukaryotes with a large proportion of intergenic sequence in their genomes, it is necessary to build pan-genomes that include all the DNA sequence in a set of selected genomes [34]. To date, pan-genomes have been developed in many plants, including rice [30,31,35,36], Arabidopsis thaliana [37], wheat [38], barley [39], soybean [40], maize [41], tomato [42], Brassica napus [43], B. oleracea [44], and Brachypodium distachyon [45]. The pan-genomes have been used to deeply analyze the large number of genetic variations in populations, providing information about population structure, species origin and domestication, functional genes, and breeding [30,31,35,36]. Several reviews [46-56] have summarized the construction methods, challenges and applications of plant pan-genomics. However, a high-quality pan-genome of Oryza is yet to be constructed by integration of the large number of available high-quality rice genome sequences.
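The gene-based definition above reduces to simple set operations. The following minimal sketch assumes that gene presence has already been summarized as one set of gene-family identifiers per accession; the function name, accession names, and gene IDs are hypothetical and serve only as illustration.

```python
from typing import Dict, Set


def classify_pan_genes(presence: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split a gene-based pan-genome into core and dispensable genes.

    `presence` maps an accession name to the set of gene-family IDs
    annotated in that accession (hypothetical input layout).
    """
    if not presence:
        return {"pan": set(), "core": set(), "dispensable": set()}
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)            # every gene seen in any accession
    core = set.intersection(*gene_sets)      # genes present in all accessions
    return {"pan": pan, "core": core, "dispensable": pan - core}


# Toy data: three accessions, five gene families.
toy = {
    "acc_A": {"g1", "g2", "g3"},
    "acc_B": {"g1", "g2", "g4"},
    "acc_C": {"g1", "g3", "g4", "g5"},
}
result = classify_pan_genes(toy)
print(sorted(result["core"]), sorted(result["dispensable"]))
```

In practice the presence calls themselves (from annotation, read mapping, or alignment) are the difficult part; the classification step is as simple as shown.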
In this review, we summarize the genetic and genomic resources in the genus Oryza and progress in rice genome and pan-genome construction. For future work to support rice functional genomics and genetic improvement, we propose to construct Oryza pan-genomes containing the genetic variation in Oryza, including both genes and genome sequences. We discuss the approaches and challenges in constructing Oryza pan-genomes based on high-quality genome assemblies. Development of integrated Oryza pan-genome database resources will ensure availability of the pan-genomic data to wet-lab researchers and breeders.
The Oryza species are currently divided into 11 genome types: AA, BB, CC, EE, FF, GG, BBCC, CCDD, HHJJ, HHKK, and KKLL [1] (Fig. 1), spanning about 15 million years of separation and differentiation [57]. Among the wild rices, 6 species share the same AA-genome type as the cultivated rice: O. rufipogon, O. nivara, O. barthii, O. glumaepatula, O. longistaminata, and O. meridionalis. Asian rice (AA) was domesticated from O. rufipogon and O. nivara [2,58], and African rice was independently domesticated from O. barthii [59]. Domestication involved removing or reducing shattering and dormancy and changing plant type from creeping to erect [28,60].
Asian rice is conventionally classified into two subspecies, xian/indica (XI) and geng/japonica (GJ), according to differences in morphological characteristics, agronomic characteristics, and ecological or geographic distribution. Recently [61-63], O. sativa was divided into five subpopulations: xian, aus, aromatic, temperate geng, and tropical geng (Fig. 2a), using DNA markers including simple sequence repeats (SSRs) and single-nucleotide polymorphisms (SNPs). The Fst value (fixation index, a measure of differentiation between populations) shows great divergence between these five subpopulations, with a maximum Fst value of 0.53 (between aus and temperate geng) and a minimum of 0.23 (between aus and xian) [61]. Based on the sequences of 3010 rice accessions, the O. sativa population was further divided into nine subpopulations closely associated with geographic origin (Fig. 2a), including four subpopulations of xian (XI-1A from East Asia, XI-1B with multiple origins, XI-2 from South Asia, XI-3 from Southeast Asia), three subpopulations of geng (GJ-tmp from the temperate zone of East Asia, GJ-sbtrp from subtropical Southeast Asia, and GJ-trp from tropical Southeast Asia), and cA and cB mainly from India and Bangladesh, respectively [30]. Common-ancestry analysis [64] showed that Asian rice was domesticated from its wild relatives around 362.4 thousand years ago, and the absolute divergence of the geng, xian, and aus lineages occurred as early as 18.3, 12.0, and 6.3 thousand years ago.
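To make the fixation index concrete, the sketch below computes Wright's Fst for a single biallelic locus in two populations as (H_T - H_S) / H_T, where H_S is the mean within-population expected heterozygosity and H_T is the expected heterozygosity of the pooled population. It assumes equal population sizes and known allele frequencies, and it is not necessarily the estimator used in [61]; the function name and the frequencies are illustrative.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """Wright's Fst for one biallelic locus and two equally sized populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)                 # pooled heterozygosity
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:                                    # locus fixed everywhere
        return 0.0
    return (h_total - h_within) / h_total


print(fst_biallelic(0.5, 0.5))   # identical frequencies: no differentiation
print(fst_biallelic(0.9, 0.1))   # strongly diverged frequencies
print(fst_biallelic(1.0, 0.0))   # alternative alleles fixed: complete differentiation
```

Genome-wide values such as those quoted above are obtained by combining many loci, typically with sample-size corrections that this sketch omits.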
Asian rice has been used as a model food crop for functional genomic research, resulting in the cloning of hundreds of agronomically important genes influencing rice yield, quality, fertility, stress tolerance and efficient use of nutrients [1,65,66]. Population genomic analyses have shown that most of these genes have been widely used in rice improvement, with many of them being used differentially between xian and geng and some favorable alleles remaining underutilized [67,68]. Many differentiated genomic regions have been found to be responsible for environmental adaptation, and genetic exchange of key agronomic genes between subpopulations has contributed greatly to the improvement of modern rice cultivars and heterosis in xian hybrid rice [69,70]. In fact, the long process of rice domestication was accompanied by introgression and transmission of advantageous genes between species or subspecies [30,31,58]. The population-level distribution and utilization of these genes provide helpful information for further rice improvement.
Compared with Asian rice, African rice (O. glaberrima) has low yield but high resistance to drought, heat, salt, iron toxicity, green leafhopper, weed competition, and bacterial blight [71-73]. African rice has lower genetic diversity than its ancestor O. barthii, and is less differentiated than Asian rice [59,73,74]. Four major divisions have been identified in African rice [73]: northwest (NW) and southwest (SW) coastal and northeast (NE) and southeast (SE) inland populations, with possibly some other subpopulations such as floating and non-floating lowland and upland types [75] (Fig. 2a). The extensive genetic diversity between Asian and African rice [76] provides rich genetic resources for rice improvement via genetic exchange between them. For example, a new hybrid of O. sativa and O. glaberrima, called NERICA, combining the stress resistance of African rice with the yield advantage of Asian rice [77], has great potential in Africa to reduce food imports and alleviate hunger [78].
Cultivated rice has lost many agronomically important genes owing to domestication bottlenecks [24,79]. Wild rice species possess many beneficial agronomic traits, such as resistance to stresses (diseases and insect pests, drought, salt and alkali, and high temperature), weed competitiveness, and cytoplasmic male sterility [12,80]. These traits include resistance to bacterial blight in AA wild rice O. rufipogon, O. nivara, and O. longistaminata; resistance to blast in O. minuta, O. rhizomatis, O. longiglumis, O. ridleyi, and O. australiensis; resistance to brown planthopper in O. punctata, O. officinalis, and O. australiensis; salt resistance in O. coarctata; high biomass in O. rufipogon, O. minuta, and O. alta [11,12,81,82]; and early-morning flowering in O. officinalis [83].
Owing to their lack of key agronomic traits, such as erect plant type, loss of shattering, and weak seed dormancy, the wild rices are not suitable for direct use in agricultural production [28]. They have served mostly as a genetic resource bank (Fig. 1), both as targets for functional genomics research and for improving the cultivated rice by the introgression of underlying genes. Several resistance genes originating in wild rice have been cloned, including those against rice blast such as Pi9 from O. minuta [84], Pid3-A4 [85] and OsCERK1DY [86] from O. rufipogon, Pi54rh [87] from O. rhizomatis and Pi54of [88] from O. officinalis, and those for bacterial blight resistance such as Xa21 from O. longistaminata [89], Xa23 from O. rufipogon [90], and Xa27 from O. minuta [91]. However, there are still many uncloned genes such as Xa30(t) and Xa-32(t) from O. rufipogon, Xa29(t) from O. officinalis, and dozens of Bph genes associated with brown planthopper resistance from O. australiensis, O. officinalis, O. latifolia, O. minuta, and O. rufipogon [92]. Recent studies [20,24-26] showed that several wild rice genomes contained thousands of genes not present in Asian rice, greatly expanding our knowledge of the genetic diversity among Oryza species and genome types.
Fig. 1. Phylogenetic tree (adapted from reference [1] with permission from Springer Nature), genome assembly, and important agronomic traits of Oryza species. Arrows indicate the origins of tetraploids. The corresponding species in open circles (KK, JJ) either are extinct or have not been found [57]. Some beneficial traits of the species [81,82] are listed in the last column. CMS, cytoplasmic male sterility; R-blast, resistance to blast; R-BPH, resistance to brown planthopper; R-GLH, resistance to green leafhopper; R-WBPH, resistance to whitebacked planthopper; R-BB, resistance to bacterial blight; R-Shb, resistance to sheath blight; R-RYMV, resistance to rice yellow mottle virus; HT, heat tolerance; DT, drought tolerance; FT, flooding tolerance; CT, cold tolerance; ST, salt tolerance.
Fig. 2. Population structure and wild ancestry of cultivated rice. (a) Arrows indicate the direction of gene flow and the origin of domestication. The two kinds of cultivated rice are derived from different wild ancestors and were independently domesticated from their ancestors, and the gene flow from geng to xian was greater than that from xian to geng in Asian rice [24,58,64]. (b) The de novo domestication of O. alta [26].
Besides individual genes that can be used for rice improvement, several wild rice species possess the genetic feature of polyploidy, which in many crops confers advantages over their diploid relatives [93,94]. Two such species are O. coarctata, with genome type KKLL, grown in coastal areas of South Asia with high saline-alkali resistance [95], and the high-biomass O. alta with genome type CCDD from South America [26]. An O. alta accession has been successfully de novo domesticated by altering several agronomic traits: seed shattering, awn length, plant height, grain size, stem thickness, and heading date [26] (Fig. 2b). This achievement offers promise of a tetraploid cultivated rice with superior yield, environmental adaptability or resilience, and stress resistance. Population genomic analysis [26] showed higher genetic diversity in wild rice CCDD populations than in cultivated rice, which offers great potential for future improvement of the domesticated O. alta.
In the past two decades, many rice genomes have been assembled to various levels of quality, from draft-genome to near-complete high-quality reference-genome sequences (Table S1), with the use of multiple sequencing and assembly technologies. After the release in 2005 of the Nipponbare (Nip) reference sequence from the International Rice Genome Sequencing Project, based on Sanger sequencing of bacterial artificial chromosomes [96], high-throughput short-read sequencing (Illumina, San Diego, CA, USA) and single-molecule real-time (SMRT) long-read sequencing (Pacific Biosciences (PacBio), Menlo Park, CA, USA) have been used to improve the existing Nip genome [17] and to generate many new rice genome sequences, including xian IR64 [35], Zhenshan 97 and Minghui 63 [21], and Shuhui 498 [22]; aus DJ123 [35] and N22 [18]; and African rices CG14, TOG5681, and G22 [59,97]. These new genome sequences are mostly fragmented or incomplete, with hundreds or thousands of gaps. The genome sequence of the xian cultivar Shuhui 498 (R498), released in 2017 [22], contained only five gaps (in the pericentromeric regions of five chromosomes), with a contig N50 of 25.58 Mb, demonstrating that near-complete plant genomes can be assembled using SMRT sequencing.
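Contig N50, the contiguity measure quoted above, is the length of the contig at which the cumulative length of contigs sorted from longest to shortest first reaches half of the total assembly length. A minimal sketch follows; the function name and the contig lengths are illustrative.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)   # longest contigs first
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:              # reached half of the assembly
            return length
    return ordered[-1]                        # not reached for positive lengths


# Toy assembly: 290 units in total, so the half-way point is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(contig_n50(toy_lengths))
```

A higher N50 indicates that more of the assembly lies in long contigs, but it says nothing about correctness, which is why gap counts and completeness scores are reported alongside it.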
Modern genome assembly exploits the increased read length and/or base accuracy of single-molecule sequencing [98,99], combined with BioNano genome maps (BioNano Genomics, San Diego, CA, USA) and high-throughput chromosome conformation capture (Hi-C) sequencing. The combination of these technologies, along with the development of new assembly tools including CANU [100] and HERA [101], has greatly improved assembly quality, generating highly contiguous chromosome-level assemblies at low cost. The launch of HiFi sequencing by PacBio in 2019 represents a breakthrough in sequencing technology [98]. With an accuracy of above 99% and a mean length of 13.5 kb based on the circular consensus sequencing model [98], genomes can now be resolved with unprecedented accuracy and speed. PacBio HiFi sequencing, either used independently or combined with Nanopore ultra-long sequencing, has enabled the completion of half- or full-length human chromosomes [99,102]. A xian 93-11 genome sequenced with PacBio HiFi reads and Nanopore (Oxford Nanopore, Oxford Science Park, UK) ultra-long reads was assembled to very high continuity, with the contig N50 value reaching 32 Mb [27]. Many high-quality rice genomes have been assembled since the release of R498, including 12 genome sequences with a mean gap number of 18 per genome and a mean completeness of up to 98.75% [23], and 31 genomes with a mean contig N50 of 12.89 Mb and a mean gap number of 63 [103]. These genomes have furthered the study of population diversity and of gene identification via accurate gene mapping or genome-wide association studies (GWAS) [14,104].
In the past decade, several wild rice draft genomes have been assembled with various technologies, including the BB genome of O. punctata [18], CC of O. officinalis [105], FF of O. brachyantha [20], GG of O. granulata [106], KKLL of O. coarctata [107], and six AA genomes (O. rufipogon [18,109], O. nivara [18,108,109], O. barthii [18,108,109], O. glumaepatula [108], O. longistaminata [18,108], and O. meridionalis [18,108,109]). In particular, the recent assembly of a high-quality chromosome-scale AA genome (contig N50, 13.2 Mb) from a highly heterozygous O. rufipogon accession [24] and the assembly of the heterozygous allotetraploid O. alta CCDD genome (contig N50, 18.2 Mb) [26] with completeness comparable to that of the cultivated rice genome showed that all wild rice genomes are ready to be assembled with SMRT PacBio sequencing, BioNano genome mapping and Hi-C sequencing, and the combination of CANU and HERA software [26].
Pan-genomic analyses have identified many sequences and genes not present in the Nip reference genome, and dispensable genes not present in all genomes. For example, an early study comparing the draft genomes of xian IR64 and aus DJ123 with the reference Nip identified 1300 new genes (absent from Nip) and 3144 dispensable genes, including many genome-specific genes associated with disease resistance [35]. By use of 1-3× low-coverage short-read sequencing data from 1483 accessions, more than 8000 coding genes were discovered that were absent in Nip [36]. Large-scale pan-genomic analysis such as a study of 3010 rice genomes [30] identified 268 Mb of new sequences, 12,465 full-length new genes, and 19,721 dispensable genes. Using de novo assembled draft genomes of 66 representative cultivated and wild A-genome rice accessions [31], 10,872 new genes that were at least partially missing in Nip and 16,208 dispensable genes were found. The dispensable genome has a higher density of SNPs or InDels with rich functions in immunity, defense response, and ethylene metabolism regulation, whereas the core genome provides basic life functions such as growth, development, and reproduction [30]. Pan-genomic analyses have facilitated gene mapping by GWAS for agronomic traits including grain length, grain width, and bacterial blight resistance [30], and shed new light on the evolution and domestication of rice [31].
Cutting-edge technologies in genomics, synthetic biology, gene editing, and molecular breeding provide great promise for accelerating crop improvement [110-113]. In rice, precision breeding that combines multiple beneficial alleles for targeted improvement of complex traits can generate new rice cultivars with high yield, superior quality, and tolerance to stresses [8,114]. Meanwhile, gene editing has achieved base insertion and deletion, fragment replacement, and site-specific modification [115]. In addition to the de novo domestication of O. alta [26], several genes including BADH2 [116], Wx [117,118], OsERF922 [119], OsALS [120], and OsPAO5 [121] have been edited in Asian rice to improve quality, resistance, or yield. The large-scale application of these technologies for rice improvement requires a comprehensive understanding of the complex genetic architecture underlying agronomic traits. However, current rice pan-genomes include only AA genomes of a few species, which do not cover the majority of genetic diversity present in both AA genomes and the other Oryza species (Fig. 1). There is thus an urgent need to construct Oryza pan-genomes that integrate multiple cultivated and wild rice genomes, for characterizing genomic variations, genetic structures, and evolution of the entire Oryza genus.
In addition to identifying core and dispensable genes, the purpose of pan-genome analysis is to provide a unified reference genome for identifying genetic variation, including SNPs and, in particular, structural variations (SVs) in a population. SVs usually refer to changes in DNA sequence of more than 50 base pairs (bp), including insertions and deletions, presence and absence variations (PAVs), duplications, copy number variations (CNVs), and inversions [122]. Recent studies [123] have shown that, compared with SNPs, SVs have a greater impact on genome polymorphism and evolution, and that SVs can affect many plant agronomic traits such as hybrid sterility, flowering time, fruit size, and resistance [124]. In particular, CNVs and PAVs have been found to affect many rice genes, such as SGDP7 for grain size and grain number [125], Sc for hybrid male sterility [126], Sub1A/SNORKEL1/SNORKEL2 for submergence tolerance [127,128], Pup1 [129] for phosphorus-starvation tolerance, Pikm1-TS and Pikm2-TS/Pi21 for blast resistance [130,131], and DPL1/DPL2/S27/S28 for hybrid male sterility [132,133]. However, rice pan-genomes based on short-read sequencing data have missed most SVs, owing to fragmented assembly [134], especially large SVs and SVs located near highly repetitive sequence regions [135]. For example, a comparison between Nip and R498 revealed a large number of SVs, including an inversion of about 5 Mb that could be identified only in high-quality chromosome-level genome assemblies [22]. The identification of large SVs is important for gene identification because genes around them may be hard to reach by conventional map-based gene cloning approaches.
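As a small illustration of the size threshold mentioned above, the following sketch labels a variant from its reference and alternative alleles, treating length changes of more than 50 bp as SVs. It covers only simple insertions and deletions; duplications, inversions, and CNVs need alignment context that a pair of allele strings does not carry. The function name, labels, and example alleles are assumptions made for illustration.

```python
SV_MIN_LENGTH = 50  # "more than 50 bp", following the definition in the text


def classify_variant(ref: str, alt: str) -> str:
    """Label a variant by the length difference between its two alleles."""
    diff = len(alt) - len(ref)
    if diff == 0:
        # Same length: a single-base change is a SNP, anything longer a block substitution.
        return "SNP" if len(ref) == 1 else "substitution"
    kind = "insertion" if diff > 0 else "deletion"
    if abs(diff) > SV_MIN_LENGTH:
        return f"SV {kind}"
    return f"small {kind}"


print(classify_variant("A", "G"))
print(classify_variant("A", "ACGT"))
print(classify_variant("A", "A" + "T" * 120))
print(classify_variant("A" + "C" * 300, "A"))
```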
Recently, studies in other crops such as soybean [40] and rapeseed [43] have demonstrated the critical role of high-quality pan-genomes in SV analyses that facilitated GWAS for key agronomic traits. There is a need to construct a high-quality rice pan-genome based on high-quality chromosome-level genome assemblies. Such an attempt has recently been made by integrating 33 high-quality genome sequences of cultivated rice [103].
A eukaryotic pan-genome is now commonly defined at the genomic sequence level to include all the DNA sequence of a population, not just the genes, as in a bacterial pan-genome. However, for genus-level Oryza pan-genomes, both a gene-based pan-genome (GPG) and a sequence-based pan-genome (SPG) need to be constructed (Fig. 3). For a collection of germplasm, a GPG includes all annotated alleles and orthologous genes at each locus and an SPG includes all genomic sequences. A full genus-level GPG is required to unify all genes across different genome types and species in Oryza. However, owing to the low sequence similarities except in regions of high collinearity between the different genome types in Oryza, a unified SPG for all Oryza species may not be necessary or practical, though a super-SPG [48] can be constructed on top of the genome-type-specific SPGs (gtsSPG).
The annotated genes in Oryza can be classified as core genes and dispensable genes, the latter including private genes (present in only a few accessions, one species, or one genome type). A GPG contains the sequence variants of genes regulating key agronomic traits that are major targets for functional genomic and evolutionary studies. For example, a study based on the 3K rice pan-genome has found that the analysis of coding sequence-based haplotype (gcHap) sheds light on the history of rice domestication, confirming the domestication hypothesis of multiple origins of Asian rice [58], and also that gcHap-based GWAS is more accurate and reliable than SNP-based GWAS for detecting genes controlling complex traits [68].
An Oryza GPG can tell us which genes/alleles are present in cultivated rice and which genes or alleles are present in wild rice (identifying gene CNV or PAV), supporting the characterization of similarity and variation in these functional genes at the genus level (Fig. 3). There is an urgent need to build a GPG using representative high-quality Oryza genome assemblies. Integration of Oryza GPG with other types of omics data will accelerate the study of gene function, especially for rare alleles and species- or genome-specific genes of wild rice.
An SPG is used mainly as a reference genome, to replace the conventional single-genome reference, for more comprehensive sequence mapping and genotyping of other accessions, and thus support better characterization of sequence variation, population structure and differentiation, and gene mapping. A gtsSPG can reflect the differences present at the DNA sequence level between Oryza genome types. Comparison of the gtsSPGs or of an SPG with a high-quality genome assembly also supports the analysis of synteny, SV and transposon diversity. Such a comparison will comprehensively reveal the genetic diversity of the entire Oryza genus, promoting understanding of the dynamic changes among genomes of each type and the evolutionary process of rice (Fig. 3).
Pan-genomic analysis is especially important for the study of highly variable dispensable genes that are associated with agronomic traits such as disease resistance. Pan-gene family-analysis methods have been used to study the allelic genes and homologous genes of a family in multiple individuals of a species (or multiple related species) [136-138]. For example, NLR genes, which are major components of the plant immune system, are highly variable both within and between species, in coding sequence, copy number, and genome location [136,139]. The numbers of NLR genes range from 237 in an FF genome to 535 in an AA genome [18], and most of them are arranged in clusters. Conventional methods such as map-based cloning and genome-wide association analysis are inefficient for the identification of NLR genes, and mapping short reads to a single reference genome does not readily reveal large SVs and CNVs. An Oryza pan-NLRome, the NLR component of the Oryza pan-genome, could be used for identifying NLR genotypes and allelic or orthologous relationships between various Oryza accessions or species, for characterizing NLR gene polymorphism and evolution, and for identifying related gene resources for improving disease resistance of cultivated rice. Another example is an AA-specific hydroxycinnamoyl tyramine gene cluster that is associated with broad resistance in rice and is highly polymorphic in genomic location, gene orientation and copy number [140]. The construction of high-quality Oryza pan-genomes will facilitate the analysis and exploitation of these complex and highly variable gene families.
The gene information present in Oryza pan-genomes can be used in breeding in two ways. First, wild rice genes in the A genome may be introduced into cultivated rice by backcrossing, and the genes in the other genomes can be transferred by transformation or editing. Second, de novo domestication of wild species has been considered [141-143] as a viable scheme for the future design of an ideal crop. The successful editing of important traits in tetraploid wild rice O. alta laid a solid foundation for future generation of a polyploid rice crop with superior agronomic traits [26]. However, the further improvement of de novo domesticated wild rice requires knowledge of key orthologous or private genes of cultivated rice and their association with phenotypes. This knowledge will guide future precision and gene-editing breeding, as suggested by the use of a recently constructed quantitative trait nucleotide map associated with quantitative traits for breeding guidance [144].
Fig. 3. The two types of Oryza pan-genome. Right, a gene-based pan-genome (GPG), including all alleles and orthologous genes from all Oryza species/accessions. Left, sequence-based pan-genomes (SPGs). Owing to large sequence dissimilarities between genome types, it is not necessary to integrate all sequence variations from all Oryza species into one SPG. An alternative strategy is proposed: to build an SPG for each genome type.
To achieve full coverage of structural variations and complex gene regions such as gene clusters, it is necessary to use high-quality genome assemblies to build Oryza pan-genomes. Currently there are many high-quality O. sativa genome assemblies, with a number of high-quality reference genomes of wild rice being released by I-OMAP (International Oryza Map Alignment Project) [81] in addition to assemblies of O. rufipogon (AA) and O. alta (CCDD). All or some of those genomes can be used for constructing a first version of an Oryza pan-genome. To construct complete Oryza pan-genomes, however, more genomes from each Oryza species should be selected for high-quality genome assembly, based on population genomic study of the species (or genome types).
For pan-genome construction, a set of representative accessions should be selected to cover the full range of genetic diversity in each species [50,53]. Based on studies in other crops such as soybean [40], around 30 representative accessions are needed for each species (or genome type), with particular attention paid to accessions from different geographical and ecological regions with desired agronomic traits, such as high resistance or high biomass. For best quality, these representative accessions can be selected from up to hundreds of different accessions of a species, based on the weight of each subpopulation according to population genomic studies of each species (or genome type), as in the case of soybean.
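One simple way to turn subpopulation weights into a sampling plan is proportional allocation with largest-remainder rounding, sketched below. This is an illustrative assumption, not the procedure used in the soybean study; the subpopulation names and sizes are hypothetical, and in practice the plan would be adjusted by hand to include accessions with desired traits.

```python
from typing import Dict


def allocate_representatives(subpop_sizes: Dict[str, int], total: int = 30) -> Dict[str, int]:
    """Distribute `total` assembly slots across subpopulations in proportion to size."""
    population = sum(subpop_sizes.values())
    if population <= 0:
        raise ValueError("subpopulation sizes must sum to a positive number")
    quotas = {name: total * size / population for name, size in subpop_sizes.items()}
    plan = {name: int(quota) for name, quota in quotas.items()}   # round down first
    leftover = total - sum(plan.values())
    # Hand the remaining slots to the subpopulations with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda name: quotas[name] - plan[name], reverse=True)
    for name in by_remainder[:leftover]:
        plan[name] += 1
    return plan


# Hypothetical subpopulation sizes from a population genomic survey.
sizes = {"subpop_1": 520, "subpop_2": 300, "subpop_3": 130, "subpop_4": 50}
plan = allocate_representatives(sizes, total=30)
print(plan)
```

Small subpopulations can receive very few slots under strict proportionality, so a minimum per subpopulation may be worth adding when rare diversity is the target.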
The next step is to sequence and assemble the genomes of those germplasms. Recent advances in highly accurate single-molecule sequencing (such as PacBio HiFi) and assembly technologies have provided great power to resolve heterozygous diploid and polyploid genomes [98]. Combining these with BioNano genome mapping and/or Hi-C sequencing permits the assembly of a high-quality Oryza genome at low cost. To balance quality against the cost of sequencing all accessions, the representative genomes may need to be sequenced and assembled in stages over the next several years. For example, 2-3 accessions may be completed in the first step, 5-10 others in the next step, and the rest in the final step. We can optimistically expect that all the representative genomes can be sequenced and assembled in several years at a low total cost.
First, accurate gene annotation is needed for GPG construction. Currently the combination of RNA-seq, protein sequence and other evidence-based annotation with ab initio prediction is a standard approach to rice genome annotation [145]. Some care is needed to ensure consistency between genomes and to minimize errors. For example, the same gene annotation pipeline and the same evidence should be used for all genomes. Second, alleles common to the accessions within a species or genome type and genes that are orthologous among genome types must be identified to form GPG loci that consist of functionally equivalent genes (alleles) across all selected genomes in Oryza, aiding in the identification of core and dispensable genes. Thanks to the high collinearity between Oryza genomes, the majority of the alleles and orthologs in Oryza can be identified by collinearity analysis, and the rest can be found as mutually best-matched hits by BLAST searching. Finally, the alleles and orthologs with the same sequence at each locus should be combined to reduce redundancy in the GPG.
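The mutually best-matched criterion can be sketched as follows. The code assumes that BLAST results have already been parsed into (query, subject, bit score) tuples for each search direction; the gene IDs and scores are made up, and real pipelines usually add identity and coverage filters before accepting a pair.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query gene, subject gene, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep, for every query gene, the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return gene pairs that are each other's best hit in both search directions."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy hits between genome A and genome B.
a_vs_b = [("A1", "B1", 900.0), ("A1", "B2", 400.0), ("A2", "B2", 700.0), ("A3", "B2", 650.0)]
b_vs_a = [("B1", "A1", 880.0), ("B2", "A2", 710.0), ("B2", "A3", 600.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case A3 finds B2 as its best hit, but B2 prefers A2, so A3 is left unpaired; such genes are candidates for dispensable or private loci, or for paralogs that collinearity analysis would resolve.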
The main strategy used to build an SPG for each genome type is an incremental approach that starts from a base reference genome and adds, step by step, newly identified sequence variants from the high-quality assemblies of other representative individuals [49,53]. Since there are 11 different genome types in the genus, 11 different SPGs can be constructed, with each representing the sequence variation of one genome type (Fig. 3). For high-quality genomes, genome-wide comparison must be performed to accurately identify SVs, for example using MUMmer [146] and SyRI [147]. To remove redundancy at each locus (based on the positions on the reference genome) with multiple variants, sequences with similarity above a threshold, say, 90%-95%, can be removed from the final SPGs.
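The incremental step with redundancy removal can be sketched as below. Variants are grouped by their locus on the base reference, and a new sequence is kept only if it is less similar than the threshold to every sequence already stored at that locus. The dictionary layout, the locus label, and the use of Python's difflib ratio as a stand-in for alignment identity are assumptions made for illustration; a real pipeline would take identities from whole-genome alignments.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, List, Tuple


def similarity(a: str, b: str) -> float:
    """Crude sequence similarity in [0, 1]; a placeholder for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def add_variants(
    spg: Dict[str, List[str]],
    new_variants: Iterable[Tuple[str, str]],
    threshold: float = 0.90,
) -> Dict[str, List[str]]:
    """Add (locus, sequence) variants to the SPG, skipping near-duplicates per locus."""
    for locus, seq in new_variants:
        kept = spg.setdefault(locus, [])
        if all(similarity(seq, old) < threshold for old in kept):
            kept.append(seq)
    return spg


# Toy example: one locus that already holds one variant sequence.
base = "ACGTTGCAAGCTTAGGCTAA"
near_duplicate = "ACGTTGCAAGCTTAGGCTAC"   # differs from `base` at the last base only
novel = "TTTTTTTTTTGGGGGGGGGG"
spg = {"chr1:1000": [base]}
add_variants(spg, [("chr1:1000", near_duplicate), ("chr1:1000", novel)])
print(spg["chr1:1000"])
```

The outcome depends on the order in which assemblies are added, which is one reason the choice of base reference and of the threshold deserves care.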
After the sequences for an SPG are obtained, they can simply be stored in a file (for example, in FASTA format) as a reference sequence for use by other applications, such as tools that genotype other accessions from short reads. However, it is hard to track the context and coordinates of sequence variants that lie outside the reference genome, leading to problems in sequence alignment and variant identification. Accordingly, a nonlinear sequence-storing genome graph [148], which uses graph nodes and edges to represent sequences and their connections, has been developed to store an SPG, including a reference genome and sequence variants [149].
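The idea of storing sequences as nodes and their connections as edges can be illustrated with a minimal Python sketch. The class below is hypothetical and far simpler than the data structures used by real genome graph tools; it only shows how a reference and a variant can share flanking nodes, and how coordinates along a named path can be recovered.

```python
class SequenceGraph:
    """Toy genome graph: nodes hold sequence, edges connect adjacent nodes."""

    def __init__(self):
        self.nodes = {}     # node id -> sequence
        self.edges = set()  # (from node id, to node id)
        self.paths = {}     # path name -> ordered list of node ids

    def add_node(self, node_id, seq):
        self.nodes[node_id] = seq

    def add_path(self, name, node_ids):
        """Register a walk through the graph and the edges it uses."""
        for a, b in zip(node_ids, node_ids[1:]):
            self.edges.add((a, b))
        self.paths[name] = list(node_ids)

    def path_sequence(self, name):
        """Spell out the linear sequence of a named path."""
        return "".join(self.nodes[n] for n in self.paths[name])

    def node_offsets(self, name):
        """Start coordinate of each node along a named path."""
        offsets, pos = {}, 0
        for n in self.paths[name]:
            offsets[n] = pos
            pos += len(self.nodes[n])
        return offsets


# Hypothetical locus: node 2 is the reference allele, node 3 an insertion allele.
graph = SequenceGraph()
graph.add_node(1, "ACGT")
graph.add_node(2, "G")
graph.add_node(3, "TTTA")
graph.add_node(4, "CCA")
graph.add_path("reference", [1, 2, 4])
graph.add_path("accession_X", [1, 3, 4])
```

Because the insertion allele (node 3) sits between the same flanking nodes as the reference allele, its context is explicit in the graph even though it has no coordinate on the reference path.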
The effective storage of a pan-genome graph provides great potential as a new reference format for population genetics research [34]. Use of a pan-genome graph can lead to better read mapping and SV identification in population genomic studies than use of a single linear reference [40,150,151]. However, there are still limitations to the wide application of this technology, owing mainly to the incompatibility of conventional data-processing tools with genome graphs. Several tools have been developed for the construction and downstream analysis of genome graphs, including VG [152], Minigraph [153], Graphtyper [154,155], HISAT2 [156], and Graph Genome Pipeline [157]. It will be desirable to establish a standardized data file format and to develop more automated tools to help researchers with genome graph construction and analysis [150]. In particular, with the rapid improvement of single-molecule sequencing technologies, the large-scale sequencing of rice germplasms with long reads (such as PacBio HiFi or Nanopore) will likely become feasible over the next few years. Accordingly, these tools should accommodate the genotyping of other accessions using long reads and assembled genomes.
To be accessible to ordinary users lacking bioinformatics expertise, pan-genomic data should be stored in structured databases with powerful query and visualization tools. Currently, the 3K RGP sequence variation information [30] is available in several databases, including RPAN [158], SNP-Seek [159], RFGB [160], RiceVarMap [161], and MBKbase-rice [162]. The pan-genes from 66 rice accessions are available in RicePanGenome [31]. For wild rice data, RiceRelativesGD integrates the genomic data of two cultivated and eight wild rices (six AA, one BB, and one FF), including thousands of genes that are absent from the Nip and R498 reference genomes. The majority of these databases use a single genome as a reference. MBKbase-rice [162], the rice sub-database of an integrated omics knowledge base, uses two reference genomes for genotyping of other accessions and provides a unified set of gene loci as a primitive GPG, currently containing 95,325 annotated loci from three genomes.
For future work, it will be desirable to connect Oryza pan-genomes to the rich data resources of cultivated rice: gene expression, other multi-omics data, and T-DNA libraries. Meanwhile, new query tools with complex data-mining capability and user-friendly web interfaces should be developed to enable users to access the pan-genomic information and the associated multi-omics data.
The abundant genetic diversity in the Oryza species provides great potential for continuous rice improvement. Advances in sequencing technologies have permitted the construction of high-quality rice genome sequences at low cost. However, there is still a wide gap between rice genomic information and breeding. There is an urgent need to construct Oryza pan-genomes, including a genus-level GPG and a set of genome-type-specific SPGs. The pan-genomes will capture all sequence variation and cover all alleles and orthologs of Oryza, and thus fully connect the cultivated and wild rice genomes, laying a foundation for the efficient use of this information in rice research and improvement. Further, a knowledge base that integrates Oryza pan-genomes with accurate functional annotation, multi-omics information, and powerful software tools for data analysis, retrieval, and visualization will make the information accessible to wet-lab researchers and breeders.
CRediT authorship contribution statement
Chengzhi Liang outlined the manuscript, Chao Huang wrote the initial draft, Chengzhi Liang rewrote the manuscript with input from Chao Huang, and Zhuo Chen provided critical comments on the text and helped in designing figures.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Chinese Academy of Sciences “Strategic Priority Research Program” (XDA24040201), the National Key Research and Development Program of China (2020YFE0202300), and the State Key Laboratory of Plant Genomics. The authors also thank Xuehui Huang for his critical reading of the manuscript.
Appendix A. Supplementary data
Supplementary data for this article can be found online at https://doi.org/10.1016/j.cj.2021.04.003.