Lei Jia, Lingjuan Xie, Sangting Lao, Qian-Hao Zhu, Longjiang Fan*
a Hainan Institute of Zhejiang University, Yonyou Industrial Park, Sanya 572025, Hainan, China
b Institute of Crop Science & Institute of Bioinformatics, Zhejiang University, Hangzhou 310058, Zhejiang, China
c CSIRO Agriculture and Food, Black Mountain Laboratories, GPO Box 1700, Canberra, ACT 2601, Australia
Keywords: Rice; Bioinformatics; Genomic data; Database; Tool
ABSTRACT Rice is a major cereal crop and a model species for monocots. Since the release of the first draft rice genome sequences in 2002, considerable progress has been achieved in rice genomic research, thanks to the rapid development and efficient use of bioinformatics methods and tools. In this review, we summarize the progress of rice genome sequencing and other omics studies and introduce the well-maintained bioinformatics databases and tools developed for rice genome resources and breeding. After reviewing the history of rice bioinformatics, we use single-cell sequencing and machine learning as examples to show how bioinformatics integrates emerging technologies and how it continues to develop for future rice research.
Bioinformatics is an interdisciplinary branch of biological science that develops methods and tools for collecting, processing, and analyzing diverse biological data to understand biological function. Bioinformatics played an indispensable role in the release of the first rice genomes in 2002 and has contributed greatly to rice genomics and functional genomics studies ever since. In rice genomics, from comparison of gene sequences to genome assembly, prediction of genome elements, and comparative genetic analysis of rice populations, the achievements would have been unthinkable without bioinformatics algorithms and tools. In rice genetics and breeding, bioinformatics assists in the discovery of genes underlying agronomic traits, prediction of phenotypic outcomes from genotype data, and evaluation of the genetic effects of genetic components. Combined with cutting-edge gene-editing technology, bioinformatics will play an even greater role in synthetic genomics and genome design-based rice breeding in the future.
The term "rice bioinformatics" was coined twenty years ago by scientists at The Institute for Genomic Research (TIGR), USA, to describe analyses of sequence data of rice, a model monocot species [1]. In this review, we summarize the achievements of rice genome sequencing and other omics studies and describe the progress made in developing databases of rice genomic resources and bioinformatics tools for genome mining and breeding. We then use single-cell sequencing and machine learning as examples to illustrate how bioinformatics can be integrated with these state-of-the-art technologies to advance studies of rice biology. We believe that the role of bioinformatics in promoting rice biological research and increasing rice breeding efficiency will be greater than ever before in the digital era of big omics data.
Organized at the International Symposium on Plant Molecular Biology held in Singapore in 1997, the International Rice Genome Sequencing Project (IRGSP) started the process of rice whole-genome sequencing. Using bacterial artificial chromosome (BAC) and bacteriophage P1-derived artificial chromosome (PAC) libraries combined with map-based clone-by-clone shotgun sequencing, the IRGSP committed to completely and accurately sequencing the genome of Oryza sativa ssp. japonica cultivar Nipponbare [2]. Using the whole-genome shotgun (WGS) sequencing approach, Chinese and American researchers completed and published in 2002 the draft genome sequences of 93-11 (O. sativa ssp. indica) and Nipponbare (Fig. 1; Table 1) [3,4]. In 2005, with the joint efforts of researchers from 10 countries, the IRGSP reported the map-based, finished version of the Nipponbare sequence with 95% genome coverage [5]. In the following years, researchers in China, Japan, and the USA corrected and updated the genomes of Nipponbare and 93-11 by sequencing more BAC and PAC clones and adopting new assembly algorithms [6-8].
Fig. 1. Progress in Oryza genome sequencing. The broken line shows the growth of de novo sequencing of the rice genome. The bar graph shows the growth of rice genome resequencing. The orange part represents the number of newly resequenced materials in a given year, and the blue part represents the cumulative number of materials already resequenced.
Table 1. Progress of de novo sequencing and assembly of Oryza genomes.
Since 2010, the adoption of next-generation sequencing (NGS) technology and new assembly algorithms has brought rice genome sequencing and assembly to a new stage. In 2013, using whole-genome resequencing data obtained with the NGS platform, Kawahara et al. [9] updated and released the Nipponbare rice genome (IRGSP-1.0), which has since been widely used as a reference genome in the rice research community. In the same year, using the Illumina HiSeq2000 platform, Chinese researchers constructed a high-resolution physical map based on resequencing 132 core recombinant inbred lines (RILs) derived from the pioneer super hybrid rice LYP9 (from PA64s × 93-11) [10]. Thanks to the high-density physical map, the genome sequences of 93-11 and PA64s were also further improved. In 2014, two groups [11,12] reported the genome assembly of African cultivated rice, O. glaberrima, using different strategies. Later, researchers from Japan, the USA, and India sequenced and de novo assembled the indica varieties IR64 and RP Bio-226 and the aus cultivars Kasalath and DJ123 using NGS technology [13-15].
The emergence of third-generation sequencing (TGS) technologies, such as PacBio Single-Molecule Real-Time (SMRT) sequencing and Oxford Nanopore Technology (ONT), changed the landscape of rice genome assembly. Long reads obtained by TGS can overcome the drawback of NGS short reads, which encounter difficulties in the assembly of complex genome regions consisting of repetitive elements. In addition, methods such as high-throughput chromosome conformation capture (Hi-C) and optical mapping provide scalable and cost-effective means for scaffolding genomes into chromosome-scale assemblies [16]. Based on BAC sequences obtained by PacBio SMRT sequencing, Zhang et al. [17] assembled the genomes of two major elite cultivars of the indica subspecies, ZS97 and MH63. By integrating PacBio SMRT sequencing, mapping data, genetic mapping, and fosmid sequence tags, Du et al. [18] reported a near-complete genome of the indica rice Shuhui498 (R498). The assembled genome covers more than 99% of the estimated genome size and serves as another high-quality reference genome of the indica subspecies. In 2018, Zhang et al. [19] reported the reassembly of the genomes of Nipponbare and 93-11 using PacBio SMRT sequencing. The N50 values of the newly assembled Nipponbare and 93-11 reference genomes reached 16.97 Mb and 9.65 Mb, respectively. In addition to those mentioned above, at least 6 other indica [20-25], 6 japonica [26-30], 1 aus [22], 2 aromatic [31], and 4 African cultivated rice [32,33] genomes have been de novo assembled or reassembled to date (Table 1).
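For readers unfamiliar with the N50 statistic used to compare assemblies, the following minimal Python sketch shows how it is commonly computed: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the total assembly length. The function name and the toy contig lengths are hypothetical and are not taken from any of the assemblies cited here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)   # longest contig first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # This contig is the one that crosses the 50% mark.
            return length


# Hypothetical toy assembly: total length 300, so the 50% mark is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches the mark, so N50 is 70
```

A larger N50 indicates that more of the assembly is contained in long contigs, which is why long-read assemblies typically report much higher values than short-read assemblies.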
Relatives of rice can provide crucial genetic traits (such as resistance to biotic and abiotic stresses) for rice breeding [34]. To promote Oryza research and improvement in rice breeding, rice relatives including wild rice O. brachyantha [35], O. barthii [12,22,33], O. nivara [12,22,33], O. meridionalis [12,22,36], O. glumaepatula [12], O. longistaminata [37-40], O. rufipogon [22,33,36,40-43], O. punctata [22], O. granulata [44,45], weedy rice O. sativa f. spontanea [46], Zizania latifolia [47], Leersia perrieri [22], and Echinochloa crus-galli [48,49] have also been sequenced and assembled.
NGS and the development of corresponding bioinformatics analysis methods have paved the way for sequencing-based high-throughput genotyping of mapping populations and organisms [50]. More than 23 populations (over 20,000 accessions) have been resequenced to date (Fig. 1; Table 2). Here we briefly introduce some of these populations; a complete list of the resequenced rice populations is given in Table 2. In 2009, Huang et al. [50] developed the first high-throughput method for genotyping recombinant populations and constructed a genetic map using 150 rice recombinant inbred lines. In 2010, Xie et al. [51] described a method for constructing ultrahigh-density linkage maps composed of high-quality SNPs based on low-coverage genome sequencing of recombinant inbred lines. In 2012, using genome resequencing of 446 O. rufipogon accessions and 1083 cultivated rice accessions, Huang et al. [52] found that japonica rice was first domesticated from a specific population of O. rufipogon around the middle area of the Pearl River in southern China, and that indica rice was subsequently developed from crosses between japonica rice and local wild rice. In 2014, Li et al. [53,54] reported the resequencing of a core collection of 3000 rice accessions from 89 countries, a resource that can serve as a foundation for large-scale discovery of novel alleles for desirable rice phenotypes. In 2015, using Huanghuazhan as a recurrent parent and eight diverse elite indica lines as donors, Zhu et al. [55] developed an interconnected breeding population and resequenced 497 lines to identify stably expressed QTL for cold tolerance at the booting stage. In 2016, Huang et al. [56] resequenced 10,074 F2 lines from 17 representative hybrid rice crosses and used the results to dissect the genomic architecture of heterosis and provide guidelines for rice hybrid breeding. More recently, Li et al. [57] reported the resequencing of a collection of 1275 rice accessions consisting of widely planted cultivars and parental hybrid rice lines from China. Their genotypic analysis of agronomically important traits revealed that many favorable alleles are underused in elite accessions. Other diverse rice populations, for example, 1495 elite hybrid rice cultivars [58] and 524 global weedy rice accessions [59], have also been resequenced, providing valuable genomic resources for future functional genomics studies and rice breeding.
Table 2. Progress of rice population resequencing.
To fully explore the genetic diversity of the rice genome, researchers have gone beyond assembling single genomes and begun to construct rice pan-genomes (comprising the core and dispensable genomes). In 2015, using a metagenome-like assembly strategy, Yao et al. [74] constructed a dispensable rice genome based on low-coverage resequencing of 1483 cultivated rice accessions, representing the first rice pan-genome. In 2018, by deep sequencing of 66 divergent rice accessions and de novo assembly of the resulting NGS short reads, Zhao et al. [75] constructed another rice pan-genome. In the same year, based on the 3000 Rice Genomes Project (3KRG), Wang et al. [54] constructed a third rice pan-genome and identified more than 10,000 novel full-length protein-coding genes and a large number of presence/absence variations. The sequence variants identified by intergenomic comparisons are expected to promote diverse genomics and functional genomics studies in rice. Combining PacBio SMRT long reads, NGS short reads, and Bionano optical maps, Zhou et al. [76] de novo assembled 12 near-gap-free reference genomes of representative cultivated Asian rice, providing a platinum-standard pan-genome resource.
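The distinction between the core and dispensable genomes can be illustrated with a toy gene presence/absence table. The Python sketch below is not the method of any study cited here; the accession names, gene names, and the core_fraction threshold are hypothetical. It simply labels a gene as core when it is present in at least a chosen fraction of accessions and as dispensable otherwise.

```python
def split_pan_genome(genes_by_accession, core_fraction=1.0):
    """Split a pan-genome into core and dispensable gene sets.

    genes_by_accession maps an accession name to the set of genes
    detected in it. core_fraction is an assumed threshold: a gene is
    "core" if it occurs in at least this fraction of accessions.
    """
    n_accessions = len(genes_by_accession)
    if n_accessions == 0:
        raise ValueError("need at least one accession")

    # Count in how many accessions each gene is present.
    counts = {}
    for genes in genes_by_accession.values():
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1

    core = {g for g, c in counts.items() if c / n_accessions >= core_fraction}
    dispensable = set(counts) - core
    return core, dispensable


# Hypothetical presence/absence calls for three accessions.
toy = {
    "acc1": {"g1", "g2", "g3"},
    "acc2": {"g1", "g3"},
    "acc3": {"g1", "g4"},
}
core, dispensable = split_pan_genome(toy)
print(sorted(core))         # ['g1']
print(sorted(dispensable))  # ['g2', 'g3', 'g4']
```

Lowering core_fraction (for example to 0.95) gives a more tolerant definition of the core genome, which may be useful when presence calls come from low-coverage data; whether and how such a tolerance is applied depends on the individual study.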
Table 3. Specific databases containing rice genome resources.
The first large-scale expressed-sequence tag sequencing in rice was performed around 2002, when the first rice genome was sequenced. In 2003, Kikuchi et al. performed the first large-scale full-length cDNA sequencing in rice [77]. Later, rice transcriptome sequencing (RNA-Seq) studies based on NGS were reported by two groups [78,79] at the same time and revealed not only many alternative splicing events but also novel transcriptionally active regions. Since then, many transcriptomic studies based on high-throughput sequencing have been performed in rice. Their details are beyond the scope of this review.
The epigenome refers to genome-wide modifications that do not change the DNA sequence but affect gene expression, including DNA methylation, histone modification, and chromatin remodeling. As a model plant, rice has contributed much to our understanding of the effect of epigenomic changes on plant development and physiology [80-88]. A recent study generated epigenomes for 20 rice cultivars [89], providing a reference for epigenomic research in rice and other plants. For further details of rice epigenomics studies, readers are referred to recent reviews [90-97].
With advances in sequencing technology and bioinformatics methods, rice genomics studies have stepped into the big-data era. However, as in other crops, acquisition of large-scale phenotypic data has become a bottleneck in rice breeding and genomics studies. Recent advances in high-throughput image acquisition and the development of bioinformatics tools for image analysis have made high-throughput phenotyping possible [98]. The following are representative advances in high-throughput phenotyping achieved in 2020: (1) Teramoto et al. [99] developed a promising approach for rapid quantification of the underground distribution of rice roots from trench profile images employing a convolutional neural network, a deep learning model; (2) Liu et al. [100] used the Scale-Fusion Counting Classification Network (SFC2Net), which integrates deep learning and advanced computer vision technology, to achieve accurate rice yield estimation; (3) Conrad et al. [101] developed diagnostic tools for identifying rice plants that are infected by Rhizoctonia solani (the cause of rice sheath blight) but remain asymptomatic, based on spectral profiles generated using near-infrared spectroscopy combined with machine learning.
To better organize and use the large-scale genomic data generated by rice genome studies, diverse databases have been constructed. Based on their contents, aims, and scope, they can be broadly divided into two types: comprehensive and rice-specific.
Comprehensive databases contain multiple types of data for multiple species including rice. GenBank, maintained by the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/), is a well-known comprehensive database that provides a large collection of biological information and data; Ensembl Plants (https://plants.ensembl.org/) integrates tools for visualizing, mining, and analyzing plant genomics data; NGDC-GWH (https://bigd.big.ac.cn/gwh/), maintained by the China National Center for Bioinformation, is a public repository housing genome-scale data for a wide range of species; Phytozome (https://phytozome-next.jgi.doe.gov/) hosts rich plant genomics resources; Gramene (http://www.gramene.org/) hosts comparative plant resources; and the Genomes OnLine Database (GOLD) (https://gold.jgi.doe.gov/) is a resource for comprehensive access to information regarding genome sequencing projects.
Some databases are rice-specific, with a focus on rice genomic resources including genome sequences, genome annotations, and genome variations (Table 3). The Rice Annotation Project Database (RAP-DB) and the Rice Genome Annotation Project (MSU-RGAP) database are two well-known databases that provide genome annotation resources for the first rice reference genome, Nipponbare [9,102]. The genome annotations in RAP-DB have been continuously updated. Based on the indica reference genomes ZS97 and MH63, Song et al. developed the Rice Information Gateway (RIGW) database, which hosts multi-omics data including genomics, transcriptomics, and protein-protein interactions [103]. Based on Nipponbare and a high-quality indica reference genome, R498, the Molecular Breeding Knowledgebase (MBKBASE) integrates rice germplasm information, population sequencing data, phenotypic data, and various other genomics data sets [104]. Information Commons for Rice (IC4R) [105] and Rice Genome Hub [106] provide comprehensive resources dedicated to integrating multi-omics data for rice. In 2020, Zhang et al. [107] developed a species-specific epigenomic database, eRice (an Epigenomic & Genomic Annotation Database for Rice), to facilitate efficient epigenomic studies in rice. Rice Pan-genome Browser (RPAN) and Rice-PanGenome are two databases that provide resources and tools for rice pan-genome analysis [75,108]. OryzaGenome is a genome-diversity database of wild Oryza species, hosting genomic resources for the genus Oryza [109]. The Rice Relatives Genomic Database (RiceRelativesGD) hosts gene and genomic resources of 12 rice relatives from Poaceae [34].
Fig. 2. Tree of genomic databases and bioinformatics tools for rice scientists. The tree lists representative databases or tools for different research goals and needs.
Fig. 3. Milestones of rice bioinformatics: genome, data mining, databases, and tools.
In addition to the genome databases described above, other databases focusing on rice genomic variation, gene expression, gene function, and mutations are listed in Table 3. Besides providing genomic variation information, Rice Variation Map (RiceVarMap), Rice SNP-Seek Database, and Rice Functional Genomics & Breeding (RFGB) also provide information about functional annotation of variation, phenotype, and rice cultivars [110-112]. SnpReady for Rice (SR4R) and HapRice provide haplotype map (hapmap) SNPs and haplotype information for rice [113,114]. Rice Transposons Insertion Polymorphism Database (RTRIP) and Rice TE Database (RiTE DB) provide resources for rice transposable elements [115,116]. Rice Functionally Related gene Expression Network Database (RiceFREND), Rice Expression Profile Database (RiceXPro), and Rice Expression Database provide data such as gene expression profiles and co-expressed genes [117-119]. Rice Mutant Database (RMD), KitaakeX Mutant Database (KitBase), and the PGSB Oryza sativa database (MOsDB) describe rice mutants and mutations [120-122].
For biologists and breeders, the bioinformatics tools integrated in databases or hosted on independent websites afford an excellent platform for analyzing large data sets, enabling data-driven discoveries [123]. Web-based bioinformatics tools enable researchers to perform desired analyses without the need for in-house high-performance computing resources or software development. To help biologists and breeders choose the right tools according to their needs, we review here some of the tools integrated in rice databases and other online tools designed for rice research (Table 3). Among the bioinformatics tools integrated in databases, RAP-DB provides tools for querying and using its resources, including a genome browser, keyword search, BLAST, and an ID converter. MSU-RGAP provides tools including diversified search and retrieval of information about gene co-expression and multiple other types of data. KEGG/GO enrichment analysis and design of sgRNAs for CRISPR-based gene editing can be done with RIGW. HK-TS Gene Finder, provided by IC4R, can be used to select rice tissue-specific and housekeeping genes based on T-value and expression breadth. DNA methylation analysis tools and a 6mA AI predictor are provided by eRice, and haplotype network analysis tools are provided by RiceVarMap. Web tools developed specifically for rice research can also be found at sites other than these databases. Rice Galaxy provides tools for designing SNP assays, GWAS analysis, population diversity analysis, and rice bacterial pathogen diagnostics, as well as a suite of published genomic prediction methods [124]. Rice Diversity provides tools including GWAS viewer, Rice Sub-population Viewer, and Seed Photo Library Viewer. By integrating publicly available data and reviewing publications reporting rice functional genomic studies, Yao et al. [125] developed a comprehensive database, funRiceGenes, including ~2800 functionally characterized rice genes. CRISPR-GE, developed by Xie et al., includes DSDecodeM and MMEJ-KO and is a convenient, integrated toolkit to expedite all experimental designs for CRISPR/Cas9/Cpf1-based genome editing and analysis of the consequent mutations in rice and other plants [126-128]. Hi-TOM, an online tool, can be used to track gene-editing mutations with precise percentages for multiple samples and multiple target sites [129]. Another gene-editing tool, the CRISPR applicable functional redundancy inspector (CAFRI-Rice), can be used to find suitable target genes for editing to avoid functional redundancy [130].
A tree depicting the latest available key genomic information, databases, and bioinformatics tools for rice researchers is shown in Fig. 2.
The key developments in rice bioinformatics that have driven advances in rice genomics and functional genomics, such as the rice genome project, sequence data mining, and development of genomic databases and tools, are shown in Fig. 3. When rice sequences and sequence-based molecular markers became available at the end of the last century, bioinformatics began to be applied to understanding rice biology. Generally, rice bioinformatics is genome-driven: rice genomic studies promoted applications of bioinformatics in rice. In the first 16 years (1989-2005), many bioinformatics efforts were devoted to reading rice organelle and nuclear genomes; for example, algorithms and tools for rice genome assembly and gene prediction were developed. When high-quality reference genomes became available, the genome-oriented databases (MSU-RGAP/RAP-DB) were developed to keep abreast of the increasing demands of the rice community for identifying gene functions. Investigation of rice populations via genome resequencing [50,52,53], the first such effort performed in crops, was enabled not only by NGS but also by algorithms developed specifically for handling the resequencing data generated in rice. Based on the genomic resequencing data, many databases [54,75,105] with integrated tools for investigating genomic variations and the pan-genome have been developed in recent years. Development of tools suitable for mining genomic data has resulted in interesting and important findings about rice genetic diversity and evolution. Bioinformatics efforts on the genomes and databases of rice relatives have made it easier to use these valuable resources in rice breeding, particularly genome-design breeding.
A large body of biological information generated by new technologies has recently emerged. Bioinformatics needs to keep pace with the evolution of these new biological technologies. Here we use single-cell sequencing as an example to show the integration of bioinformatics with data produced by state-of-the-art technologies in rice.
Single-cell sequencing technologies refer to the sequencing of the genome and/or transcriptome of single cells to obtain genomic, transcriptomic, or other multi-omics information with the aim of discovering cell population differences and cellular evolutionary relationships [131]. Depending on the tissues used in sequencing, the data generated can be highly heterogeneous, which challenges bioinformatics analysis. Single-cell sequencing has been applied in studies of rice biology in recent years. In 2017, Han et al. developed a bioinformatics protocol based on single-cell RNA-seq data in rice and investigated allelic expression patterns in mesophyll cells of 93-11 and Nipponbare inbred lines, as well as of their F1 reciprocal hybrids [132]. Based on RNA-seq data of rice mesophyll cells, a standard RNA-seq variant analysis workflow was performed to identify SNPs between indica and japonica rice. Combining information on SNPs and SNP-covering reads, a bioinformatics workflow was developed to classify genes into biallelic, monoallelic, and silenced genes. The single-cell RNA-seq bioinformatics analysis protocol developed in this study offers an excellent opportunity to investigate the origins and prevalence of monoallelic gene expression in plant cells. In another study, based on single-cell sequencing and Hi-C, Zhou et al. developed a high-resolution in situ Hi-C approach and analyzed individual nuclei isolated from rice eggs, sperm cells, unicellular zygotes, and shoot mesophyll cells to compare three-dimensional (3D) chromatin organization and dynamics before and after fertilization in rice [133]. Their results revealed specific 3D genome features of rice gametes and the unicellular zygote, and provided a spatial chromatin basis for zygotic genome activation and epigenetic regulation in rice. More recently, single-cell sequencing has been used to study the cell biology of rice plants growing in various environmental conditions. Based on single-cell RNA sequencing, Wang et al. [134] developed a bioinformatics pipeline to identify major cell types and reconstruct their developmental trajectories. Their analysis found that abiotic stresses not only affect gene expression in a cell-type-specific manner but also affect the physical size of cells and the composition of cell populations.
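The idea of classifying genes as biallelic, monoallelic, or silenced from SNP-covering reads can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the published protocol of Han et al.: the function name, the minimum read count (min_reads), and the dominance cutoff (mono_fraction) are hypothetical values chosen for the example.

```python
def classify_allelic_expression(indica_reads, japonica_reads,
                                min_reads=10, mono_fraction=0.95):
    """Classify one gene in one hybrid cell from allele-specific read counts.

    indica_reads / japonica_reads: reads covering SNPs that carry the
    93-11 or Nipponbare allele, respectively. Both thresholds are
    assumptions made for this sketch.
    """
    if indica_reads < 0 or japonica_reads < 0:
        raise ValueError("read counts must be non-negative")

    total = indica_reads + japonica_reads
    if total < min_reads:
        # Too few SNP-covering reads: treat the gene as not expressed.
        return "silenced"

    major_share = max(indica_reads, japonica_reads) / total
    if major_share >= mono_fraction:
        # Nearly all reads come from a single parental allele.
        return "monoallelic"
    return "biallelic"


print(classify_allelic_expression(0, 0))    # silenced
print(classify_allelic_expression(50, 0))   # monoallelic
print(classify_allelic_expression(30, 25))  # biallelic
```

In practice the thresholds would need to account for sequencing depth and technical dropout in single-cell data, so the cutoffs shown here should be read as placeholders.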
Machine learning is the study of computer algorithms that improve automatically through experience. During the past decade, machine learning has seen a surge of use in multiple research domains, producing state-of-the-art results in tasks that were previously assumed to be difficult for computers to solve [135]. Machine learning algorithms have played an important role in the development of bioinformatics tools used in sequence alignment, genome assembly, ab initio gene prediction, and other tasks. In recent years, machine learning has been applied in a wide range of rice research fields, including prediction of genomic elements and phenotypic outcomes and estimation of yield. DNA N6-adenine methylation (6mA) is an epigenetic modification in prokaryotes and eukaryotes. Identifying 6mA sites in the rice genome is important for understanding epigenetic regulation in rice [136]. Based on a machine-learning algorithm, an integrative bioinformatics framework for predicting 6mA sites in the rice genome (SDM6A) was developed by Basith et al. [136]. Evaluation based on average accuracy and the Matthews correlation coefficient shows that SDM6A has robust performance. In the same year, a machine learning-based bioinformatics algorithm, QTG-Finder, was developed to rank causal genes associated with a quantitative trait locus (QTL), thereby facilitating the identification of the causal gene of a QTL that controls a trait of interest [137]. Based on image classification using convolutional neural networks (CNNs), Desai et al. [135] proposed a bioinformatics method to estimate the heading date of rice by counting regions containing flowering panicles in ground-level RGB images. Based on the genomic data and phenotypic traits of the 3KRG project, Grinberg et al. [138] evaluated the performance of machine learning in phenotype prediction. For almost all the phenotypes considered, standard machine learning methods outperformed methods of classical statistical genetics. Using machine learning technologies, Keratiratanapruk et al. [139] developed a rice seed classification process for an automatic grading machine. Despite these achievements, the use of machine learning in rice biological studies is still in its infancy, and it is expected to have an enormous influence on both basic and applied studies of rice biology in the future.
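Accuracy and the Matthews correlation coefficient (MCC), the two measures mentioned above for evaluating a binary predictor such as a 6mA site classifier, can both be computed from the four cells of a confusion matrix. The short Python sketch below uses the standard definitions; the function name and the example counts are hypothetical and do not reproduce any reported result.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary classifier.

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives, and false negatives.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Hypothetical counts: a perfect predictor and a coin-flip predictor.
print(accuracy_and_mcc(50, 50, 0, 0))    # (1.0, 1.0)
print(accuracy_and_mcc(25, 25, 25, 25))  # (0.5, 0.0)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the positive and negative classes are unbalanced, which is one reason the two measures are often reported together.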
Bioinformatics is now essential for almost all aspects of rice research, from genome sequencing to interpreting the biological implications of all types of omics data and to genomic selection-based breeding. We expect bioinformatics to play a greater role in future basic and applied biological studies, not only in rice but also in all other plants. However, as a highly interdisciplinary branch of biology, bioinformatics itself also faces challenges. These include how to integrate a variety of databases for efficient use by biologists with or without advanced computing skills, and how to integrate multi-omics data with phenotypic data to inform and guide genomic selection and whole-genome design breeding to speed up improvement of both rice grain yield and quality. Tackling these challenges requires a higher level of integration of bioinformatics with other disciplines, particularly deep learning.
CRediT authorship contribution statement
Longjiang Fan: Conceptualization, Data curation, Investigation, Resources, Funding acquisition, Supervision, Writing - review & editing. Lei Jia: Data curation, Investigation, Resources, Writing - original draft. Sangting Lao: Data curation, Investigation, Resources. Qian-Hao Zhu: Writing - original draft.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (31971865), Zhejiang Natural Science Foundation (LZ17C130001), the Innovation Method Project of China (2018IM0301002), and the Jiangsu Collaborative Innovation Center for Modern Crop Production.