Jihong Hu,Tao Zeng,Qiongmei Xia,Liyu Huang,Yesheng Zhang,Chuanchao Zhang,Yan Zeng,Hui Liu,Shilai Zhang,Guangfu Huang,Wenting Wan,8,Yi Ding,Fengyi Hu,*,Congdang Yang*,Luonan Chen,4,9,*,Wen Wang,8,*
1 State Key Laboratory of Genetic Resources and Evolution,Kunming Institute of Zoology,Chinese Academy of Sciences,Kunming 650223,China
2 State Key Laboratory of Hybrid Rice,College of Life Sciences,Wuhan University,Wuhan 430072,China
3 CAS Key Laboratory of Systems Biology,Center for Excellence in Molecular Cell Science,Institute of Biochemistry and Cell Biology,Shanghai Institutes for Biological Sciences,Chinese Academy of Sciences,Shanghai 200031,China
4 Institute of Brain-Intelligence Technology,Zhangjiang Laboratory,Shanghai 201210,China
5 Institute of Food Crop of Yunnan Academy of Agricultural Sciences,Kunming 650205,China
6 School of Agriculture,Yunnan University,Kunming 650500,China
7 BGI-Baoshan,Baoshan 678004,China
8 Center for Ecological and Environmental Sciences,Northwestern Polytechnical University,Xi’an 710072,China
9 School of Life Science and Technology,ShanghaiTech University,Shanghai 201210,China
Abstract Significantly increasing crop yield is a major and worldwide challenge for food supply and security.It is well-known that rice cultivated at Taoyuan in Yunnan of China can produce the highest yield worldwide.Yet,the gene regulatory mechanism underpinning this ultrahigh yield has been a mystery.Here,we systematically collected the transcriptome data for seven key tissues at different developmental stages using rice cultivated both at Taoyuan as the case group and at another regular rice planting place Jinghong as the control group.We identified the top 24 candidate high-yield genes with their network modules from these well-designed datasets by developing a novel computational systems biology method,i.e.,dynamic cross-tissue(DCT)network analysis.We used one of the candidate genes,OsSPL4,whose function was previously unknown,for gene editing experimental validation of the high yield,and confirmed that OsSPL4 significantly affects panicle branching and increases the rice yield.This study,which included extensive field phenotyping,cross-tissue systems biology analyses,and functional validation,uncovered the key genes and gene regulatory networks underpinning the ultrahigh yield of rice.The DCT method could be applied to other plant or animal systems if different phenotypes under various environments with the common genome sequences of the examined sample.DCT can be downloaded from https://github.com/ztpub/DCT.
KEYWORDS Dynamic cross-tissue(DCT);Systems biology;RNA-seq;Ultrahigh yield;Rice
Utilization of the heterosis of hybrids was reported to increase the rice yield by 15%-25% during past decades in China [1].Recently,based on the proportion of the national rice area represented by each location and rice cropping system,the national estimates of the potential rice yield in China are 6.8-9.8 metric tons per hectare(t·ha-1)whereas the farm yields range from 5.2 to 8.8 t·ha-1[2].However,rice breeding is now confronted with the challenge of overcoming the yield plateau[3].Interestingly,Taoyuan of Yunnan in China is a wellknown place where the highest rice yield in the world was recorded with an average rice yield of 13.91 t·ha-1[3,4].Taoyuan is a dry and hot valley of the upstream part of Yangtze River in Yunnan Province,where the temperature difference is large after heading,and the humidity is slightly lower throughout the growing period.Consistently,during 4 years of experimentations at Taoyuan,we observed that the rice yield at Taoyuan is at least 70% higher than that obtained at Jinghong which is a control place located south of Yunnan Province with a similar environment to most rice planting areas under the same cultivation management.Therefore,some unidentified environmental differences could leave their imprint in the epigenome and modify gene expression and regulatory networks [3,4].However,the traditional differential expression analysis only compares the differential gene expression in one tissue,and ignores the gene networks in multiple tissues.To investigate the genes and gene regulatory networks driving such ultrahigh yield observed at Taoyuan,we developed a new dynamic network analysis across tissues and developmental stages to identify candidate genes/networks accounting for the ultrahigh yield.
Integration or meta-analysis is a recently developed approach to study biological multi-tissue transcriptome data[5-9].Non-negative matrix factorization (NMF) is one such methodology,which in particular has the advantage to integrate multi-type high-throughput data,including RNA-seq or microarray data.Thus,NMF has been widely adopted in integration analysis involving heterogeneous data [10].However,the conventional methods usually cannot take these constraints into consideration in a biological context,such as tissue types and developmental stages,which severely limits their effectiveness.To integrate gene expression data across tissues and developmental stages by directly exploiting the biological context,we developed a new computational systems biology method,i.e.,dynamic cross-tissue (DCT) network analysis.DCT is based on the newly proposed jointcorrelation NMF (jcNMF) and differential co-expression networks (DENs) [11,12].Based on the integrative results of the jcNMF calculation,a systematic gene selection approach based on DENs was used to identify the key genes and the key gene modules of high yield with some functional validations(Figure 1).This comprehensive DCT analysis of multiple pairs of tissues across different developmental stages obtained from our field experiments provides a clear and inclusive view of the genes and networks driving the ultrahigh yield of rice at Taoyuan.
Field experiments using rice variety 9311 were conducted in 2010,2011,2013,and 2014 at Taoyuan and Jinghong,Yunnan,China (Figure S1 andTable 1).In the four testing years,the rice variety 9311 consistently showed significantly higher yields (88.91%,74.60%,92.61%,and 78.28% higher,respectively) at Taoyuan than that at Jinghong (Table 1 and Figure S1),whereas the yield at Jinghong (approximately 7 t·ha-1) was comparable to that at other typicalindicarice planting areas [2,13].These results showed that we could consistently obtain an ultrahigh yield of rice at Taoyuan,which is much higher than the gain of hybrid rice with only a 15%-25% increase [3,13].
Because Taoyuan is a dry and hot valley of the upstream part of Yangtze River in Yunnan Province,we selected Jinghong,which is a typicalindicarice planting region,as the control place(Figure S1).We recorded the temperature,humidity and monthly rainfall.And the records showed that Taoyuan has a high temperature before heading,and a low temperature after heading,which results in a high temperature difference,whereas the humidity was slightly low throughout the growing period (Table S1).The rainfall exhibited the largest difference between the two places,but we had good irrigating systems to avoid drought in the plots.We strictly used the same crop management practice,including the same plot area,planting density and use of fertilized nitrogen (225 kg·N·ha-1) at the two places.We set up 3-4 replicates of 15 m2plots and conducted careful phenotyping throughout the growth period.
We carefully dissected the phenotypic differences that might have contributed to the ultrahigh yield,and found that the number of effective panicles,grain numbers per panicles,seed setting rate,and 1000-grain weight all contributed to the ultrahigh yield at Taoyuan (Figure S1 A-E).However,none of the traits showed >70% increases at Taoyuan compared with those at Jinghong.This finding suggests that the ultrahigh yield observed at Taoyuan is a collective result from these traits combined with the underlying gene regulation and indicates that a systematic approach is needed to dissect such a complex trait.Because these four traits are related to tillering,panicle development,and photosynthesis of flag leaves at thegrain filling stage,we collected transcriptome data from the tiller bud,tiller root,young panicle,booting panicle,booting leaf,booting root,and flag leaf from Taoyuan and Jinghong rice,respectively.This study aimed to reveal the internal gene regulatory mechanisms accounting for the ultrahigh yield detected at Taoyuan (Figure 2andFigure 3A).
Table 1 Yield and yield components of rice variety 9311 at Taoyuan and Jinghong in 2010-2014
To systematically identify the key genes across multiple tissues for ultrahigh yield observed at Taoyuan,we developed a novel algorithm (DCT) differing from the traditional expression analysis (Figure 1,Files S1 and S2,https://github.com/ztpub/DCT).The mathematical model used in the DCT analysis utilizes joint correlation information (i.e.,soft constraints on tissue correlations in terms of gene modules) of NMF (jcNMF)instead of the conventional joint value (i.e.,hard constraints on tissue compositions in terms of gene modules) of NMF[9].We showed that the joint correlation in jcNMF can well characterize the associations among tissues from the observed data between the case and control(Figure 3B and C).Furthermore,based on the results from the jcNMF calculation,a systematic gene selection approach based on DENs[11]was used to capture the key genes and the key gene modules of high yield with some validations from the additional field and functional experiments.
The DCT approach maps genotype to phenotype via gene networks (or modules),i.e.,genotypes →networks →pheno types,rather than via directly linking/bridging the genotype and phenotype,i.e.,genotypes →phenotypes in the traditional way.There are 42,145 genes in the rice genome (IRGSP-1.0,http://rapdb.dna.affrc.go.jp/) and we obtained transcriptome datasets from 7 tissues collected in our field experiments.Theoretically,we would obtain a non-negative matrixXof 7 × 42,125 that includes all the raw data from either the case or the control.Using traditional differential expression analysis,a total of 343 differentially expressed genes (DEGs) was identified (Table S2).Therefore,we excluded 42,125 genes without significant expression changes (based on 1.2 fold change on the expression level as the cutoff) between the case and control samples,and thus,4714 DEGs were included inX.Experientially,the threshold of fold change used is 1.2,as a conventional two fold change will be too strict,which can permit more moderate (candidate) DEGs (usually including important transcriptional factors (TFs)) to be considered in the downstream analysis (i.e.,network-based analysis).Because TFs play important roles in gene regulation,we kept 1251 rice TFs from PlantTFDB in the analysis without considering their expression changes[14].In addition,26 genes which have been reported to be related to rice yield until now (Dec 16,2014)were included in the matrix regardless of their expression changes.These 26 genes were identified by other research groups using map-cloning,and their functions in rice yield were very clear (Table S3).In this study,these 26 genes serve as anchors in the co-expression networks to identify other candidate high-yield genes,which are not used for‘‘re-identification”.After removing the redundancies,we retained 5746 genes,which we call feature genes,in matrixX(Figure 2).
The expression data (fragments per kilobase of transcript per million mapped reads (FPKM) values) of these feature genes were grouped into the matrixX(X1for Taoyuan andX2for Jinghong)(Figure 2B).In the first step of the DCT analysis,we factorizedXintoWandHusing our proposed jcNMF,where one column ofWrepresents a developmental gene(co-expression)module or pattern among rice tissue samples,and one row ofHrepresents the tissue-specific gene set of each rice tissue.The computational algorithm for solvingHandWas well as its convergence proof,is shown in the Supplementary material (Files S1 and S2).Different from conventional approaches,the advantage of jcNMF is able to directly represent the biological context,such as the conserved relationships or correlations among gene modules across tissues (i.e.,the conserved tissue correlationsrather than the conserved gene module compositionsW1=W2in the case and control) (Figure 2C).This soft constraint onWwell characterizes the biological and developmental relatedness in the rice samples,which were also supported by real data (see the conservation levels between tissues and genes in Figure 3).Specifically,using the 5746 feature genes,we assessed the conservation levels in the seven tissues and their gene sets between Taoyuan and Jinghong.In Figure 3,the correlations (i.e.,WWT) of the tissues between the Taoyuan and Jinghong samples (Figure 3B and C) were more consistently conserved than those found for the genes(Figure 3D and E).Therefore,in our DCT analysis,we set,whereRis the conserved correlation matrix for tissues obtained from Figure 3B and C,rather than simplyW1=W2as in the traditional methods.Clearly,the hard constraintW1=W2is more restricted than the soft constraintwhich is biologically meaningful based on the observed data (Figure 2).
Figure 1 Flow chart of the identification of key genes using DCT network analysis
Figure 2 Work flow of the DCT network analysis
The second step of DCT is to construct the co-expression networks of genes.We calculated the Pearson’s correlation coefficient (PCC) between two columns/genes of eitherH1orH2.Figure 4provides a schematic of the approach used to obtain the gene set that will be used to construct coexpression networks for rice in one place.A tissue can be best characterized by a gene cluster.For example,the young panicle is best characterized by the sixth cluster in matrixW,which corresponds to the 5746 genes in the matrixH1(Figure 4).We selected those genes with significantly higher weights than the mean of the sixth row,which formed a gene set accounting for the young panicle of Taoyuan.Furthermore,we calculated the correlation coefficients between each pair of genes,and those gene pairs with significant correlation coefficients formed coexpression networks.We conducted the same procedure for the Jinghong samples,and those gene pairs that were included in only one of the networks were used to construct DENs for a certain tissue (Figure 4).The DENs of all the tissues comprehensively accounted for the difference in the gene expression networks between Taoyuan and Jinghong rice throughout the growth process,and they link/bridge the internal gene expression patterns with the ultrahigh yield,i.e.,genotypes →DEN →phenotypes,in terms of the associations.
Finally,the top candidates from the set of ultrahigh yieldassociated genes (or key genes) were selected from the DEN of each tissue.The criterion used for this selection is the rank of the relatedness of a gene with prior-known yield-associated genes (i.e.,theR(x) value;see ‘‘Materials and methods”).To obtain a strong signal of high yield,in this study,we selected genes that were ranked in the top 30 based on theR(x) values(i.e.,based on the cross-tissue co-expressed network structure and state).We selected the top 30 candidate genes with highest differential associations in each tissue (Table S4).Then,24 candidate high-yield genes among the top 30 genes were found in at least four of the seven tissues,but their expression levels were not significantly difference between Taoyuan and Jinghong rice based on the traditional differential expression method(Figure 5A,Tables S2 and S5).In total,112 candidate genes were screened by DCT analysis,and only three DEGs overlapped (Figure 5B).Particularly,nine of 24 candidate high-yield genes were identified in the young panicle using DCT analysis (Figure 5B).
Figure 3 Evaluation of the conservation levels of tissues and genes between Taoyuan and Jinghong samples
The 24 high-yield candidate genes identified using the DCT algorithm showed large association (network) changes but moderate expression changes with fold changes only larger than 1.2.This explains why they would be disregarded by the traditional differential expression analysis method that considers only significant expression changes (mainly fold change >2).Additionally,the key TFs were screened by the DCT analysis to reveal their roles in the regulation network of candidate high-yield genes (Figure S2 and Table S6).Gene ontology (GO) enrichment analysis also showed that these 24 candidate genes were involved in ‘‘nitrogen compound metabolic process”(Figure 5C and Table S7).The DEN of the 24 candidate genes showed that they exhibited more associations with the yield-associated genes at Taoyuan than that at Jinghong(Figure 5D and E,Figure S3).These results further supported that most of these 24 genes might play important roles in the ultrahigh yield of Taoyuan rice,and are probably the key network-hubs controlling the yield.
On the one hand,as the core of the DCT algorithm,jcNMF has a similar ability to that of conventional NMF to capture the local pattern during dimension reduction.Although the analyzed expression data contain seven tissues andXis actually a low-rank matrix,the local pattern (i.e.,tissue conservation) rather than dimension reduction (i.e.,gene filtering)would be the main target using jcNMF.On the other hand,one main merit of jcNMF is to reflect the conserved tissue associations during integrative data analysis based on the proposed soft constraint.To evaluate the efficiency of jcNMF,several typical integration methods have also been applied and compared according to their influence on the tissue associations caused by corresponding data transformations.Simply,the tissue or sample association can be directly shown and compared as hierarchical trees,as shown inFigure 6.
Obviously,jcNMF can reflect or recover the tissue associations based on the Euclidean distance or Pearson’s correlation,e.g.,two panicle samples would be clustered together;and two leaf samples would be also clustered together (Figure 6A).By contrast,all other methods have certain limitations:(i)the conventional NMF-based method (TriNMF) ignores the associations between root and panicle samples after matrix factorization,although leaf samples can be clustered together(Figure 6B);(ii) mixOmics principal component analysis(mixOmics PCA) was applied for feature reduction,but the association between panicle samples were also missed (Figure S4A);(iii) partial least squares-discriminant analysis(PLSDA) is a supervised method but confuses the tissue associations due to data transformation (Figure S4B);(iv) the batch-effect removing approach Combat would change the association between the root or panicle samples after adjusting data variances(Figure S4C);and(v)the pattern fusion method similarity network fusion(SNF)clusters leaf samples well,but it still shows some confusing associations between the root and panicle samples (Figure 6C).
The well-known weighted correlation network analysis(WGCNA) was also used in this study,although the number of samples in this work was actually less than that generally required by WGCNA.The WGCNA results were similar to those obtained with jcNMF,but the former approach still sensitive to the clustering distances used in the analysis (i.e.,the subtree among the bud and root samples showed only a slight change when different cluster distances were used)(Figure 6D).And its detected modules cannot be associated with the casecontrol samples according to the trait-association test.Thus,it will be difficult to perform follow-up gene selection and function analysis under this condition.
Therefore,jcNMF outperforms other existing integrative data analysis methods to maintain the biological context(i.e.,tissue conservation) in the integrative data analysis,and thus,the DCT algorithm is better able to discover downstream genes,modules,networks,and functions.
We further evaluated the node degree of DENs in each tissue to identify the tissue whose DEN showed the most significant change between Taoyuan and Jinghong rice as determined using the matched pairt-test.Interestingly,the young panicle exhibited the lowestPvalue (Figure 7A),indicating that it is the major tissue that causes the greatest changes in the gene co-expression networks for the ultrahigh yield of rice at Taoyuan.For the young panicle,the DEN of the top 30 candidate genes associated with yield-associated genes at Taoyuan and Jinghong was reconstructed (Figure 7B and C).Clearly,there are high associations between our selected candidate genes and prior-known yield-associated genes (Table S3) in the module-based co-expression network.In Taoyuan specific networks,there are many module genes associated with known yield-associated genes LOC_Os09g35980 (TAC1),LOC_Os06g40780 (MOC1) or LOC_Os06g06050 (D3) (e.g.,nodes with large degrees in network visualization),for example,the candidate genes LOC_Os05g41240 (MYB),LOC_Os06g11860 (ERF),and LOC_Os03g28990 (zinc finger)(Figure 7B).Furthermore,the prior-known yield-associated genes LOC_Os09g35980 and LOC_Os06g06050 also exhibited significant associations in Taoyuan rice (Figure 7B).The increased number of associations among candidate genes and yield-associated genes in Taoyuan rice compared with those in Jinghong rice can be considered to have a stronger driving influence on ultrahigh yield (Figure 7B and C).
Figure 4 Identification of gene sets for featuring a tissue
The importance of the young panicle can be further supported by the results of the GO enrichment analysis of the top 30 candidate genes from the young panicles (Figures 5C,S5,and Table S7).Compared with other tissues,the composition of GO enrichments from the young panicle is very similar to that from the previously reported 26 yield-associated genes(Figure S5).Many of the candidate genes in the young panicle are involved in ‘‘nitrogen compound metabolic process”(P=0.000084) (Figure S5 and Table S7).Moreover,nine of the final 24 candidate high-yield associated genes screened from the seven tissues by the DCT analysis were found in the young panicle,and this number was higher than those found in the other tissues(Figure 5B,Tables S4 and S5).Thus,these key genes in the young panicle would not only exhibit expression associations with the known yield-associated genes,but also functional similarity with yield-associated genes.
We firstly used qRT-PCR to validate the changes in the expression levels of eight candidate high-yield associated genes in the young panicle samples of the rice variety 9311,includingOsMADS1(LOC_Os03g11614) and AP2 transcription factor(LOC_Os09g26420) (Figure S6).Interestingly,three MADS box genes,OsMADS1(LOC_Os03g11614),OsMADS57(LOC_Os02g49840),andOsMADS72(LOC_Os03g14850),were identified as candidate high-yield associated genes by the DCT analysis in our study.Previous studies have reported that MADS-box genes encode TFs that are involved in reproductive development,including flowering induction and flower meristems as well as in the regulation of fruit,seed and embryo development [15-17].Our qRT-PCR results consistently showed that the expression ofOsMADS1was up-regulated in Taoyuan rice (Figure S6 and Table S5).AP2 have been reported to be involved in rice starch biosynthesis and the improvement of grain yield under stress [18,19].In our study,AP2(LOC_Os09g26420) was also identified to be a candidate high-yield associated gene by DCT analysis,and its expression level in young panicles was validated by qRT-PCR(Figure S6).These results showed that our transcriptome data are reliable.
Figure 5 Expression analysis and associative networks of the candidate high-yield associated genes screened by DCT analysis
To solidly validate the candidate key genes identified by DCT analysis,we further edited theOsSPL4(LOC_Os02g07780) gene via CRISPR/Cas9.Sequencing analysis of the targeted site revealed a 3-bp heterozygous deletion mutation produced by CRISPR/Cas9 in the T0 plants(Figure 8).We further obtained heterozygous,mutationhomozygous and wild type (WT) plants in the T2 segregation population.We compared the phenotypes between theOsSPL4-edited(both heterozygous and homozygous) T2 lines and WT plants (Figure 8).The plant heights of theOsSPL4-edited lines were slightly increased (Figure 8A),and the Cas9-edited plants exhibited longer panicles and a larger number of grains per panicle than WT plants(Figure 8B-G).Strikingly,for our primary analysis,the yield of these homozygously mutated plants was significantly higher than that of the WT plants (Figure 8G).The expression level ofOsSPL4was down-regulated in Taoyuan rice (Table S5) and the rice variety 9311 at Taoyuan also exhibited a higher grain number per panicle than that at Jinghong,implying thatOsSPL4is a key gene for the ultrahigh yield at Taoyuan(Figure 8 and Table 1).In addition,the association analysis between environmental factors and LOC_Os02g07780(OsSPL4)showed that the expression level of this gene is negatively associated with the average temperature difference but positively correlated with the average relative humidity at Taoyuan (Figure S7C),indicating that this gene is indeed a regulatory factor responding to environments.OsSPL4is an SBP-box gene and previous studies have shown that some SBP-box genes were involved in panicle development and yield in rice[20-23].This study provides the first demonstration thatOsSPL4is a key regulatory gene in the ultrahigh yield of Taoyuan rice,and shows that our DCT is an effective method to identify key genes and networks affecting the formation of a complex trait.
Figure 6 Comparison of the conservation of tissue associations (Euclidean distance or Pearson correlation) using different methods
To our knowledge,this is the first systematic analysis of the multiple tissues of rice across developmental stages,and we attempted to integrate the transcriptomic data with the aim to identify key genes and networks for agronomic traits in plants.The environment affecting the same genome with different characteristics has been widely documented in many organisms,such as twins [24],yet the mechanisms in plants have not been well elucidated.As an interesting case,we observed the ultrahigh yield of rice at Taoyuan,which was found to be at least 70%higher rice yield than that in the control area under the same cultivation management over the 4 years of field experimentation.
Figure 7 The young panicle plays an important role in the ultrahigh yield at Taoyuan
To reveal the internal key genes and their modules underlying the ultrahigh yield at the network level,we developed a dynamic meta-analysis framework across tissues and developmental stages,i.e.,the DCT algorithm with jcNMF,which can construct the associations of tissues and gene modules to a specific phenotype(ultrahigh yield in this study)by integrating gene expression profiles within the biological context.Notably,we identified the gene-modules by conducting the study on the cross-tissue and multi-developmental stages.Based on our model,the gene compositions of those gene-modules were conserved across tissues,but their expression levels were generally different depending on the tissues.Indeed,as a matrix decomposition based approach,jcNMF can not only capture the local pattern from expression data (i.e.,capturing gene modules) in a standard manner,but also maintain the global pattern in a biological context (i.e.,reserving tissue conservations)in a new way,which is implemented as Figure 4 and supported by the comparisons shown in Figure 6.In addition,the follow-up differential network analysis of gene modules can reveal molecular details of key genes at the network level rather than at the expression level,which would be more efficient than the conventional WGCNA method (Figures 5 and 6).Overall,the DCT algorithm is a powerful computational method for cross-tissue biological data analysis.It could be extended to other general integrative analysis by considering various types of fundamental matrix decomposition models and categories of temporal-spatial contexts/constraints.
Figure 8 CRISPR/Cas9 experimental validation of a previously functionally unknown gene (OsSPL4) identified as a key high-yield associated gene by our DCT analysis
Supporting the DCT analysis here,the panicle size and branch number in a panicle are directly associated with the rice productivity.A previous study reported thatOsSPL14which is highly expressed in the young panicle can increase the primary branches of panicle,leading to a high yield in rice[20,21].In the present study,the candidate yield associated geneOsSPL4,which is anotherSPLgene,confirmed that this gene also increases the grain number of panicle and grain yield(Figure 8).TheOsSPL4should be the key gene for ultrahigh yield observed in Taoyuan rice due to its contribution of the large number of panicles and grain number per panicle (Figure 8 and Table 1).In addition,Taoyuan has different climates with large temperature differences after heading and slightly low humidity throughout the growing period (Table S1).In the present study,OsSPL4was identified with significant associations with these environmental factors (Figure S7).
In our systematic study,both of our selected 24 candidate genes and 26 prior-known yield-associated genes were enriched in two significant pathways with the GO terms as ‘‘nitrogen compound metabolic process”and ‘‘nucleic acid metabolic process”.Particularly,many of the candidate genes discovered in the young panicle are involved in ‘‘nitrogen compound metabolic process”(P=0.000084) (Figure S5 and Table S7),implying that these candidate genes are involved in nitrogen metabolism.It has been documented that nitrogen is actually a major driving force for crop yield improvement,and nitrogen absorption and metabolism can affect rice growth and production [25].In our study,one candidate gene LOC_Os11g02480,which encodesWRKY46,was identified as the key gene(Table S5).It has been reported thatOsWRKY46is involved in the iron stress response and the promotion of leaf development,whereas excess Fe may cause yield loss due to leaf bronzing in rice [26].Another study has reported that the expression level ofWRKY46was induced in the rice leaf sheath under N-starvation [27].Therefore,the 24 candidate genes identified in our study should be the key genes for the ultrahigh yield of Taoyuan rice.Further studies on these genes may provide more genetic resources for a high yield of rice.
In summary,this study developed a systems biology approach to identify both the key tissue and high-yield associated genes,and to elucidate the associations between the gene expression network and the ultrahigh yield of rice at Taoyuan.The results shed novel light on our understanding of the genetics of the ultrahigh yield of rice,or even all Gramineae crops.The DENs and key candidate high-yield genes provided rich information to achieve a much higher yield in rice and other Gramineae crops by artificially regulating or perturbing the identified gene networks.In this work,we mainly considered network information for the identification of high-yield genes,and as one future topic,we can further explore dynamic information,such as dynamic network biomarker (DNB) [30-35],from time-course data to improve the approach in terms of effectiveness and efficiency.Importantly,the DCT analysis approach could also be applied to other plant or animal systems if different phenotypes under various environments with the common genome sequences of the examined sample,such as twins or plants exposed to stress conditions.
Field experiments using the rice variety 9311 were conducted in 2010,2011,2013,and 2014 at Taoyuan and Jinghong,Yunnan in China(Figure S1F and Table 1).We chose Jinghong,which is a typicalindicarice plating region,as the control place.The same crop management practice was strictly used,such as the same plot area,planting density and fertilized nitrogen use(225 kg·N·ha-1) at the two places.We set up 3-4 replicates of 15 m2plots and conducted careful phenotyping throughout the growth period.Water,weeds,insects,and disease were controlled because their control is needed to avoid yield loss.
Tiller buds,tiller roots,young panicles,booting panicles,booting leaves,booting roots,and flag leaves of the rice variety 9311 from Taoyuan and Jinghong were collected,immediately frozen in liquid nitrogen and then kept at-80°C.Total RNA of the tissues was extracted and determined using the Nano-Drop ND-2000 system (Thermo Scientific,Waltham,MA),followed by sequencing using an Illumina HiSeq 2500 platform.Raw reads were filtered by in-house Perl script,and then clean reads were used for further analysis.The clean reads were performed using the TopHat and Cuff links package[36,37].The transcript levels were qualified as FPKM generated by Cuff links [36].Then,bioinformatics analysis in this study was conducted,and the work routine is shown in the flow chart as Figure 1.
To integratively analyze the factors affecting the high yield of rice at Taoyuan,we developed a computational algorithm of the DCT network analysis to study multiple tissues and multi-developmental stages of rice as described below (File S1 and File S2).
Using RNA sequencing techniques,transcriptome data analysis produced the numeric matrix of FPKM values of rice genes,whereXdenotes the dataset of Taoyuan or Jinghong,xmndenotes the FPKM value,nis the number of genes,andmis the number of sampled tissues:
The first step of DCT is matrix factorization.Because the FPKM and many other types of biological data are nonnegative,NMF is widely used [38,39]to analyze such data,andWandHare two factor matrices ofX:
When he looked about he found that he was in the very same place he had jumped from; there was the palace, there the garden and the deer! Eight times he leaped over the wall and eight times found himself where he had started from; but after the ninth leap there was a change, there was a palace and there was a garden, but the deer were gone
X=W·H,
where the solution is in the following format:
Here,the biological meaning ofWis the gene-modules (or networks) of samples (or individuals) (i.e.,the developmental gene co-expression patterns among rice tissues),and the biological meaning ofHrepresents the gene-module expressions among samples corresponding to phenotypes (i.e.,the tissuespecific gene-module levels of each rice tissue).Note thatXis the observed data,whereasWandHare unknown variables to be solved.
Many algorithms based on NMF were developed to solveWandH,typically,joint non-negative matrix factorization(jNMF)[9,10,40],which prescribes the sameWof two or more NMF equations as a hard-constraint:
where note thatW1=W2,implying the conserved tissue compositions in terms of gene-modules in case samplesW1and control samplesW2.
In contrast,in DCT,we used a new jcNMF to factorizeXintoWandH:
The advantage of jcNMF is to make the relationships between each lines (tissues) inW1be the same as those inW2,so that tissues more closely related are presented by lines,more similar to each other in eachW.The biological meaning of this soft-constraint is thatWshould be consistent with biological and developmental relatedness.Clearly,the hard constraintW1=W2is stronger (or more restricted) than the soft constraintwhich is biologically meaningful based on the observed data(Figure 3).In this way,jcNMF indicates that the conservation of expression correlation between tissues should be carefully considered rather than the conservation of the expression levels of tissues during the cross-tissue integrative analysis.The jcNMF algorithm resolving this equation and its convergence proof are shown in the File S1 and S2,respectively.
Then,the second step of DCT is to construct the coexpression networks of genes.We calculated the PCC between each two columns/genes of eitherH1orH2:
whereaibelongs to column A,andbibelongs to column B ofH,respectively.σaand σbare the standard deviations of column A and column B,respectively.If two columns/genes are significantly correlated(P<0.05),they are placed in the gene co-expression network.Eventually we constructed the gene coexpression networks ofH1andH2for the case and control samples,namedC(H1)andC(H2),respectively.The difference setsC(H1) -C(H2) andC(H2) -C(H1),namedDiff1andDiff2,respectively,represent the case/control specific gene coexpression network,i.e.,the DEN [11,12,41].
The final step involves the selection of the potentially phenotype-associated genes(or so-called key genes).The main criterion used in this selection is the rank of the relatedness of a gene with prior-known to be phenotype-associated genes(here the phenotype is just the yield of rice):
Sis the set of the known yield associated genes,vis a gene in setS,anduis the gene of interest.We selected genes whoseR(x) value in the top 30 as the final results of the DCT analysis based on integrative consideration of each sample/tissue,and they can be further ranked by the product rank ofR(x),expression fold-change and co-expression network degree.
In addition,to evaluate the structural changes in the tissuespecific co-expression network between Taoyuan and Jinghong,thePvalues of the degree changes of the genes in coexpression network are calculated.For each tissue,given a kind of feature genes (e.g.,TFs,yield-associated genes or DEGs),their degrees in the co-expression network from Taoyuan (i.e.,a degree vector DTaoyuan) should be different from those in the co-expression network from Jinghong (i.e.,a degree vector DJinghong).This difference is evaluated by thePvalue of the matched-pairt-test on two numeric vectors DTaoyuanand DJinghong.The matched-pairt-test to evaluate the degree of differential DENs in each tissue was performed using MATLAB 2012a [42].And the formula for a pairedttest:
whered=sum of the differences in the vector elements.For the young panicle,such degree changes are significantly observed in all three types of feature genes.Particularly,compared with other tissues,significant network-degree changes of the reported yield genes and yield-associated genes in DEN are only observed in the young panicle.Thus,it would be a high priority to further investigate the associations among transcription factors,yield-associated genes and DEGs to identify the candidate key genes driving the ultrahigh yield observed in Taoyuan rice.
Functional annotations of the DEGs and candidate genes were performed to search against the GO database [43].The top 30 candidate high-yield genes of each tissue from DCT network analysis were also analyzed by GO enrichment analysis.The results of GO annotations were submitted to AgriGO for enrichment analysis,and GO terms with corrected FDR<0.05 were considered to be significantly enriched[44].
The expression levels of eight high-yield candidate genes were randomly selected to be verified by qRT-PCR using the same RNAs that was used for RNA-seq [45].Rice geneactin1was used as the internal control for qRT-PCR analysis(Table S8).And then real-time RT-PCR was performed on an ABI StepOne Real-Time PCR System(Applied Biosystems,Foster City,CA)with three replicates using a FastStart Universal SYBR Green Master (Roche,Mannheim,Germany).The relative expression level was normalized and quantified using the 2-△△CT method[46].Significant differences of the expression levels between Taoyuan and Jinghong samples were evaluated using Student’st-test (*,P<0.05;**,P<0.01).
To editOsSPL4,a 20-bp sequence (5′-AGGTGAGGTGC CAGGTGGAA-3′) in the exon of the gene was selected as the target of the guide RNA(gRNA)using the CRISPR-P tool(http://rice.hzau.edu.cn/cgi-bin/rice/CRISPR_rice) [47].Synthetic oligonucleotides containing the target and adaptor sequences were annealed and then subcloned into theAarI restriction sites of the gRNA cloning vector (Table S8).The construct was introduced into theAgrobacterium tumefaciensstrain EHA105 by electroporation,and positive agrobacteria were used to infect rice Nipponbare callus as previously described [48].After the regeneration of plants,the target region was sequenced to screen for mutants,and T2 homozygous,heterozygous and wild type ofOsSPL4lines were identified for phenotyping.
The sequencing data for this project have been deposited in the Genome Sequence Archive [49]at the National Genomics Data Center,Beijing Institute of Genomics,Chinese Academy of Sciences/China National Center for Bioinformation(GSA:(CRA002804),and are publicly accessible at http://bigd.big.ac.cn/gsa.The data are also at the NCBI Sequence Read Archive(SRA:SRP213003).
DCT can be downloaded from https://github.com/ztpub/DCT.
Jihong Hu:Investigation,Methodology,Resources,Writing -original draft preparation.Tao Zeng:Methodology,Software,Writing -original draft preparation.Qiongmei Xia:Investigation,Resources.Liyu Huang:Investigation,Validation.Yesheng Zhang:Validation.Chuanchao Zhang:Software.Yan Zeng:Methodology.Hui Liu:Resources.Shilai Zhang:Investigation.Guangfu Huang:Investigation.Wenting Wan:Investigation.Yi Ding:Validation.Fengyi Hu:Supervision,Validation.Congdang Yang:Resources,Investigation.Luonan Chen:Conceptualization,Methodology,Software,Writing -reviewing &editing.Wen Wang:Conceptualization,Methodology,Supervision,Writing -reviewing &editing.All authors read and approved the final manuscript.
The authors have declared no competing interests.
This work was financially supported by the National Basic Research Program of China (Grant No.2013CB835200),the National Key R&D Program of China (Grant No.2017YFA0505500),the Key Grant of Yunnan Provincial Science and Technology Department (Grant No.2013GA004),the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No.XDB13040700),the National Natural Science Foundation of China (Grant Nos.11871456 and 31771476),the Shanghai Municipal Science and Technology Major Project (Grant No.2017SHZDZX01),and the Open Research Fund of State Key Laboratory of Hybrid Rice (Wuhan University,Grant No.KF201806),China.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2019.11.007.
0000-0002-2512-4056 (Jihong Hu)
0000-0002-0295-3994 (Tao Zeng)
0000-0003-0682-5735 (Qiongmei Xia)
0000-0001-9666-6584 (Liyu Huang)
0000-0001-6143-5161 (Yesheng Zhang)
0000-0003-0690-613X (Chuanchao Zhang)
0000-0001-7930-264X (Yan Zeng)
0000-0003-0578-4062 (Hui Liu)
0000-0001-9791-2468 (Shilai Zhang)
0000-0001-9711-5979 (Guangfu Huang)
0000-0001-9338-3626 (Wenting Wan)
0000-0002-9941-411X (Yi Ding)
0000-0003-4498-9490 (Fengyi Hu)
0000-0002-8478-3748 (Congdang Yang)
0000-0002-3960-0068 (Luonan Chen)
0000-0002-7801-2066 (Wen Wang)
Genomics,Proteomics & Bioinformatics2020年3期