YANG Hai-long, DONG Le, WANG Hui, LIU Chang-lin, LIU Fang, XIE Chuan-xiao
National Key Facility for Crop Gene Resources and Genetic Improvement/Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China
Abstract Phylogenetic trees based on genome-wide single nucleotide polymorphisms (SNPs) among diverse inbreds could provide valuable and intuitive information for breeding and germplasm management in crops. As a result of sequencing technology developments, a huge amount of whole genome SNP data have become available and affordable for breeders. However, it is a challenge to perform quick and reliable plotting based on the huge amount of SNP data. To meet this goal, a visualization pipeline was developed and demonstrated based on publicly available SNP data from the current important maize inbred lines, including temperate, tropical, sweetcorn, and popcorn. The detailed phylogenetic tree plotted by our pipeline revealed the authentic genetic diversity of these inbreds, which was consistent with several previous reports and indicated that this straightforward pipeline is reliable and could potentially speed up advances in crop breeding.
Keywords: phylogenetic tree, SNP, genetic diversity
Maize improvement is highly dependent on the success of a crop breeding program that is supported by access to abundant genetic diversity (Hoisington et al. 1999). One very common way to assess genetic diversity is to construct a phylogenetic tree of maize inbred lines. Such a tree can provide information about genetic distance (Jiao et al. 2012), pedigree background (Liu et al. 2003; Lorenz and Hoegemeyer 2013), and domestication history (Matsuoka et al. 2002), which facilitates hybrid cross planning, inbred line development, heterosis research, and protection of line varieties (Pejic et al. 1998; Mohammadi and Prasanna 2003). Constructing a phylogenetic tree of maize inbred lines requires genome-wide molecular markers. Currently, the most widely used way to evaluate genome-scale information is DNA markers, which have progressed from enzyme-based markers to PCR-based markers and single nucleotide polymorphism (SNP) markers (Schlötterer 2004). A SNP is a variation at a single base in the DNA sequence; typically, one of two alternative bases occurs at a given sequence position (Vignal et al. 2002). The widespread distribution of SNPs in crop plants, such as maize (Gore et al. 2009; Chia et al. 2012), rice (Feltus et al. 2004; McNally et al. 2009; Alexandrov et al. 2015), wheat (Cavanagh et al. 2013; Choulet et al. 2014), and soybean (Lam et al. 2010; Li et al. 2014), enables SNP markers to represent whole-genome information at high density. Furthermore, in recent years, SNP chip arrays and, in particular, affordable next-generation sequencing (NGS) technology have triggered the large-scale, high-throughput sequencing of a huge number of individual crop plants (Davey et al. 2011). In addition, the methods and software for genotype and SNP calling have laid a solid foundation for further analysis of genotyping data (Nielsen et al. 2011).
Owing to these features, SNP genotyping is one of the main methods used to detect genetic diversity among maize inbred lines. To date, application tools for evaluating genetic diversity have provided multiple methods to understand the population relationships among hundreds of individual crop plants. For example, PowerMarker can process multiple DNA markers to cluster plant lines based on statistical methods, such as F-statistics and differentiation tests (Liu and Muse 2005). Structure software uses a Bayesian clustering model to analyze DNA markers such as SNPs to stratify crop populations (Pritchard et al. 2000). PLINK provides complete-linkage hierarchical clustering and multidimensional scaling (MDS) to evaluate plant population diversity (Purcell et al. 2007). SNPhylo software (University of Georgia, USA) can manipulate huge SNP data via various methods, such as principal component analysis (PCA) and relatedness analysis (Lee et al. 2014). However, a straightforward way of presenting genetic diversity from huge SNP data is still lacking because of the hurdles of pre-handling complicated raw data and of the subsequent graphic display when using the application tools mentioned above.
In the present study, a simple method based on SNPhylo software was developed to address the problems mentioned above and to enhance the current genetic diversity analysis pipeline. We used Excel to mask the scaffold symbols of public SNP data that cannot be recognized by SNPhylo, adjusted the SNPhylo output tree file with MEGA software (Tamura et al. 2013), and edited the graphic file using Adobe Illustrator CS6 (Adobe Systems Incorporated, USA). This study provides another useful way of analyzing and displaying genotyping data and should speed up the identification of genetic diversity in crop species.
To illustrate our pipeline, we used the public maize hapmap file as an example. A set of 251 maize inbred lines was chosen from the collection of publicly available germplasm from around the world, including North America, Africa, Europe, and Asia (Appendix A, modified from Flint-Garcia et al. (2005)). These lines represent the current public breeding lines from temperate, subtropical, and tropical germplasm as well as popcorn and sweetcorn lines. This collection has been frequently used for maize research, such as genetic diversity and association analyses (Liu et al. 2003; Flint-Garcia et al. 2005; Cook et al. 2012). The maize SNP data used in this study were downloaded from the PANZEA Genotypes database (see the URL: cbsusrv04.tc.cornell.edu/users/panzea/download.aspx?filegroupid=7) (Cook et al. 2012), and 251 samples were selected for analysis (Appendix B).
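The sample selection step can be sketched in Python as follows. This is only an illustration, not the procedure used in the study: it assumes a tab-delimited hapmap file whose first 11 columns are marker metadata followed by one column per sample, and the function name `select_samples` and the demo genotypes are hypothetical.

```python
# Assumption: a standard tab-delimited hapmap layout with 11 metadata columns
# (rs#, alleles, chrom, pos, strand, assembly#, center, protLSID, assayLSID,
# panelLSID, QCcode) followed by one genotype column per sample.
HAPMAP_META_COLS = 11


def select_samples(lines, wanted):
    """Keep the metadata columns plus the sample columns named in `wanted`."""
    wanted = set(wanted)
    header = lines[0].rstrip("\n").split("\t")
    keep = list(range(HAPMAP_META_COLS))
    keep += [i for i in range(HAPMAP_META_COLS, len(header)) if header[i] in wanted]
    selected = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        selected.append("\t".join(fields[i] for i in keep))
    return selected


# Hypothetical two-line demo file: a header and one SNP row.
meta = ["rs#", "alleles", "chrom", "pos", "strand", "assembly#",
        "center", "protLSID", "assayLSID", "panelLSID", "QCcode"]
demo = [
    "\t".join(meta + ["B73", "Mo17", "W22"]),
    "\t".join(["snp1", "A/G", "1", "100", "+"] + ["NA"] * 6 + ["A", "G", "A"]),
]
subset = select_samples(demo, ["B73", "W22"])
```

In practice the list of wanted names would come from the sample list in Appendix B, and the sample labels in the downloaded file may need to be matched against it first.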
Data were analyzed on the Ubuntu Operating System (Linux, UK). SNP data were analyzed by SNPhylo software (chibba.pgml.uga.edu/snphylo/). Then, the Newick file, i.e., the tree format file, was manipulated using MEGA software (ver. 6.0). Finally, the modified file was edited using Adobe Illustrator CS6. The whole pipeline is presented in Fig. 1.
Fig. 1 A flowchart of phylogenetic tree construction. DNA is extracted from fresh maize leaf tissues. DNA from maize inbred lines is analyzed with a sequencing machine, and the hapmap raw data are edited in Excel for input into the SNPhylo script. Finally, the resulting tree file is adjusted in MEGA, and the phylogenetic tree is finished in Adobe Illustrator.
Once tissues are prepared and DNA is extracted, the next step is to score SNP data. So far there are three main approaches to SNP genotyping: PCR-based amplicon assays, immobilized array platforms, and next-generation sequencing. PCR-based amplicon assays are typified by the KASP™ (LGC Group), TaqMan® (Thermo Fisher Scientific, USA), SNaPshot® Multiplex System (Thermo Fisher Scientific), and SNP Type™ (Fluidigm, USA) assays, which use allele-specific PCR combined with fluorescence labeling technology to identify SNPs (Thomson 2014; Campbell et al. 2015). Immobilized array platforms, such as Illumina (BeadArray and GoldenGate) and Affymetrix (GeneChip and Axiom) (Ragoussis 2009), use the fluorescence intensity of hybridization signals between specific predesigned probes and fragmented DNA to detect SNP site variations (LaFramboise 2009). NGS comprises short-read technologies, such as Illumina and Ion Torrent, as well as long-read technologies, such as Single Molecule Real Time (SMRT) sequencing (Pacific Biosciences, USA) and Oxford Nanopore Technologies (Thomson 2014; Cornelis et al. 2017; Weirather et al. 2017). A very common use of these technologies is SNP genotyping by calling SNP variants from whole genome or transcriptome sequences, as in genotyping by sequencing (GBS) (He et al. 2014). Once SNP data are prepared, the next step is editing and analysis. Here we used the selected public maize SNP hapmap data to demonstrate the pipeline.
To obtain a compatible import format, the maize SNP data were edited manually. In maize SNP sequencing data, especially in hapmap format, the chromosome symbols in the last few rows are usually scaffold names rather than chromosome numbers. These rows are not recognized by SNPhylo because the software was developed only for numeric chromosomes. Therefore, the scaffold names in these rows were replaced with "not-real" chromosome numbers, one for each scaffold, to ensure that SNPhylo was able to detect these non-chromosomal SNP data. To illustrate this process, we used the undetectable scaffold part of the public hapmap data. For example, scaffold_252 was named chromosome 11, and scaffold_507 was named chromosome 12, because maize has 10 basic chromosomes. Data were imported into SNPhylo and analyzed with the following script: snphylo.sh -H HapMap_file -m 0.05 -a 13 -A, where -H specifies the hapmap file, -m is the minor allele frequency, -a is the total number of autosomes in the modified hapmap file, and -A performs multiple alignment by MUSCLE. Further options for this script depend on the specific analysis plan, and detailed information is available on the SNPhylo software website (Lee et al. 2014).
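The scaffold substitution described above was done by hand in Excel. A scripted equivalent could look like the following minimal sketch, which assumes a tab-delimited hapmap file with the chromosome label in the third column; the function name `renumber_scaffolds` and the demo rows are hypothetical and are not part of SNPhylo.

```python
def renumber_scaffolds(lines, n_chromosomes=10, chrom_col=2):
    """Replace non-numeric chromosome labels with consecutive 'not-real' numbers.

    Assumptions: `lines` holds tab-delimited hapmap rows with a header in the
    first line, and the chromosome label sits in column index `chrom_col`.
    Labels are numbered in order of first appearance, starting just after the
    last real chromosome (10 for maize).
    """
    mapping = {}
    edited = []
    for i, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        if i > 0:  # leave the header row untouched
            label = fields[chrom_col]
            if not label.isdigit():
                if label not in mapping:
                    mapping[label] = str(n_chromosomes + len(mapping) + 1)
                fields[chrom_col] = mapping[label]
        edited.append("\t".join(fields))
    return edited, mapping


# Hypothetical demo rows; only the first four hapmap columns are shown.
demo = [
    "rs#\talleles\tchrom\tpos",
    "snp1\tA/G\t10\t100",
    "snp2\tC/T\tscaffold_252\t200",
    "snp3\tC/T\tscaffold_507\t300",
    "snp4\tA/T\tscaffold_252\t400",
]
edited, mapping = renumber_scaffolds(demo)
```

Keeping the returned mapping makes it possible to trace each "not-real" chromosome number back to its scaffold after the analysis.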
The original output files generated by SNPhylo made it difficult to see the relationships among hundreds of samples. Huge data sets therefore need to be presented in a more concise and intuitive form. To address this problem, each output format was examined to find one compatible with downstream analysis software. We found that the tree file was a Newick format file that can be imported into MEGA software and edited manually (Tamura et al. 2013). The tree format file was imported into MEGA with Display Newick Trees in the User Tree option for subsequent editing. With the swap tree and resize options, we finally obtained a more concise phylogenetic tree (Appendix C).
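Before opening the tree in MEGA, it can help to confirm that every expected sample appears as a tip in the Newick file. The sketch below is a simple illustration, assuming unquoted tip labels without spaces; the function name `newick_leaf_names` and the example tree string are hypothetical, not actual SNPhylo output.

```python
import re


def newick_leaf_names(newick):
    """Return the tip labels of a Newick string, in order of appearance.

    Assumption: labels are unquoted and contain no whitespace. A tip label
    directly follows "(" or ","; internal-node labels or support values
    follow ")" and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


# Hypothetical three-tip tree with branch lengths.
example_tree = "((B73:0.1,Mo17:0.2):0.05,CML69:0.3);"
tips = newick_leaf_names(example_tree)
```

Comparing the resulting list against the sample names in the hapmap header is a quick check that no line was dropped during the analysis.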
To better understand the distribution of these maize inbred lines, we performed color-grouping on the modified output phylogenetic tree (Appendix C) with Adobe Illustrator CS6 software (Adobe Systems Incorporated, USA). Based on the background of typical representative inbred lines and the phylogenetic distances among them, three major groups and two minor groups were constructed (Fig. 2). The red group was inferred to be tropical_subtropical (TS), since the majority of its inbred lines come from CIMMYT and subtropical areas, with representative lines such as CML69, NC296, Tzi 10, and Ki11. The green group represented the stiff stalk (SS) lines with the typical inbred line B73. The blue group represented the non-stiff stalk (NSS) lines with common lines including A619, W22, and Mo17. Of the two minor groups, the purple group contained several sweet corn inbred lines, whereas the orange group mainly comprised popcorn inbred lines (Fig. 2). These results indicated that this color-grouping method was reliable and, more importantly, highly consistent with previous data (Liu et al. 2003; Flint-Garcia et al. 2005).
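The color assignment was done manually in Adobe Illustrator. As a rough sketch of the same bookkeeping, the group colors and representative lines named above can be held in a lookup table. The names `GROUP_COLORS`, `REPRESENTATIVES`, and `tip_color` are hypothetical, and the default color for unassigned lines is an assumption.

```python
# Group colors as described in the text.
GROUP_COLORS = {
    "TS": "red",
    "SS": "green",
    "NSS": "blue",
    "sweet corn": "purple",
    "popcorn": "orange",
}

# Only the representative lines named in the text; a full table would need
# the group of every inbred line in the tree.
REPRESENTATIVES = {
    "CML69": "TS", "NC296": "TS", "Tzi 10": "TS", "Ki11": "TS",
    "B73": "SS",
    "A619": "NSS", "W22": "NSS", "Mo17": "NSS",
}


def tip_color(name, default="black"):
    """Return the group color for an inbred line, or `default` if unassigned."""
    group = REPRESENTATIVES.get(name)
    return GROUP_COLORS.get(group, default)


colors = {name: tip_color(name) for name in ["B73", "Mo17", "CML69", "unknown_line"]}
```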
Fig. 2 Final step: editing the output file with Adobe Illustrator. The red group is tropical_subtropical (TS), the green group is stiff stalk (SS), the blue group is non-stiff stalk (NSS), the purple group is sweet corn, the orange group is popcorn, and the out-of-group is mixed. The close-up box is an enlarged part of the stiff stalk (SS) group.
Progress in analyzing and presenting huge genotyping data has been highly driven by the development of many outstanding pieces of software and tools. For example, PowerMarker, an early well-designed software, is able to illustrate clustering relationships and structure in population members by processing simple sequence repeats (SSRs), restriction fragment length polymorphisms (RFLPs), SNPs, and other format files (Liu and Muse 2005). Structure software plays a substantial role in analyzing genotyping data based on its widespread input data formats, manipulating platforms, efficient analyzing methods, and multiple considerations (Pritchard et al. 2009; Porras-Hurtado et al. 2013). Another well-known tool is PLINK, which has high computational performance and significant analyzing features (Purcell et al. 2007). Compared to the programs mentioned above, the SNPhylo tool has its own specific advantages, including better accessibility of huge amounts of data and a highly automatic data import process without obscure data pre-handling (Table 1). However, the original SNPhylo manual did not show the details of importing a file containing non-numeric chromosomes or of plotting a detailed phylogenetic tree of huge SNP data with MEGA and Adobe Illustrator. Therefore, the simple method developed in this study provides a more elaborate way to parse genotyping data, which can meet the increasing demand for analyzing and visualizing the huge SNP data generated by large-scale genotype sequencing. Apart from SNPhylo, the output files from Structure and PLINK can also be subjected to subsequent analysis (Table 1). For example, the Structure output file can be plotted and presented by the Cluster Matching and Permutation Program (CLUMPP) (Jakobsson and Rosenberg 2007), distruct (Rosenberg 2004), Cluster Markov Packager Across K (CLUMPAK) (Kopelman et al. 2015), and Structure Harvester (Earl 2012). PLINK results can be displayed by gPLINK (zzz.bwh.harvard.edu/plink/gplink.shtml) and Haploview (Barrett et al. 2005; Barrett 2009). However, these methods still have their own limitations. The first limitation is that the raw data file must be converted into the required format. Structure software requires Genetic Analysis in Excel (GenAlEx) (Peakall and Smouse 2006), xmfa2struct (www.xavierdidelot.xtreemhost.com/clonalframe.htm), or Clustal X/W (Larkin et al. 2007) to convert sequence data into the Structure input format.
The input format of PLINK is a PED/MAP file, which also requires tools or its own commands, such as the recode option, to convert SNP data. By contrast, our first step used only Excel to substitute extra chromosome numbers for the scaffold names, making maximum use of the SNP data generated from crop plants such as maize inbreds. The second limitation is the complicated subsequent analysis, such as plotting. In Structure software, output format files, such as indivq, popq, names, languages, and perm, as well as parameter settings, such as K value modification, number of individuals (NUMINDS), and number of pre-defined populations (NUMPOPS), are necessary for plotting population structure. PLINK can stratify population structure with its own mds-plot command, which is believed to be slightly worse than PCA at correcting for population structure in some genome-wide association analyses (Wang et al. 2009), and the results then need to be plotted with R scripts. In our study, however, the last two steps need only one tree format file for adjusting branches in MEGA and plotting groups in Adobe Illustrator, which seems much easier, more efficient, and faster than the methods discussed above.
Table 1 List of the main genetic diversity analysis software programs
Thus, our method provides an alternative way to better analyze and present genetic diversity among maize inbred lines, which should speed up the identification and clustering of germplasm resources as well as further breeding and association analysis.
Acknowledgements
This work was financially supported by the National Natural Science Foundation of China (31361140364), the National Major Project for Developing New GM Crops, Ministry of Agriculture, China (2016ZX080009-001), and the Agricultural Science and Technology Innovation Program (ASTIP) of Chinese Academy of Agricultural Sciences to Xie Chuanxiao.
Appendices associated with this paper are available at http://www.ChinaAgriSci.com/V2/En/appendix.htm
Journal of Integrative Agriculture, 2018, Issue 9