The Tartary buckwheat (Fagopyrum tataricum) genome project was initiated through the Post genome Program by a consortium led by Yul Ho Kim, Su Jeong Kim, Hwang Bae Sohn, Sunghoon Lee, Dong-Ha Oh, and Sin-Gi Park.
De novo genome sequencing of Tartary buckwheat began in early 2014 and was completed in late 2017. To obtain a high-quality draft genome assembly, we produced a total of 43.83 Gb and 32.17 Gb of sequence from the Illumina paired-end (PE) and Single-Molecule Real-Time (SMRT) sequencing platforms, respectively, corresponding to 70x (Illumina PE) and 52x (SMRT) coverage. A hybrid assembly followed by scaffolding, gap-filling, and removal of redundancy resulted in a final draft assembly of 526.94 Mb in 2,566 scaffolds, with 50% of the total sequence captured in 156 scaffolds of 886,968 bp or longer (N50). We predicted a total of 43,771 putative protein-coding gene models occupying 19.33% of the genome, while 52.00% of the genome consisted of repetitive sequences and transposable elements, with Gypsy family long terminal repeat (LTR) retrotransposons being the most abundant class. A paper describing the draft genome of Tartary buckwheat is currently in preparation.
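As a rough consistency check on the coverage figures above, the genome size implied by each data set can be back-calculated as total bases divided by fold coverage. This is only a sketch: the page does not state which genome size estimate was used to compute coverage, and the function name is hypothetical.

```python
def implied_genome_size_mb(total_bases_gb: float, fold_coverage: float) -> float:
    """Genome size (Mb) implied by total sequenced bases (Gb) and fold coverage."""
    return total_bases_gb * 1000.0 / fold_coverage


# Values reported on this page.
illumina_mb = implied_genome_size_mb(43.83, 70)  # Illumina PE: 43.83 Gb at 70x
smrt_mb = implied_genome_size_mb(32.17, 52)      # SMRT: 32.17 Gb at 52x

print(f"Illumina PE implies about {illumina_mb:.0f} Mb")  # about 626 Mb
print(f"SMRT implies about {smrt_mb:.0f} Mb")             # about 619 Mb
```

Both data sets imply a similar genome size of roughly 620 Mb, which suggests that the two coverage figures were computed against the same estimate.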
Approximately 526.94 Mb arranged in 2,566 scaffolds
Approximately 565.10 Mb arranged in 4,433 contigs
Scaffold N50 = 886,968 bp
Contig N50 = 463,432 bp
137 scaffolds larger than 1 Mb, with more than 50% of the genome in 156 scaffolds
A total of 43,771 putative protein-coding gene models were predicted.
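For readers unfamiliar with the N50 statistic used above, the following minimal sketch shows how N50 (the length of the sequence at which the cumulative sorted length reaches half of the total) and L50 (the number of sequences needed to reach it) are usually computed. The function name and the toy lengths are illustrative assumptions, not values from this assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths in bp."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50, 'count' is the L50
            return length, count


# Made-up toy lengths, only to show the calculation.
toy_lengths = [100, 80, 60, 40, 20]
print(n50_l50(toy_lengths))  # (80, 2)
```

In these terms, the scaffold statistics above correspond to an N50 of 886,968 bp and an L50 of 156 scaffolds.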
Sequencing, Assembly, and Annotation
We prepared both short-read (Illumina) and long-read (PacBio) libraries to cover the entire genome of F. tataricum. Sequencing libraries were prepared from genomic DNA for the Illumina HiSeq 2500 (2 × 101 bp) and PacBio RS II (>3 kb) platforms. In brief, a short-insert (350 bp) paired-end (PE) library was constructed using the TruSeq DNA Library Prep Kit (Illumina) according to the manufacturer's instructions. Single-Molecule Real-Time (SMRT) bell libraries were prepared from the large-scale amplified cDNA as recommended by Pacific Biosciences (Palo Alto, U.S.A.). SMRTbell templates were bound to polymerase using the DNA polymerase binding kit P6 v2 primers.
How was the assembly generated?
Whole-genome de novo assembly of F. tataricum was performed with a hybrid approach as follows. Long SMRT sequencing reads were assembled using Fast Alignment and CONsensus (FALCON) (Chin et al., 2016), whereas the 350-bp short-insert reads were assembled using SOAPdenovo2 (Luo et al., 2012) with default parameters. Before assembly, all Illumina reads were preprocessed (adapter trimming, quality trimming, and duplicate removal). The initial contigs from the two assemblies were merged using HaploMerger2 (Huang et al., 2017). Both short and long reads were then used to construct scaffolds with SSPACE (Boetzer et al., 2011), after which gaps were filled with the short-read data using GapFiller (Nadalin et al., 2012). We used CoGe SynMap (Lyons et al., 2008) and LASTZ (Kiełbasa et al., 2011) to detect and filter out redundant genomic regions (>98% sequence identity over >7 kb) to generate the final draft assembly. The hybrid assembly resulted in a final draft assembly of 526.94 Mb in 2,566 scaffolds, with 50% of the total sequence captured in 156 scaffolds of 886,968 bp or longer (N50).
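The redundancy criterion described above (>98% sequence identity over >7 kb) can be illustrated with a small filter over alignment records. This is only a sketch of the thresholding step: the `Hit` record, its field names, and the example hits are hypothetical, and it does not reproduce the actual SynMap/LASTZ output format or the rule used to decide which copy of a redundant region was removed.

```python
from dataclasses import dataclass

MIN_IDENTITY_PCT = 98.0  # threshold stated in the text (>98% identity)
MIN_ALIGNED_BP = 7_000   # threshold stated in the text (>7 kb)


@dataclass
class Hit:
    """Hypothetical alignment record between two genomic regions."""
    query: str
    target: str
    identity_pct: float
    aligned_bp: int


def is_redundant(hit: Hit) -> bool:
    """True if the hit exceeds both the identity and the length threshold."""
    return hit.identity_pct > MIN_IDENTITY_PCT and hit.aligned_bp > MIN_ALIGNED_BP


# Made-up hits, only to show the filter.
hits = [
    Hit("scaffold_A", "scaffold_B", 99.4, 12_500),  # long and near-identical
    Hit("scaffold_A", "scaffold_C", 99.9, 3_200),   # too short
    Hit("scaffold_D", "scaffold_E", 91.0, 20_000),  # too divergent
]
redundant = [h for h in hits if is_redundant(h)]
print([(h.query, h.target) for h in redundant])  # [('scaffold_A', 'scaffold_B')]
```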
Is it accurate?
To test the accuracy of the genome assembly, we applied classical Sanger sequencing to two BAC clones of 121.85 kb and 61.50 kb that contain gene loci for the homologs of two previously known Fagopyrum FLS coding sequences. The 121.85 kb BAC clone (“29-J17”) contained a gene locus for FtFLS1 (NCBI GenBank ID: JF27561), while the 61.50 kb BAC clone (“32-I01”) included a locus for a partial sequence of a putative FLS (GenBank ID: HM357805). Both BAC clone sequences, assembled from contigs generated by Sanger sequencing, showed >99% sequence identity with their corresponding genomic regions in the draft genome assembly.
We predicted gene models in the draft genome of F. tataricum cv. Daegwan by combining evidence from transcriptome and protein sequence alignments with ab initio prediction on repeat-masked genome sequences. GeneMark-ET (Lomsadze et al., 2014) was used to perform iterative training and to generate initial gene structures using RNA-Seq data. AUGUSTUS (Stanke and Morgenstern, 2005) was then used to perform de novo prediction with gene models trained by GeneMark-ET, together with exon-intron boundary information predicted from transcriptome and protein sequence alignments. We used TopHat (Trapnell et al., 2012) for RNA-Seq alignment and Exonerate (Slater and Birney, 2005) for alignment of protein sequences from similar species. We annotated the deduced protein sequences through BLASTP searches with an e-value cutoff of 1e-10 against the NCBI non-redundant database, UniProt, and InterProScan. The occurrence and frequency of repeats, including retrotransposons, DNA transposons, microsatellites, and other repeats, were screened using RepeatMasker (Tarailo-Graovac and Chen, 2009). The repeat-masked scaffolds were then used for gene prediction as described above.
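The e-value cutoff mentioned above can be applied to BLASTP results as in the following sketch. It assumes the standard 12-column tabular output (`-outfmt 6`), in which the e-value is the 11th column, and it treats the cutoff as inclusive; the page does not state the output format actually used, and the function name and example hits are made up for illustration.

```python
import csv
import io

EVALUE_CUTOFF = 1e-10  # cutoff stated in the text


def filter_hits(handle, cutoff=EVALUE_CUTOFF):
    """Keep (query, subject, evalue) for tabular BLAST rows with evalue <= cutoff."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        evalue = float(row[10])  # column 11 of -outfmt 6 is the e-value
        if evalue <= cutoff:
            kept.append((row[0], row[1], evalue))
    return kept


# Made-up rows in -outfmt 6 column order:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
example = io.StringIO(
    "gene1\tprotA\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t380\n"
    "gene2\tprotB\t40.0\t60\t36\t0\t1\t60\t5\t64\t0.002\t35.1\n"
)
print(filter_hits(example))  # [('gene1', 'protA', 1e-50)]
```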
Is it complete?
Compared with the draft genome of F. tataricum cv. Pinku, the draft genome of F. tataricum cv. Daegwan contains a high number of annotated genes and a high number of duplicated BUSCO genes. We are therefore currently working on improving the draft genome of F. tataricum cv. Daegwan.
Yul Ho Kim (email: firstname.lastname@example.org)
Highland Agriculture Research Institute, National Institute of Crop Science, Rural Development Administration, Pyeongchang 25342, Korea
Boetzer, M., Henkel, C. V., Jansen, H.J., Butler, D. and Pirovano, W. (2011) Scaffolding preassembled contigs using SSPACE. Bioinformatics, 27, 578–579. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21149342
Chin, C.-S., Peluso, P., Sedlazeck, F.J., et al. (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods, 13, 1050–1054. Available at: http://www.ncbi.nlm.nih.gov/pubmed/27749838
Huang, J., Deng, J., Shi, T., et al. (2017) Global transcriptome analysis and identification of genes involved in nutrients accumulation during seed development of rice tartary buckwheat (Fagopyrum Tararicum). Sci Rep, 7, 1–14
Kiełbasa, S.M., Wan, R., Sato, K., Horton, P. and Frith, M.C. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res, 21, 487–493. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3044862&tool=pmcentrez&rendertype=abstract
Lomsadze, A., Burns, P.D. and Borodovsky, M. (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res, 42, e119. Available at: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gku557
Luo, R., Liu, B., Xie, Y., et al. (2012) SOAPdenovo2: an empirically improved memory efficient short-read de novo assembler. Gigascience, 1, 18. Available at: http://www.ncbi.nlm.nih.gov/pubmed/23587118
Lyons, E., Pedersen, B., Kane, J., et al. (2008) Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids. Plant Physiol, 148, 1772–1781.
Nadalin, F., Vezzi, F. and Policriti, A. (2012) GapFiller: a de novo assembly approach to fill the gap within paired reads. BMC Bioinformatics, 13 Suppl 14, S8. Available at: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S14-S8
Stanke, M. and Morgenstern, B. (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res, 33, W465-7. Available at: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gki458
Slater, G.S.C. and Birney, E. (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics, 6, 31. Available at: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-31
Tarailo-Graovac, M. and Chen, N. (2009) Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. In Curr Protoc Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons, Inc., p. Unit 4.10. Available at: http://www.ncbi.nlm.nih.gov/pubmed/19274634