5 Open genome list

Tartary buckwheat

Overview

The Tartary buckwheat (Fagopyrum tataricum) genome project was initiated through the Post genome Program by a consortium led by Yul Ho Kim, Su Jeong Kim, Hwang Bae Sohn, Sunghoon Lee, Dong-Ha Oh, and Sin-Gi Park.

De novo genome sequencing of Tartary buckwheat began in early 2014 and was completed in late 2017. To obtain a high-quality draft genome assembly, we produced a total of 43.83 Gb and 32.17 Gb of sequence from the Illumina paired-end (PE) and Single-Molecule Real-Time (SMRT) sequencing platforms, respectively, corresponding to 70x (Illumina PE) and 52x (SMRT) coverage. A hybrid assembly followed by scaffolding, gap filling, and removal of redundancy resulted in a final draft assembly of 526.94 Mb in 2,566 scaffolds, with 50% of the total sequence captured in 156 scaffolds larger than 886,968 bp (N50). We predicted a total of 43,771 putative protein-coding gene models occupying 19.33% of the genome, while 52.00% consisted of repetitive sequences and transposable elements, with Gypsy family long terminal repeat (LTR) retrotransposons being the most abundant class. We are currently preparing a paper on the draft genome of Tartary buckwheat.

Statistics

Genome
Approximately 526.94 Mb arranged in 2,566 scaffolds
Approximately 565.10 Mb arranged in 4,433 contigs
Scaffold N50 = 886,968 bp
Contig N50 = 463,432 bp
137 scaffolds larger than 1 Mbp, with more than 50% of the genome in 156 scaffolds

Loci
A total of 43,771 putative protein-coding gene models were predicted.

Sequencing, Assembly, and Annotation

Genome sequencing

We prepared both short-read (Illumina) and long-read (PacBio) libraries to cover the entire genome of F. tataricum. Sequencing libraries were prepared from genomic DNA for the Illumina HiSeq2500 (2 × 101 bp) and PacBio RSII (>3 Kb) platforms.
In brief, a short-insert (350 bp) paired-end (PE) library was constructed using the TruSeq DNA Library Prep Kit (Illumina) according to the manufacturer's instructions. Single Molecule Real Time (SMRT) bell libraries were prepared from the large scale amplified cDNA as recommended by Pacific Biosciences (Palo Alto, U.S.A). SMRT bell templates were bound to polymerase using the DNA polymerase binding kit P6 v2 primers.

How was the assembly generated?

Whole-genome de novo assembly of F. tataricum was performed using a hybrid approach, as follows. Long SMRT sequencing reads were assembled using Fast Alignment and CONsensus (FALCON) (Chin et al., 2016), whereas 350-bp short-insert reads were assembled using SOAPdenovo2 (Luo et al., 2012) with default parameters. Before assembly, all Illumina reads were preprocessed (adapter, quality, and duplicate trimming). The initial contigs from the two assemblies were merged using HaploMerger2 (Huang et al., 2017). Both short and long reads were then used to construct scaffolds with SSPACE (Boetzer et al., 2011), after which gaps were filled with the short-read data using GapFiller (Nadalin et al., 2012). We used CoGE SynMap (Lyons et al., 2008) and LASTZ (Kiełbasa et al., 2011) to detect and filter out redundant genomic regions (>98% sequence identity over >7 Kb) to generate the final draft assembly. The hybrid assembly resulted in a final draft assembly of 526.94 Mb in 2,566 scaffolds, with 50% of the total sequence captured in 156 scaffolds larger than 886,968 bp (N50).

Is it accurate?

To test the accuracy of the genome assembly, we applied classical Sanger sequencing to two BAC clones of 121.85 Kb and 61.50 Kb that contain gene loci for the homologs of two previously known Fagopyrum FLS coding sequences. The 121.85 Kb BAC clone (“29-J17”) contained a gene locus for FtFLS1 (NCBI GenBank ID: JF27561), while the 61.50 Kb BAC clone (“32-I01”) included a locus for a partial sequence of a putative FLS (GenBank ID: HM357805).
Both BAC clone sequences, assembled from contigs generated by Sanger sequencing, showed >99% sequence identity with their corresponding genomic regions in the draft genome assembly.

Gene prediction

We predicted gene models in the draft genome of F. tataricum cv. Daegwan by combining evidence from transcriptome and protein sequence alignments with ab initio prediction on repeat-masked genome sequences. GeneMark-ET (Lomsadze et al., 2014) was used to perform iterative training and to generate initial gene structures from RNA-Seq data. AUGUSTUS (Stanke and Morgenstern, 2005) was then used to perform de novo prediction with gene models trained by GeneMark-ET, together with exon-intron boundary information from the transcriptome and protein sequence alignments. We used TopHat (Trapnell et al., 2012) for RNA-Seq alignment and Exonerate (Slater and Birney, 2005) for alignment of protein sequences from related species. We annotated the deduced protein sequences through BLASTP searches with an e-value cutoff of 1e-10 against the NCBI non-redundant database, UniProt, and InterProScan. The occurrence and frequency of repeats, including retrotransposons, DNA transposons, microsatellites, and other repeats, were screened using RepeatMasker (Tarailo-Graovac and Chen, 2009). The repeat-masked scaffolds were then used for gene prediction as described above.

Is it complete?

Compared with the draft genome of F. tataricum cv. Pinku, the draft genome of F. tataricum cv. Daegwan has a high number of annotated genes and duplicated BUSCOs. We are therefore currently working on improving the draft genome of F. tataricum cv. Daegwan.

Contacts
Yul Ho Kim (email: kimyuh77@korea.kr)
Highland Agriculture Research Institute, National Institute of Crop Science, Rural Development Administration, Pyeongchang 25342, Korea

References:

Boetzer, M., Henkel, C. V., Jansen, H.J., Butler, D. and Pirovano, W.
(2011) Scaffolding preassembled contigs using SSPACE. Bioinformatics, 27, 578–579. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21149342

Chin, C.-S., Peluso, P., Sedlazeck, F.J., et al. (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods, 13, 1050–1054. Available at: http://www.ncbi.nlm.nih.gov/pubmed/27749838

Huang, J., Deng, J., Shi, T., et al. (2017) Global transcriptome analysis and identification of genes involved in nutrients accumulation during seed development of rice tartary buckwheat (Fagopyrum Tararicum). Sci Rep, 7, 1–14.

Kiełbasa, S.M., Wan, R., Sato, K., Horton, P. and Frith, M.C. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res, 21, 487–493. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3044862&tool=pmcentrez&rendertype=abstract

Lomsadze, A., Burns, P.D. and Borodovsky, M. (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res, 42, e119. Available at: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gku557

Luo, R., Liu, B., Xie, Y., et al. (2012) SOAPdenovo2: an empirically improved memory efficient short-read de novo assembler. Gigascience, 1, 18. Available at: http://www.ncbi.nlm.nih.gov/pubmed/23587118

Lyons, E., Pedersen, B., Kane, J., et al. (2008) Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids. Plant Physiol, 148, 1772–1781.

Nadalin, F., Vezzi, F. and Policriti, A. (2012) GapFiller: a de novo assembly approach to fill the gap within paired reads. BMC Bioinformatics, 13 Suppl 1, S8. Available at: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S14-S8

Stanke, M. and Morgenstern, B. (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res, 33, W465-7.
Available at: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gki458

Slater, G.S.C. and Birney, E. (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics, 6, 31. Available at: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-31

Tarailo-Graovac, M. and Chen, N. (2009) Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. In Curr Protoc Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons, Inc., p. Unit 4.10. Available at: http://www.ncbi.nlm.nih.gov/pubmed/19274634
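A note on the N50 figures quoted in the Tartary buckwheat statistics above. N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly; the size of that set is often called L50. Some genome portals swap the two labels, so this is only one common convention. The following is a minimal Python sketch of that calculation. The function name `n50_l50` and the toy lengths are illustrative assumptions, not part of the project's pipeline.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths in bp.

    N50: length of the sequence at which the running total of the
         longest-first sorted lengths first reaches half the assembly.
    L50: how many sequences were needed to reach that point.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy example (hypothetical lengths, not real scaffold data)
print(n50_l50([10, 8, 6, 4, 2]))
```

Applied to the real scaffold lengths, this kind of calculation is what yields a statement such as "50% of the total sequence captured in 156 scaffolds larger than 886,968 bp".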

Antheraea yamamai (Japanese Oak Silkmoth)

Overview

Antheraea yamamai, also known as the Japanese oak silk moth, is a wild species of silk moth. Silk produced by A. yamamai, referred to as tensan silk, differs drastically from the common silk produced by the domesticated silkworm, Bombyx mori. Silk moths can be categorized into two families: Bombycidae and Saturniidae. Saturniidae has been estimated to contain approximately 1,861 species in 162 genera and is known as the largest family in Lepidoptera. Among the many species in the family Saturniidae, only a few, including A. yamamai, can be utilized for silk production. For whole-genome sequencing, we selected one male sample (Ay-7-male1) from a breeding line (Ay-7) of A. yamamai raised at the National Academy of Agricultural Science, Rural Development Administration, Korea. A total of 147 Gb of genomic data and 76 Gb of transcriptomic data were generated for this study. We present the genome sequence of A. yamamai, the first published genome in the family Saturniidae, with gene expression data collected from ten different body organ tissues.

Statistics

Genome
A total of 147 Gb of sequence was generated using the Illumina and PacBio sequencing platforms.
This corresponds to approximately 210-fold coverage based on the 700 Mb estimated genome size of A. yamamai.
The assembled genome of A. yamamai was 656 Mb (>2 Kb) in 3,675 scaffolds.
The N50 length of the assembly was 739 Kb, with a 34.07% GC ratio.
Identified repeat elements covered 37.33% of the total genome, and the completeness of the constructed genome assembly was estimated to be 96.7% by BUSCO v2 analysis.

Loci
A total of 76 Gb of transcriptomic data was generated for this study.
A total of 21,124 genes were identified using Evidence Modeler based on the gene prediction results obtained from 3 different methods (ab initio, RNA-seq based, known-gene based).
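The fold-coverage figure in the statistics above follows from dividing the total sequenced bases by the estimated genome size. A minimal Python sketch of that arithmetic is shown below; the function name `fold_coverage` is an illustrative assumption, and the inputs are the rounded values quoted in the text.

```python
def fold_coverage(total_bases, genome_size):
    """Sequencing depth as total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Values quoted above: 147 Gb of sequence, 700 Mb estimated genome size
depth = fold_coverage(147e9, 700e6)
print(f"{depth:.0f}x")
```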
Assembly

Before conducting genome assembly, we conducted k-mer distribution analysis using a 350 bp paired-end library in order to estimate the size and characteristics of the A. yamamai genome.

(Figure: the 19-mer distribution of the A. yamamai genome using a 350 bp paired-end library.)

In the 19-mer distribution analysis, the genome size of A. yamamai was estimated to be 709 Mb. Next, we conducted error correction on the Illumina paired-end libraries using the error correction module of ALLPATHS-LG before the initial contig assembly process (ALLPATHS-LG, RRID:SCR_010742). After error correction, initial contig assembly with the 350 bp and 700 bp libraries was conducted using SOAPdenovo2 with the parameter option set at K=19; this approach showed the best assembly statistics compared to other assemblers and parameters (SOAPdenovo2, RRID:SCR_014986).

At each scaffolding step, SOAP Gapcloser[21] with the -l 155 and -p 31 parameters was repeatedly used to close the gaps within each scaffold.

After scaffolding was performed using SSPACE-LongRead with Illumina synthetic long read data, the total number of assembled scaffolds was reduced from 398,446 to 24,558. The average scaffold length was also extended from 1.7 Kb to 24.8 Kb. However, the N50 length of the assembled scaffolds improved only modestly (from approximately 91 Kb to 112 Kb).

After final scaffolding using PacBio long reads, the number of scaffolds was reduced to 3,675 and the N50 length was extended from 112 Kb to 739 Kb.

Gene Prediction

Three different approaches were used for gene prediction in the A. yamamai genome: ab initio, RNA-seq transcript-based, and protein homology-based.

For RNA-seq transcript-based prediction, transcriptome data generated from ten organ tissues of A. yamamai were aligned to the assembled genome and gene information was predicted using Cufflinks[44] (Cufflinks, RRID:SCR_014597).
The longest CDS sequences were identified from the Cufflinks results using Transdecoder. For the homology-based approach, all known genes of the order Lepidoptera in the NCBI database were aligned using PASA. The final gene set of the A. yamamai genome contains 21,124 genes.

The average gene length was 8,331 bp with a 38.76% GC ratio, and the number of exons per gene was 4.44. To identify the function of predicted genes, Swiss-Prot, Uniref100, the NCBI NR database, and gene information of B. mori and D. melanogaster were used for sequence similarity searches with blastp.

Contacts
Seong-Ryul Kim (email: ksr319@korea.kr)
Seong-Wan Kim (email: tarupa@korea.kr)

Reference Publication
Kim SR, Kwak W, Kim H, Caetano-Anolles K, Kim KY, Kim SB, Choi KH, Kim SW, Hwang JS, Kim M, Kim I, Goo TW, Park SW. Genome sequence of the Japanese oak silk moth, Antheraea yamamai: the first draft genome in the family Saturniidae. Gigascience. 2018 Jan 1;7(1):1-11. doi: 10.1093/gigascience/gix113.
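The A. yamamai assembly section above reports a genome size estimate from the 19-mer distribution but does not state the formula used. One common approach, shown here purely as an assumption, divides the total number of counted k-mers by the depth at the main peak of the k-mer histogram, after discarding low-depth k-mers that mostly reflect sequencing errors. The function name `estimate_genome_size`, the `min_depth` cutoff, and the toy histogram are all hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping depth -> number of distinct k-mers seen at
               that depth (the usual output shape of a k-mer counter).
    min_depth: depths below this are treated as error k-mers and ignored
               (an assumed cutoff; in practice it is read off the plot).

    Returns (estimated_size, peak_depth).
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers: the main (homozygous) peak
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences across the retained part of the histogram
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical counts, not the A. yamamai data)
toy = {1: 1000, 2: 200, 19: 50, 20: 100, 21: 50}
print(estimate_genome_size(toy))
```

This simple estimate ignores heterozygosity and repeat structure, so dedicated tools can give somewhat different values from the same histogram.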

Hypsizygus marmoreus Haemi 51987-8

Overview

We sequenced a reference strain of H. marmoreus (Haemi 51987–8). We evaluated various assembly strategies, and the combination of Allpaths and PBJelly produced the best assembly. The resulting genome was 42.7 Mbp in length and annotated with 16,627 gene models. A putative gene (Hypma_04324) encoding the antifungal and antiproliferative hypsin protein was identified, with 75% sequence identity to the previously known N-terminal sequence. Carbohydrate active enzyme analysis displayed the typical feature of white-rot fungi, in which auxiliary activity and carbohydrate-binding modules are enriched. The genome annotation revealed four terpene synthase genes responsible for terpenoid biosynthesis. From the gene tree analysis, we found that terpene synthase genes can be classified into six clades. The four terpene synthase genes of H. marmoreus belonged to four different groups, which implies that they may be involved in the synthesis of terpenes with different structures. A terpene synthase gene cluster containing known biosynthesis and regulatory genes was well conserved in Agaricomycetes genomes.

Statistics

Genome
Approximately 42.71 Mb arranged in 235 scaffolds
Approximately 42.68 Mb arranged in 278 contigs (~0.06% gap)
Scaffold N50 (L50) = 17 (764.8 kbp)
Contig N50 (L50) = 20 (621.3 kbp)
83 scaffolds larger than 50 Kbp, with 96.68% of the genome in scaffolds larger than 50 Kbp

Assembly
We selected the Allpaths+PBJelly assembly for further analyses. The final assembly had a size of 42,710,661 bp, comprising 235 scaffolds/278 contigs with 287.3× sequence coverage. The GC percentage was 49.64%. We estimated the genome size as 43.0 Mbp using the k-mer frequency calculation of Illumina paired-end reads.

NCBI GenBank Records
Release Date: 25/7/2018
BioProject: PRJNA312409
Accession ID: LUEZ00000000

Is it complete?

Genome completeness was calculated using BUSCO v3.0 at the gene level.
Only 5 of 1335 single-copy entries were missing, indicating >99% genome completeness. RNA-seq reads were mapped to the genome, and 97.27% of the reads aligned.

Is it accurate?

The transcriptome assembly aligns to the genome with >99% identity. In addition, 97.35% of predicted genes are complete, implying an accurate genome assembly largely free of sequencing or assembly errors.

What about polyploidy?

It is a haploid genome.

Gene prediction

Using the FunGAP pipeline (Min, 2017), we predicted 16,627 protein-coding genes with an average size of 1586.1 nt. Of these protein-coding genes, 14,179 genes (85.3%) were supported by assembled transcripts, and this included 10,522 (63.3%) highly supported genes (> 90% coverage). The quality of the gene prediction was evaluated by comparing the predictions of three programs inside the FunGAP pipeline: Augustus 3.2.1 (Stanke, 2005), Braker 1.8 (Hoff, 2016), and Maker 2.31.8 (Cantarel, 2008). Approximately half of the predicted genes were functionally annotated; in total, 7786 genes (46.8%) were annotated using Pfam domains, and 7447 genes (44.8%) were annotated using SwissProt. The dominant functions included WD, F-box, protein kinase, cytochrome P450, and major facilitator superfamily domains, similar to those observed in other mushroom genomes (Gupta, 2018, Yuan, 2017). The genome contained 1793 genes encoding secreted proteins. We identified 1262 noncoding RNA elements, including 171 tRNAs (9 of them selenocysteine tRNAs), 191 small nucleolar RNAs (snoRNAs) from 127 different families, and 224 microRNAs from 90 different families.

References:

Min B, Grigoriev IV, Choi IG. FunGAP: Fungal Genome Annotation Pipeline using evidence-based gene model evaluation. Bioinformatics (Oxford, England). 2017;33(18):2936–7.

Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 2005; 33(Web Server):W465–7.
Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics (Oxford, England). 2016;32(5):767–9.

Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18(1):188–96.

Gupta DK, Ruhl M, Mishra B, Kleofas V, Hofrichter M, Herzog R, Pecyna MJ, Sharma R, Kellner H, Hennicke F, et al. The genome sequence of the commercially cultivated mushroom Agrocybe aegerita reveals a conserved repertoire of fruiting-related genes and a versatile suite of biopolymer-degrading enzymes. BMC Genomics. 2018;19(1):48.

Yuan Y, Wu F, Si J, Zhao YF, Dai YC. Whole genome sequence of Auricularia heimuer (Basidiomycota, Fungi), the third most important cultivated mushroom worldwide. Genomics. 2017. https://doi.org/10.1016/j.ygeno.2017.12.013

Contacts
jkim5aug@korea.kr; igchoi@korea.ac.kr

Reference Publication(s)
Genomic discovery of the hypsin gene and biosynthetic pathways for terpenoids in Hypsizygus marmoreus. BMC Genomics 19:789 (2018)
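The H. marmoreus completeness statement above (5 of 1335 BUSCO entries missing, hence >99%) can be checked with one line of arithmetic. The sketch below is illustrative only; the function name `percent_present` is an assumption, and it counts every non-missing entry as present, which may differ from how BUSCO itself separates complete and fragmented entries.

```python
def percent_present(total_entries, missing_entries):
    """Percentage of BUSCO entries that are not reported as missing."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    if not 0 <= missing_entries <= total_entries:
        raise ValueError("missing_entries must be between 0 and total_entries")
    return 100.0 * (total_entries - missing_entries) / total_entries


# Figures quoted above: 1335 entries in the lineage set, 5 missing
print(f"{percent_present(1335, 5):.2f}%")
```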

Platycodon grandiflorus (balloonflower)

Overview

The balloon flower (Platycodon grandiflorus) belongs to the bellflower family (Campanulaceae). Its root has been used in East Asia for over 2000 years as a traditional medicine and a popular food additive with therapeutic effects on bronchitis, asthma, tonsillitis, and pulmonary tuberculosis. The most important bioactive components of P. grandiflorus are platycosides, especially platycodin D. We present a whole-genome assembly of P. grandiflorus accompanied by its transcriptome and methylome data. The genome-wide analysis reveals the evolution of P. grandiflorus, which is specialized in platycoside biosynthesis as a medicinal herb. In particular, the triterpenoid saponin biosynthesis-related genes provide clues to the species-specific selection of key genes for platycoside biosynthesis and to their function.

Genome assembly and annotation of P. grandiflorus

Jangbaek-doraji, a cultivar of P. grandiflorus, was used for whole-genome sequencing after four generations of self-fertilization. The karyotype analysis confirmed the diploid genome of P. grandiflorus, with four metacentric and five sub-metacentric chromosome pairs. The k-mer analysis estimated the genome size to be approximately 694.4 Mb. We produced 474.5x sequencing coverage of Illumina short reads and 5.7x coverage of TruSeq synthetic long reads (TSLRs). A hybrid assembly resulted in a 680.1 Mb draft genome with 4,816 scaffolds. It covered 98.4% of the estimated P. grandiflorus genome size. The assembly captured 92.6% of the complete BUSCOs, with few fragmented or missing BUSCO genes, indicating a high-quality assembly. Assembly quality was also assessed by mapping the short reads back to the assembly; 98% of them were successfully aligned, with 25x sequencing coverage depth. The genome annotation of P. grandiflorus enabled us to identify candidate genes underlying secondary metabolite biosynthesis.
In the annotation, we predicted 40,018 non-redundant protein-coding genes with an average length of 5,019 bp from the repeat-masked genome, using evidence-driven gene prediction methods coupled with ab initio prediction.

Grifola frondosa ASI9006-11

Overview

For whole-genome sequencing, the monokaryotic strain G. frondosa 9006-11 (KCTC 46451) was used. Genomic DNA was extracted from the vegetative mycelia using a plant genomic MagExtractor kit (TOYOBO NPK-501) according to the manufacturer's instructions and sequenced on the PacBio single molecule real-time (SMRT) sequencing platform. From the four SMRT cells, we obtained 601,168 raw subreads with a total length of 4 Gb. The low-quality reads were filtered to produce 314,541 high-quality subreads with an average length of 12,229 bp for genome assembly. De novo assembly was performed using the Falcon assembly tool kit 0.2 and SMRT analysis 2.3.0. We prepared the cDNA library for RNA-seq from the mycelia of monokaryote 9006-11. The library was prepared with the TruSeq stranded mRNA prep kit and sequenced on the Illumina HiSeq 2500, generating 46 million reads. Low-quality bases (<20 Q-score) were trimmed and short reads (<20 bp in length) were excluded. The genes were predicted using Augustus 3.2.1, Braker 1.8 (http://exon.gatech.edu/genemark/braker1.html), and Maker 2.31.8. The resulting genome was 39.3 Mbp in length and annotated with 15,039 gene models.

Statistics

Genome
Approximately 39.3 Mb arranged in 127 scaffolds
Scaffold N50 (L50) = 8 (1.8 Mbp)
45 scaffolds larger than 50 Kbp, with 98.18% of the genome in scaffolds larger than 50 Kbp

Assembly
De novo assembly was performed using the Falcon assembly tool kit 0.2 and SMRT analysis 2.3.0.

NCBI GenBank Records
Release Date: 12/7/2016
BioProject: PRJNA314283
Accession ID: LUGG01000000

Is it complete?

Genome completeness was calculated using BUSCO v3.0 at the gene level. The genome has 91.31% completeness, where 1005 of 1335 BUSCO entries are complete and single-copy.

Is it accurate?

98.50% of predicted genes are complete, implying an accurate genome assembly largely free of sequencing or assembly errors.
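The G. frondosa statistics above report how many scaffolds exceed 50 Kbp and what share of the assembly they hold. The following Python sketch shows how such a figure can be derived from a list of scaffold lengths. The function name `share_above` and the toy lengths are illustrative assumptions, not values from this assembly.

```python
def share_above(lengths, threshold):
    """Count sequences longer than threshold and their share of the total.

    Returns (count, percent_of_total_bases).
    """
    total = sum(lengths)
    if total <= 0:
        raise ValueError("lengths must contain at least one positive value")
    long_ones = [n for n in lengths if n > threshold]
    return len(long_ones), 100.0 * sum(long_ones) / total


# Toy example (hypothetical lengths in Kbp, threshold 50 Kbp)
count, percent = share_above([100, 60, 40, 10], 50)
print(count, f"{percent:.2f}%")
```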
What about polyploidy?

It is a haploid genome.

Gene prediction

The genes were predicted using Augustus 3.2.1 (Stanke et al., 2006), Braker 1.8 (http://exon.gatech.edu/genemark/braker1.html), and Maker 2.31.8 (Holt and Yandell, 2011).

References:

Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W435-9.

Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011 Dec 22;12:491. doi: 10.1186/1471-2105-12-491.

Contacts
igchoi@korea.ac.kr

Reference Publication(s)