The Tartary buckwheat (Fagopyrum tataricum) genome project was initiated through the Post genome Program by a consortium led by Yul Ho Kim, Su Jeong Kim, Hwang Bae Sohn, Sunghoon Lee, Dong-Ha Oh, Sin-Gi Park.
De novo genome sequencing of Tartary buckwheat began in early 2014 and was completed in late 2017. To obtain a high-quality draft genome assembly, we produced a total of 43.83 Gb and 32.17 Gb of sequence from the Illumina paired-end (PE) and Single-Molecule Real-Time (SMRT) sequencing platforms, respectively, corresponding to 70x (Illumina PE) and 52x (SMRT) coverage. A hybrid assembly followed by scaffolding, gap-filling, and removal of redundancy resulted in a final draft assembly of 526.94 Mb in 2,566 scaffolds, with 50% of the total sequence captured in 156 scaffolds larger than 886,968 bp (N50). We predicted a total of 43,771 putative protein-coding gene models occupying 19.33% of the genome, while 52.00% consisted of repetitive sequences and transposable elements, with Gypsy family long terminal repeat (LTR) retrotransposons being the most abundant class. We are currently preparing a paper on the draft genome of Tartary buckwheat.
Approximately 526.94Mb arranged in 2,566 scaffolds
Approximately 565.10Mb arranged in 4,433 contigs
Scaffold N50 = 886,968bp
Contig N50 = 463,432bp
137 scaffolds larger than 1 Mbp, with more than 50% of the genome in 156 scaffolds
Total 43,771 putative protein-coding gene models were predicted.
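The N50 figures quoted above follow the usual definition: sort the sequences from longest to shortest and walk down the list until half of the total length is covered. The sketch below illustrates that definition on a toy list of lengths; the function name and the toy values are illustrative assumptions and are not taken from the project's pipeline.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    length; L50 is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy data (hypothetical): total = 300, so half = 150.
toy_lengths = [100, 80, 60, 40, 20]
print(n50_l50(toy_lengths))  # (80, 2)
```

Read this way, the Tartary buckwheat statistics say that the 156 longest scaffolds cover half of the assembly and the shortest of them is 886,968 bp long.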
Sequencing, Assembly, and Annotation
We prepared both short-read (Illumina) and long-read (PacBio) libraries to cover the entire genome of F. tataricum. Sequencing libraries were prepared from genomic DNA for the Illumina HiSeq2500 (2 × 101 bp) and PacBio RSII (>3 Kb) platforms. In brief, a short-insert (350 bp) paired-end (PE) library was constructed using the TruSeq DNA Library Prep Kit (Illumina) according to the manufacturer's instructions. Single Molecule Real Time (SMRT) bell libraries were prepared from the large scale amplified cDNA as recommended by Pacific Biosciences (Palo Alto, U.S.A). SMRT bell templates were bound to polymerase using the DNA polymerase binding kit P6 v2 primers.
How was the assembly generated?
Whole-genome de novo assembly for F. tataricum was performed with a hybrid approach as follows. Long SMRT sequencing reads were assembled using Fast Alignment and CONsensus (FALCON) (Chin et al., 2016), whereas 350-bp short-insert reads were assembled using SOAPdenovo2 (Luo et al., 2012) with default parameters. Before assembly, all Illumina reads were preprocessed (adapter, quality, and duplicate trimming). The initial contigs from the two assemblies were merged using HaploMerger2 (Huang et al., 2017). Both short and long reads were then used to construct scaffolds with SSPACE (Boetzer et al., 2011), after which gaps were filled with the short-read data using GapFiller (Nadalin et al., 2012). We used CoGE SynMap (Lyons et al., 2008) and LASTZ (Kiełbasa et al., 2011) to detect and filter out redundant genomic regions (>98% sequence identity over >7 Kb) to generate the final draft assembly. The hybrid assembly resulted in a final draft assembly of 526.94 Mb in 2,566 scaffolds, with 50% of the total sequence captured in 156 scaffolds larger than 886,968 bp (N50).
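The redundancy criterion above (>98% sequence identity over >7 Kb) can be expressed as a simple filter over pairwise alignment records. The following is a minimal sketch under stated assumptions: the `Alignment` record, its field names, and the example scaffold names are hypothetical, and the code does not reproduce the output format of SynMap or LASTZ.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical pairwise alignment summary (not a real tool's format)."""
    query: str
    target: str
    identity: float  # percent identity
    length: int      # aligned length in bp


def is_redundant(aln, min_identity=98.0, min_length=7_000):
    """True when two different sequences align above both thresholds."""
    return (aln.query != aln.target
            and aln.identity > min_identity
            and aln.length > min_length)


alignments = [
    Alignment("scaffold_A", "scaffold_B", 99.2, 12_500),  # exceeds both thresholds
    Alignment("scaffold_A", "scaffold_C", 99.5, 3_000),   # too short
    Alignment("scaffold_D", "scaffold_E", 95.0, 20_000),  # identity too low
]
flagged = [a for a in alignments if is_redundant(a)]
print([(a.query, a.target) for a in flagged])  # [('scaffold_A', 'scaffold_B')]
```

How a flagged region is then removed (for example, which copy is kept) is not described in the text, so the sketch stops at flagging.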
Is it accurate?
To test the accuracy of the genome assembly, we applied classical Sanger sequencing methods on two BAC clones of 121.85Kb and 61.50Kb that contain gene loci for the homologs of two previously known Fagopyrum FLS coding sequences. The 121.85Kb BAC clone (“29-J17”) contained a gene locus for FtFLS1 (NCBI GenBank ID: JF27561), while the 61.50Kb BAC clone (“32-I01”) included a locus for a partial sequence of putative FLS (GenBank ID: HM357805). Both BAC clone sequences, assembled from contigs generated by Sanger sequencing, showed >99% sequence identity with their corresponding genomic regions in the draft genome assembly.
We predicted gene models in the draft genome of F. tataricum cv. Daegwan by combining evidence from transcriptome and protein sequence alignments with ab initio prediction on repeat-masked genome sequences. GeneMark-ET (Lomsadze et al., 2014) was used to perform iterative training and to generate initial gene structures from RNA-Seq data. AUGUSTUS (Stanke and Morgenstern, 2005) was then used to perform de novo prediction with gene models trained by GeneMark-ET, together with exon-intron boundary information predicted from transcriptome and protein sequence alignments. We used TopHat (Trapnell et al., 2012) for RNA-Seq alignment and Exonerate (Slater and Birney, 2005) for protein sequence alignment with sequences from similar species. We annotated the deduced protein sequences through BLASTP searches with an e-value cutoff of 1e-10 against the NCBI non-redundant database and UniProt, and with InterProScan. The occurrence and frequency of repeats, including retrotransposons, DNA transposons, microsatellites, and other repeats, were screened using RepeatMasker (Tarailo-Graovac and Chen, 2009). The repeat-masked scaffolds were then used for gene prediction as described above.
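As an illustration of the e-value cutoff mentioned above, the sketch below filters BLASTP hits at 1e-10. It assumes the hits were written in BLAST's 12-column tabular format (`-outfmt 6`), where the e-value is the eleventh column. The text does not state which output format was used, the sample rows and identifiers are made up, and treating the cutoff as inclusive is also an assumption.

```python
# Hypothetical rows in BLAST tabular format (-outfmt 6); all IDs are made up.
SAMPLE = """\
gene1\tsp|P00001\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300
gene1\tsp|P00002\t40.0\t90\t54\t2\t5\t94\t10\t99\t2e-05\t45.1
gene2\tsp|P00003\t60.0\t150\t60\t1\t1\t150\t3\t152\t1e-10\t120
"""


def filter_hits(tabular_text, max_evalue=1e-10):
    """Keep (query, subject, evalue) for hits with evalue <= max_evalue."""
    kept = []
    for line in tabular_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])  # column 11 of outfmt 6 is the e-value
        if evalue <= max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept


for query, subject, evalue in filter_hits(SAMPLE):
    print(query, subject, evalue)
```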
Is it complete?
Compared with the draft genome of F. tataricum cv. Pinku, the draft genome of F. tataricum cv. Daegwan has a high number of annotated genes and duplicated BUSCOs. We are therefore currently working on improving the draft genome of F. tataricum cv. Daegwan.
Yul Ho Kim (email: firstname.lastname@example.org)
Highland Agriculture Research Institute, National Institute of Crop Science, Rural Development Administration, Pyeongchang 25342, Korea
Boetzer, M., Henkel, C. V., Jansen, H.J., Butler, D. and Pirovano, W. (2011) Scaffolding preassembled contigs using SSPACE. Bioinformatics, 27, 578–579. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21149342
Chin, C.-S., Peluso, P., Sedlazeck, F.J., et al. (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods, 13, 1050–1054. Available at: http://www.ncbi.nlm.nih.gov/pubmed/27749838
Huang, J., Deng, J., Shi, T., et al. (2017) Global transcriptome analysis and identification of genes involved in nutrients accumulation during seed development of rice tartary buckwheat (Fagopyrum Tararicum). Sci Rep, 7, 1–14
Kiełbasa, S.M., Wan, R., Sato, K., Horton, P. and Frith, M.C. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res, 21, 487–493. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3044862&tool=pmcentrez&rendertype=abstract
Lomsadze, A., Burns, P.D. and Borodovsky, M. (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res, 42, e119. Available at: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gku557
Luo, R., Liu, B., Xie, Y., et al. (2012) SOAPdenovo2: an empirically improved memory efficient short-read de novo assembler. Gigascience, 1, 18. Available at: http://www.ncbi.nlm.nih.gov/pubmed/23587118
Lyons, E., Pedersen, B., Kane, J., et al. (2008) Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids. Plant Physiol, 148, 1772–1781.
Nadalin, F., Vezzi, F. and Policriti, A. (2012) GapFiller: a de novo assembly approach to fill the gap within paired reads. BMC Bioinformatics, 13 Suppl 1, S8. Available at: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S14-S8
Stanke, M. and Morgenstern, B. (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res, 33, W465-7. Available at: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gki458
Slater, G.S.C. and Birney, E. (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics, 6, 31. Available at: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-31
Tarailo-Graovac, M. and Chen, N. (2009) Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. In Curr Protoc Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons, Inc., p. Unit 4.10. Available at: http://www.ncbi.nlm.nih.gov/pubmed/19274634
Antheraea yamamai, also known as the Japanese oak silk moth, is a wild species of silk moth. Silk produced by A. yamamai, referred to as tensan silk, differs drastically from common silk produced by the domesticated silkworm, Bombyx mori. Silk moths can be categorized into two families: Bombycidae and Saturniidae. Saturniidae has been estimated to contain approximately 1,861 species in 162 genera and is known as the largest family in Lepidoptera. Among the many species in family Saturniidae, only a few, including A. yamamai, can be utilized for silk production. For whole-genome sequencing, we selected one male sample (Ay-7-male1) from a breeding line (Ay-7) of A. yamamai raised at the National Academy of Agricultural Science, Rural Development Administration, Korea. A total of 147 Gb of genomic data and 76 Gb of transcriptomic data were generated for this study. We present the genome sequence of A. yamamai, the first published genome in family Saturniidae, with gene expression data collected from ten different body organ tissues.
A total of 147 Gb of sequence was generated using the Illumina and PacBio sequencing platforms, corresponding to approximately 210-fold coverage based on the 700 Mb estimated genome size of A. yamamai. The assembled genome of A. yamamai was 656 Mb (>2 kb) in 3,675 scaffolds.
The N50 length of assembly was 739 Kb with 34.07% GC ratio.
Identified repeat elements covered 37.33% of the total genome and the completeness of the constructed genome assembly was estimated to be 96.7% by BUSCO v2 analysis.
A total of 76Gb of transcriptomic data was generated for this study.
A total of 21,124 genes were identified using Evidence Modeler based on the gene prediction results obtained from 3 different methods (ab initio, RNA-seq based, known-gene based).
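The fold-coverage figure in the summary above is simply the total number of sequenced bases divided by the estimated genome size. A minimal worked check, using only the two numbers quoted in the text (the function name is an illustrative choice):

```python
def fold_coverage(total_bases, genome_size):
    """Average sequencing depth: sequenced bases per base of genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# 147 Gb of sequence over an estimated 700 Mb genome.
print(fold_coverage(147_000_000_000, 700_000_000))  # 210.0
```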
Before conducting genome assembly, we performed k-mer distribution analysis using a 350 bp paired-end library to estimate the size and characteristics of the A. yamamai genome.
[Figure: 19-mer distribution of the A. yamamai genome using a 350 bp paired-end library.]
In the 19-mer distribution analysis, the genome size of A. yamamai was estimated to be 709 Mb. Next, we performed error correction on the Illumina paired-end libraries using the error correction module of ALLPATHS-LG before the initial contig assembly (ALLPATHS-LG, RRID:SCR_010742). After error correction, initial contig assembly with the 350 bp and 700 bp libraries was conducted using SOAPdenovo2 with the parameter K=19; this approach showed the best assembly statistics compared with other assemblers and parameters (SOAPdenovo2, RRID:SCR_014986).
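A common way to turn a k-mer depth histogram into a genome size estimate is to divide the total number of k-mer occurrences by the depth at the main peak. The text does not say which estimator was used for the 709 Mb figure, so the sketch below is only an illustration of that general idea. The histogram values, the function name, and the `min_depth` threshold for discarding low-depth (likely erroneous) k-mers are all hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    K-mers below min_depth are treated as sequencing errors and ignored.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)            # depth of the main peak
    total_kmers = sum(d * n for d, n in usable.items())  # total k-mer occurrences
    return total_kmers / peak_depth


# Toy histogram (hypothetical): error k-mers at depth 1-2, main peak at depth 20.
toy = {1: 1000, 2: 300, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50}
print(estimate_genome_size(toy))  # 360.0
```

Real histograms also carry heterozygous and repeat peaks, which this simple estimator does not model.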
At each scaffolding step, SOAP Gapcloser with -l 155 and -p 31 parameters was repeatedly used to close the gaps within each scaffold.
After scaffolding was performed using SSPACE-LongRead with Illumina synthetic long read data, the total number of assembled scaffolds was reduced from 398,446 to 24,558, and the average scaffold length was extended from 1.7 Kb to 24.8 Kb. However, the N50 length of the assembled scaffolds improved only modestly (from approximately 91 Kb to 112 Kb).
After final scaffolding using PacBio long reads, the number of scaffolds was reduced to 3,675 and the N50 length was extended from 112 Kb to 739 Kb.
Three different algorithms were used for gene prediction of the A. yamamai genome: ab initio, RNA-seq transcript based, and protein homology-based approaches.
For RNA-seq transcript-based prediction, transcriptome data generated from ten organ tissues of A. yamamai were aligned to the assembled genome and gene information was predicted using Cufflinks (Cufflinks, RRID:SCR_014597). The longest CDS sequences were identified from the Cufflinks results using TransDecoder. For the homology-based approach, all known genes of order Lepidoptera in the NCBI database were aligned using PASA. The final gene set of the A. yamamai genome contains 21,124 genes.
The average gene length was 8,331 bp with a 38.76% GC ratio, and the average number of exons per gene was 4.44. To identify the function of predicted genes, Swiss-Prot, Uniref100, the NCBI NR database, and gene information of B. mori and D. melanogaster were employed for sequence similarity searches using blastp.
Seong-Ryul Kim (email : email@example.com)
Kim SR, Kwak W, Kim H, Caetano-Anolles K, Kim KY, Kim SB, Choi KH, Kim SW, Hwang JS, Kim M, Kim I, Goo TW, Park SW. Genome sequence of the Japanese oak silk moth, Antheraea yamamai: the first draft genome in the family Saturniidae. Gigascience. 2018 Jan 1;7(1):1-11. doi: 10.1093/gigascience/gix113.
For whole-genome sequencing, the monokaryotic strain G. frondosa 9006-11 (KCTC 46451) was used. Genomic DNA was extracted from the vegetative mycelia using a plant genomic MagExtractor™ kit (TOYOBO NPK-501) according to the manufacturer’s instructions and sequenced on the PacBio single molecule real-time (SMRT) sequencing platform. From the four SMRT cells, we obtained 601,168 raw subreads with a total length of 4 Gb. The low-quality reads were filtered out to produce 314,541 high-quality subreads with an average length of 12,229 bp for genome assembly. De novo assembly was performed using the Falcon assembly tool kit 0.2 and SMRT analysis 2.3.0. We prepared the cDNA library for RNA-seq from the mycelia of monokaryote 9006-11. The library was prepared with the TruSeq stranded mRNA prep kit and sequenced on the Illumina HiSeq 2500, generating 46 million reads. Low-quality bases (<20 Q-score) were trimmed and short reads (<20 bp in length) were excluded. The genes were predicted using Augustus 3.2.1, Braker 1.8 (http://exon.gatech.edu/genemark/braker1.html), and Maker 2.31.8. The resulting genome was 39.3 Mbp in length and annotated with 15,039 gene models.
Approximately 39.3Mb arranged in 127 scaffolds
Scaffold N50 (L50) = 8 (1.8 Mbp)
45 scaffolds larger than 50 Kbp, with 98.18% of the genome in scaffolds larger than 50 Kbp
NCBI GenBank Records
Release Date: 12/7/2016 BioProject: PRJNA314283 Accession ID: LUGG01000000
Is it complete?
Genome completeness was calculated using BUSCO v3.0 at the gene level. The genome has 91.31% completeness where 1005 of 1335 BUSCO entries are complete and single-copy.
Is it accurate?
Also, 98.50% of predicted genes are complete, implying accurate genome assembly without sequencing or assembly error.
What about polyploidy?
It is a haploid genome.
The genes were predicted using Augustus 3.2.1 (Stanke et al., 2006), Braker 1.8 (http://exon.gatech.edu/genemark/braker1.html), and Maker 2.31.8 (Holt and Yandell, 2011).
Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W435-9.
Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011 Dec 22;12:491. doi: 10.1186/1471-2105-12-491.
We sequenced a reference strain of H. marmoreus (Haemi 51987–8). We evaluated various assembly strategies, and the combination of Allpaths and PBJelly produced the best assembly. The resulting genome was 42.7 Mbp in length and annotated with 16,627 gene models. A putative gene (Hypma_04324) encoding the antifungal and antiproliferative hypsin protein, with 75% sequence identity to the previously known N-terminal sequence, was identified. Carbohydrate-active enzyme analysis displayed the typical feature of white-rot fungi, in which auxiliary activity and carbohydrate-binding modules were enriched. The genome annotation revealed four terpene synthase genes responsible for terpenoid biosynthesis. From the gene tree analysis, we found that terpene synthase genes can be classified into six clades. The four terpene synthase genes of H. marmoreus belonged to four different groups, which implies that they may be involved in the synthesis of terpenes with different structures. A terpene synthase gene cluster, which contained known biosynthesis and regulatory genes, was well conserved in Agaricomycetes genomes.
Approximately 42.71 Mb arranged in 235 scaffolds
Approximately 42.68 Mb arranged in 278 contigs (~0.06% gap)
Scaffold N50 (L50) = 17 (764.8 kbp)
Contig N50 (L50) = 20 (621.3 kbp)
83 scaffolds larger than 50 Kbp, with 96.68% of the genome in scaffolds larger than 50 Kbp
We selected the Allpaths+PBJelly assembly for further analyses. The final assembly had a size of 42,710,661 bp including 235 scaffolds/278 contigs with 287.3× sequence coverage. The GC percentage was 49.64%. We estimated the genome size as 43.0 Mbp using the k-mer frequency calculation of Illumina paired-end reads.
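GC percentage, as reported above for the final assembly, is conventionally the share of G and C among the unambiguous bases of the sequence. The sketch below shows that calculation on a toy string. Excluding N and other ambiguity codes from the denominator is an assumption, since the text does not state how gap bases were treated.

```python
def gc_percent(sequence):
    """Percentage of G/C among unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    gc = seq.count("G") + seq.count("C")
    return 100 * gc / acgt


# Toy sequence (hypothetical): 4 of the 6 unambiguous bases are G or C.
print(round(gc_percent("ATGCGCNN"), 2))  # 66.67
```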
NCBI GenBank Records
Release Date: 25/7/2018 BioProject: PRJNA312409 Accession ID: LUEZ00000000
Is it complete?
Genome completeness was calculated using BUSCO v3.0 at the gene level. Only 5 of 1335 single-copy entries were missing, indicating >99% genome completeness. RNA-seq reads were also mapped to the genome, and 97.27% of them aligned.
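The ">99%" statement can be checked directly from the two counts given above. This small sketch computes the share of BUSCO entries that are not missing. Note that it counts every non-missing entry (complete or fragmented) together, which is an assumption about how the figure was derived; the function name is illustrative.

```python
def percent_not_missing(total_entries, missing_entries):
    """Share of BUSCO entries that were found in any form, as a percentage."""
    if total_entries <= 0 or not 0 <= missing_entries <= total_entries:
        raise ValueError("counts must satisfy 0 <= missing <= total and total > 0")
    return 100 * (total_entries - missing_entries) / total_entries


# 5 of 1335 entries missing, as reported for H. marmoreus.
print(round(percent_not_missing(1335, 5), 2))  # 99.63
```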
Is it accurate?
The transcriptome assembly is aligned with the genome with >99% identity. Also, 97.35% of predicted genes are complete, implying accurate genome assembly without sequencing or assembly error.
What about polyploidy?
It is a haploid genome.
Using the FunGAP pipeline (Min, 2017), we predicted 16,627 protein-coding genes with an average size of 1586.1 nt. Of these protein-coding genes, 14,179 genes (85.3%) were supported by assembled transcripts, and this included 10,522 (63.3%) highly supported genes (> 90% coverage). The quality of the gene prediction was evaluated by comparing the predictions of three programs inside the FunGAP pipeline: Augustus 3.2.1 (Stanke, 2005), Braker 1.8 (Hoff, 2016), and Maker 2.31.8 (Cantarel, 2008). Approximately half of the predicted genes were functionally annotated; in total, 7786 genes (46.8%) were annotated using Pfam domains, and 7447 genes (44.8%) were annotated using SwissProt. The dominant functions included WD, F-box, protein kinase, cytochrome P450, and major facilitator superfamily domains, similar to what has been observed in other mushroom genomes (Gupta, 2018; Yuan, 2017). The genome contained 1793 genes encoding secreted proteins. We identified 1262 noncoding RNA elements containing 171 tRNAs, including 9 selenocysteine tRNAs, 191 small nucleolar RNAs (snoRNAs) from 127 different families, and 224 microRNAs from 90 different families.
Min B, Grigoriev IV, Choi IG. FunGAP: Fungal Genome Annotation Pipeline using evidence-based gene model evaluation. Bioinformatics (Oxford, England). 2017;33(18):2936–7.
Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 2005; 33(Web Server):W465–7.
Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics (Oxford, England). 2016;32(5):767–9.
Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18(1):188–96.
Gupta DK, Ruhl M, Mishra B, Kleofas V, Hofrichter M, Herzog R, Pecyna MJ, Sharma R, Kellner H, Hennicke F, et al. The genome sequence of the commercially cultivated mushroom Agrocybe aegerita reveals a conserved repertoire of fruiting-related genes and a versatile suite of biopolymer-degrading enzymes. BMC Genomics. 2018;19(1):48.
Yuan Y, Wu F, Si J, Zhao YF, Dai YC. Whole genome sequence of Auricularia heimuer (Basidiomycota, Fungi), the third most important cultivated mushroom worldwide. Genomics. 2017. https://doi.org/10.1016/j.ygeno.2017.12.013
Genomic discovery of the hypsin gene and biosynthetic pathways for terpenoids in Hypsizygus marmoreus. BMC Genomics 19:789 (2018)
Senna tora (L.) Roxb. (Cassia tora), a member of Leguminosae (subfamily Caesalpinioideae), is a semi-wild annual herb that grows widely in tropical and subtropical regions around the world (https://ildis.org). S. tora is a rich source of anthraquinones, flavonoids, and polysaccharides, and its seeds are used extensively in medicinal applications for gastrointestinal disorders, skin conditions, and ailments ranging from simple cough and hypertension to diabetes. Despite these useful applications, there have been few molecular and genomic studies of S. tora. To elucidate the genes responsible for anthraquinone biosynthesis in S. tora, the genome project was initiated through the National Agricultural Genome Program (NGAP) and the National Institute of Agricultural Sciences (NAS) Program.
Genome v1.0 (NGAP: 2014.04.01 ~ 2017.12.31)
Approximately 525.3 Mb is assembled in 16,931 contigs
Approximately 526.5 Mb is assembled in 4,513 scaffolds
Contig N50 = 250 kb, Longest contig = 1.32 Mb
Scaffold N50 = 2.70 Mb, Longest scaffold = 14.2 Mb
Loci v1.0 (NGAP: 2014.04.01 ~ 2017.12.31)
41,984 protein-coding genes have been predicted
Genome v2.0 (NAS Program: 2018.03.01 ~ 2020.12.31)
Approximately 502 Mb is assembled in 13 chromosomes, with 23.8 Mb of sequence in unmapped scaffolds
Approximately 526.4 Mb arranged in 732 contigs
Scaffold N50 = 41.7 Mb, Longest scaffold = 52.7 Mb
Contig N50 = 4.03 Mb, Longest contig = 14.9 Mb
Loci v2.0 (NAS Program: 2018.03.01 ~ 2020.12.31)
45,268 protein-coding genes have been predicted
The complete chloroplast genome of S. tora is 162,426 bp in size (Accession no. NC030193). The chloroplast genome harbors 110 annotated genes, including 77 protein-coding genes, 30 tRNA genes, and 4 rRNA genes. The complete mitochondrial genome of S. tora is 566,589 bp in length (Accession no. MF358693). A total of 63 genes are annotated including 36 protein-coding genes, 22 tRNA genes, and 5 rRNA genes.
Sang-Ho Kang (firstname.lastname@example.org)
Sang-Ho Kang, So Youn Won and Chang-Kug Kim (2019) The complete mitochondrial genome sequences of Senna tora (Fabales: Fabaceae). Mitochondrial DNA Part B 4, 1283-1284.
Sang-Ho Kang*, Hyun Oh Lee, Chang-Kug Kim, Saemin Chang, Ji-Nam Kang, Si-Myung Lee (2020) The complete chloroplast genomes of the medicinal plants, Senna tora and Senna occidentalis species. Mitochondrial DNA Part B 5, 1673-1674. (*corresponding author)
Sang-Ho Kang*, Woo-Haeng Lee, Chang-Muk Lee, Joon-Soo Sim, So Youn Won, S-Ra Han, Soo-Jin Kwon, Jung Sun Kim, Chang-Kug Kim*, Tae-Jin Oh* (2020) De novo transcriptome sequence of Senna tora provides insights into anthraquinone biosynthesis. PLoS One (In Press) (*corresponding author).
Sang-Ho Kang*, Ramesh Prasad Pandey, Chang-Muk Lee, Joon-Soo Sim, Jin-Tae Jeong, Beom-Soon Choi, Myunghee Jung, So Youn Won, Tae-Jin Oh, Yeisoo Yu, Nam-Hoon Kim, Ok Ran Lee, Tae-Ho Lee, Puspalata Bashyal, Tae-Su Kim, Chang-Kug Kim, Jung Sun Kim, Byoung Ohg Ahn, Seung Y. Rhee*, Jae Kyung Sohng* (2020) Genome-Enabled Discovery of Anthraquinone Biosynthesis in Senna tora. (Submitted) (*corresponding author)
The bellflower (Platycodon grandiflorus) belongs to the bellflower family (Campanulaceae). In East Asia, its root has been used for over 2000 years as a traditional medicine and as a popular food additive with therapeutic effects on bronchitis, asthma, tonsillitis, and pulmonary tuberculosis. The most important bioactive components of P. grandiflorus are platycosides, especially platycodin D. Here we present a whole-genome assembly of P. grandiflorus accompanied by its transcriptome and methylome data. The genome-wide analysis reveals how P. grandiflorus evolved as a medicinal herb specialized in platycoside biosynthesis. In particular, the genes related to triterpenoid saponin biosynthesis provide clues to the species-specific selection of key genes for platycoside biosynthesis and to their function.
Genome assembly and annotation of P. grandiflorus
Jangbaek-doraji, a cultivar of P. grandiflorus, was used for whole-genome sequencing after four generations of self-fertilization. Karyotype analysis confirmed the diploid genome of P. grandiflorus, with four metacentric and five sub-metacentric chromosome pairs. The k-mer analysis estimated the genome size to be approximately 694.4 Mb. We produced 474.5x sequencing coverage of Illumina short reads and 5.7x coverage of TruSeq synthetic long reads (TSLRs). A hybrid assembly resulted in a 680.1 Mb draft genome in 4,816 scaffolds, covering 98.4% of the estimated P. grandiflorus genome size. The assembly captured 92.6% of the complete BUSCOs, with few fragmented or missing BUSCO genes, indicating a high-quality assembly. Assembly quality was also assessed by mapping short reads back to the assembly; 98% of them aligned successfully, with 25x sequencing coverage depth. The genome annotation of P. grandiflorus enabled us to identify candidate genes underlying secondary metabolite biosynthesis. In the annotation, we predicted 40,018 non-redundant protein-coding genes with an average length of 5,019 bp from the repeat-masked genome using evidence-driven gene prediction methods coupled with ab initio prediction.
Scolopendra subspinipes mutilans has been used in oriental medicine as a remedy for paralysis and arthritis since ancient times, and it is still widely used as an herbal medicine. About 3,000 species are distributed worldwide; among them, the domestic centipedes are known to comprise 44 species in 9 families and 4 orders, but their classification system and ecology have not been studied sufficiently. We therefore set out to identify genes unique to the scolopendrid by deciphering its genome, and to provide resource data for scientific and systematic pharmacological investigation through functional analysis of these genes.
The scolopendrid genome was found to be about 1.2 Gb in size, organized in 28 chromosomes (2n = 28). High-throughput next-generation sequencing (NGS) was performed on the Illumina MiSeq and NextSeq platforms, together with long-read sequencing on the PacBio RSII. As a result, a total of 525 Gb (442X) of Illumina sequence (234 Gb from 10 paired-end libraries and 292 Gb from 12 mate-pair libraries) and 51 Gb (45X) of PacBio reads were produced as genomic sequence.
The final scolopendrid genome assembly consisted of 1.1 Gb of sequence in 60,045 contigs. To produce it, sequences derived from organelles such as mitochondria and genomic sequences of foreign species were removed, and de novo assembly was performed using CLC Assembly Cell. The N50 of the assembled genome sequence is 106,746 bp and its GC content is 32.32%. The proportion of core genes detected with CEGMA was 99.19%, indicating a highly complete assembly.
For gene prediction, transcript sequences from 12 tissues and developmental stages were produced, and protein sequences of 10 related species were used. A final set of 21,501 genes was predicted, of which 18,219 (84.7%) were assigned biological functions. The species most similar to the scolopendrid were identified as the European land centipede (Strigamia maritima) and the velvet spider (Stegodyphus mimosarum).
The scolopendrid has one of the largest genomes and gene sets among the centipedes characterized to date and is a representative centipede. In addition, evolutionary analysis using genes across the genome indicated that it diverged from insects about 400 million years ago and from arachnids about 600 million years ago. Interestingly, many of the toxin genes found in arachnids were also found in centipedes, a feature that distinguishes them from insects. Among them, scolopendrasin Ⅶ, one of the toxin genes of the scolopendrid, was confirmed to have good pharmacological effects in connection with autoimmune diseases in the human immune system.
Hwang Jae Sam (email: email@example.com)
Lee Joon Ha (email: firstname.lastname@example.org)
Yoo WG, Lee JH, Shin Y, Shim JY, Jung M, Kang BC, Oh J, Seong J, Lee HK, Kong HS, Song KD, Yun EY, Kim IW, Kwon YN, Lee DG, Hwang UW, Park J, Hwang JS. Antimicrobial peptides in the centipede Scolopendra subspinipes mutilans. Functional & Integrative Genomics. 2014 Jun; 14(2): 275-283.