Hello, I am trying to forge a genome for a non-model organism.
I have generated the following seed file (in DCF format), using the available Athaliana seed file as an example.
Package: BSgenome.Bnapus.NCBI.Bra_napus_v2.0
Title: Full genome sequences for Brassica napus (Bra_napus_v2.0)
Description: Full genome sequences for Brassica napus as provided by NCBI (Bra_napus_v2.0, RefSeq assembly accession: GCF_000686985.2)
Version: 1.0.0
organism: Brassica napus
common_name: Rape
genome: Bra_napus_v2.0
provider: NCBI
release_date: 2017/09/22
source_url: https://www.ncbi.nlm.nih.gov/assembly/GCF_000686985.2
organism_biocview: Brassica_napus
seqnames: c("GCF_000686985.2_Bra_napus_v2.0_genomic.fa")
BSgenomeObjname: Bnapus
SrcDataFiles: GCF_000686985.2_Bra_napus_v2.0_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Brassica_napus/latest_assembly_versions/GCF_000686985.2_Bra_napus_v2.0/GCF_000686985.2_Bra_napus_v2.0_genomic.fna.gz
PkgExamples: genome[["1"]]
seqs_srcdir: /home/edytas/scratch/bnapus_genome/BSgenomeForge/seqs_srcdir
ondisk_seq_format: fa
I have created a seqs_srcdir directory which contains my seed file as well as the genome fasta file GCF_000686985.2_Bra_napus_v2.0_genomic.fa. When I try to forge the genome I receive the error below. Any help or insights would be very much appreciated.
forgeBSgenomeDataPkg("bnapus_seed.dcf")
Error in .make_Seqinfo_from_genome(genome) :
"Bra_napus_v2.0" is not a registered NCBI assembly or UCSC genome (use
registered_NCBI_assemblies() or registered_UCSC_genomes() to list the
NCBI or UCSC assemblies/genomes currently registered in the
GenomeInfoDb package)
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] BiocParallel_1.24.1 GenomicAlignments_1.26.0
[3] SummarizedExperiment_1.20.0 MatrixGenerics_1.2.0
[5] matrixStats_0.57.0 GenomicFeatures_1.42.1
[7] AnnotationDbi_1.52.0 Biobase_2.50.0
[9] Rsamtools_2.6.0 BiocManager_1.30.10
[11] BSgenome_1.58.0 rtracklayer_1.50.0
[13] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[15] stringr_1.4.0 Biostrings_2.58.0
[17] XVector_0.30.0 IRanges_2.24.1
[19] S4Vectors_0.28.1 BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] progress_1.2.2 tidyselect_1.1.0 purrr_0.3.4
[4] lattice_0.20-41 vctrs_0.3.5 generics_0.1.0
[7] BiocFileCache_1.14.0 blob_1.2.1 XML_3.99-0.5
[10] rlang_0.4.9 pillar_1.4.7 glue_1.4.2
[13] DBI_1.1.0 rappdirs_0.3.1 bit64_4.0.5
[16] dbplyr_2.0.0 GenomeInfoDbData_1.2.4 lifecycle_0.2.0
[19] zlibbioc_1.36.0 memoise_1.1.0 biomaRt_2.46.0
[22] curl_4.3 Rcpp_1.0.5 openssl_1.4.3
[25] DelayedArray_0.16.0 bit_4.0.4 hms_0.5.3
[28] askpass_1.1 digest_0.6.27 stringi_1.5.3
[31] dplyr_1.0.2 grid_4.0.2 tools_4.0.2
[34] bitops_1.0-6 magrittr_2.0.1 RCurl_1.98-1.2
[37] RSQLite_2.2.1 tibble_3.0.4 crayon_1.3.4
[40] pkgconfig_2.0.3 ellipsis_0.3.1 Matrix_1.2-18
[43] xml2_1.3.2 prettyunits_1.1.1 assertthat_0.2.1
[46] httr_1.4.2 R6_2.5.0 compiler_4.0.2
Thank you. I was able to successfully forge and install the genome for Bnap by following your instructions.
Interestingly, having the field circ_seqs: "MT" in the seed file allows me to forge the genome even without GenomeInfoDb 1.26.7 (release).
However, when this field is missing I receive the following error while trying to forge:
Error in .make_Seqinfo_from_genome(genome) :
"Bra_napus_v2.0" is not a registered NCBI assembly or UCSC genome (use
registered_NCBI_assemblies() or registered_UCSC_genomes() to list the
NCBI or UCSC assemblies/genomes currently registered in the
GenomeInfoDb package)
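For reference, a sketch of how the two relevant fields sit next to each other in the seed file. Both values are the ones quoted in this thread; which sequence names belong in circ_seqs depends on the assembly and on the names used in seqnames, so treat "MT" here as the value that worked in this case rather than a complete list.

```
genome: Bra_napus_v2.0
circ_seqs: "MT"
```

With circ_seqs present, the forge step no longer needs to look the genome up among the registered assemblies.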
This is expected. forgeBSgenomeDataPkg() needs to know which sequences are circular and which are not. So if you don't specify the circ_seqs field, it tries to get this information by calling GenomeInfoDb::Seqinfo(genome="Bra_napus_v2.0"). However, this only works for NCBI assemblies and UCSC genomes that are _registered_ in GenomeInfoDb.
Note that Bra_napus_v2.0 has 2 circular sequences. getChromInfoFromNCBI() is the workhorse behind Seqinfo(genome="Bra_napus_v2.0").
Cheers,
H.
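
A quick way to check a seed file for this before forging is sketched below in base R. It only tests whether the seed declares circ_seqs, so it says nothing about whether the genome is registered in GenomeInfoDb. The helper name seed_declares_circ_seqs and the throwaway seed fragment are made up for illustration and are not part of BSgenome.

```r
# Hypothetical helper (not part of BSgenome): TRUE if the seed file
# declares a circ_seqs field, FALSE otherwise.
seed_declares_circ_seqs <- function(seed_path) {
  # read.dcf() returns a character matrix with one column per field
  seed <- read.dcf(seed_path)
  "circ_seqs" %in% colnames(seed) && !is.na(seed[1, "circ_seqs"])
}

# Demo on a throwaway seed fragment using values quoted in this thread
seed_file <- tempfile(fileext = ".dcf")
writeLines(c(
  "Package: BSgenome.Bnapus.NCBI.Bra_napus_v2.0",
  "genome: Bra_napus_v2.0",
  "circ_seqs: \"MT\""
), seed_file)

seed_declares_circ_seqs(seed_file)
```

If the helper returns FALSE for a genome that is not a registered NCBI assembly or UCSC genome, adding circ_seqs to the seed file is the fix described above.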