Question

BSGenome Forge for non-model organism

0

Entering edit mode

sieminsk • 0

@3fc834aa

Last seen 3.3 years ago

Canada

Hello, I am trying to forge a genome for a non-model organism.

I have generated the following seed file (in dcf format) and using the available Athaliana seed file as an example.

Package: BSgenome.Bnapus.NCBI.Bra_napus_v2.0

Title: Full genome sequences for Brassica napus (Bra_napus_v2.0)

Description: Full genome sequences for Brassica napus as provided by NCBI (Bra_napus_v2.0, RefSeq assembly accession: GCF_000686985.2)

Version: 1.0.0

organism: Brassica napus

common_name: Rape

genome: Bra_napus_v2.0

provider: NCBI

release_date: 2017/09/22

source_url: https://www.ncbi.nlm.nih.gov/assembly/GCF_000686985.2

organism_biocview: Brassica_napus

seqnames: c("GCF_000686985.2_Bra_napus_v2.0_genomic.fa")

BSgenomeObjname: Bnapus

SrcDataFiles: GCF_000686985.2_Bra_napus_v2.0_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Brassica_napus/latest_assembly_versions/GCF_000686985.2_Bra_napus_v2.0/GCF_000686985.2_Bra_napus_v2.0_genomic.fna.gz

PkgExamples: genome[["1"]]

seqs_srcdir: /home/edytas/scratch/bnapus_genome/BSgenomeForge/seqs_srcdir

ondisk_seq_format: fa

I have created a seqs_srcdir directory which contains my seed file as well as the genome fasta file GCF_000686985.2_Bra_napus_v2.0_genomic.fa. When I try to forge the genome I receive the error below. Any help or insights would be very much appreciated.


forgeBSgenomeDataPkg("bnapus_seed.dcf")
Error in .make_Seqinfo_from_genome(genome) : 
  "Bra_napus_v2.0" is not a registered NCBI assembly or UCSC genome (use
  registered_NCBI_assemblies() or registered_UCSC_genomes() to list the
  NCBI or UCSC assemblies/genomes currently registered in the
  GenomeInfoDb package)

sessionInfo( )

``` R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] BiocParallel_1.24.1 GenomicAlignments_1.26.0
[3] SummarizedExperiment_1.20.0 MatrixGenerics_1.2.0
[5] matrixStats_0.57.0 GenomicFeatures_1.42.1
[7] AnnotationDbi_1.52.0 Biobase_2.50.0
[9] Rsamtools_2.6.0 BiocManager_1.30.10
[11] BSgenome_1.58.0 rtracklayer_1.50.0
[13] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[15] stringr_1.4.0 Biostrings_2.58.0
[17] XVector_0.30.0 IRanges_2.24.1
[19] S4Vectors_0.28.1 BiocGenerics_0.36.0

loaded via a namespace (and not attached): [1] progress_1.2.2 tidyselect_1.1.0 purrr_0.3.4
[4] lattice_0.20-41 vctrs_0.3.5 generics_0.1.0
[7] BiocFileCache_1.14.0 blob_1.2.1 XML_3.99-0.5
[10] rlang_0.4.9 pillar_1.4.7 glue_1.4.2
[13] DBI_1.1.0 rappdirs_0.3.1 bit64_4.0.5
[16] dbplyr_2.0.0 GenomeInfoDbData_1.2.4 lifecycle_0.2.0
[19] zlibbioc_1.36.0 memoise_1.1.0 biomaRt_2.46.0
[22] curl_4.3 Rcpp_1.0.5 openssl_1.4.3
[25] DelayedArray_0.16.0 bit_4.0.4 hms_0.5.3
[28] askpass_1.1 digest_0.6.27 stringi_1.5.3
[31] dplyr_1.0.2 grid_4.0.2 tools_4.0.2
[34] bitops_1.0-6 magrittr_2.0.1 RCurl_1.98-1.2
[37] RSQLite_2.2.1 tibble_3.0.4 crayon_1.3.4
[40] pkgconfig_2.0.3 ellipsis_0.3.1 Matrix_1.2-18
[43] xml2_1.3.2 prettyunits_1.1.1 assertthat_0.2.1
[46] httr_1.4.2 R6_2.5.0 compiler_4.0.2

BSGenomeForge BSGenome • 2.0k views

ADD COMMENT • link updated 3.7 years ago by Hervé Pagès 16k • written 3.7 years ago by sieminsk • 0

score 1 · Answer 1 · 2021-04-08

1

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 11 hours ago

Seattle, WA, United States

Hi,

Ideally you'd want to convert the big FASTA file provided by NCBI to the 2bit format, and use that to forge your BSgenome data package. There are some examples of how to do this in the inst/extdata/GentlemanLab/BSgenome.Amellifera.NCBI.AmelHAv3.1-tools/ and inst/extdata/GentlemanLab/BSgenome.Gmellonella.NCBI.ASM364042v2-tools/ folders in the BSgenome package (see the fasta_to_sorted_2bit.R script in those folders).

I've registered Bra_napus_v2.0 in GenomeInfoDb 1.26.7 (release) and GenomeInfoDb 1.27.11 (devel) so this should help. This assembly has 1471 sequences so specifying the seqnames in the seed file would be tricky. Both versions should become available via BiocManager::install() in about 36h.

You'll need to specify seqfile_name in your seed file. Take a look at the seed files that were used to forge BSgenome.Amellifera.NCBI.AmelHAv3.1 and BSgenome.Gmellonella.NCBI.ASM364042v2. They are in the inst/extdata/GentlemanLab/ folder. Your seed file should look very similar.

Cheers,

H.

ADD COMMENT • link 3.7 years ago Hervé Pagès 16k

0

Entering edit mode

Thank you. I was able to successfully forge and install the genome for Bnap by following your instructions.

Interestingly, having the following field: circ_seqs: "MT" in the seed file allows me to forge the genome without having the GenomeInfoDb 1.26.7 (release).

However, when this field is missing I receive the following error while trying to forge:

Error in .make_Seqinfo_from_genome(genome) : "Bra_napus_v2.0" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)

ADD REPLY • link 3.7 years ago sieminsk • 0

1

Entering edit mode

This is expected. forgeBSgenomeDataPkg() needs to know which sequences are circular and which are not. So if you don't specify the circ_seqs field it tries to get this information by calling GenomeInfoDb::Seqinfo(genome="Bra_napus_v2.0"). However this only works for NCBI assemblies and UCSC genomes that are _registered_ in GenomeInfoDb.

Note that Bra_napus_v2.0 has 2 circular sequences:

library(GenomeInfoDb)
ci <- getChromInfoFromNCBI("Bra_napus_v2.0")
subset(ci, circular)
#    SequenceName       SequenceRole AssignedMolecule GenBankAccn Relationship
# 20           MT assembled-molecule               MT        <NA>           <>
# 21         Pltd assembled-molecule             Pltd        <NA>           <>
#     RefSeqAccn AssemblyUnit SequenceLength UCSCStyleName circular
# 20 NC_008285.1  non-nuclear         221853          <NA>     TRUE
# 21 NC_016734.1  non-nuclear         152860          <NA>     TRUE

getChromInfoFromNCBI() is the workhorse behind Seqinfo(genome="Bra_napus_v2.0").

Cheers,

H.

ADD REPLY • link 3.7 years ago Hervé Pagès 16k

score 0 · Answer 2 · 2021-04-07

Try adding the seqnames argument to your seed file. The vignette says it's optional, but there is this part to forgeBSgenomeDataPkg:

 seqnames <- x@seqnames
    if (!is.na(seqnames)) {
        .seqnames <- eval(parse(text = seqnames))
    }
    else {
        if (is.na(x@seqfile_name)) {
            .seqnames <- seqlevels(Seqinfo(genome = x@genome))
        }
        else {
            .seqnames <- NULL

And you aren't passing in a seqnames argument, nor a seqfile_name argument, so it's gonna call Seqinfo on your genome, which does this:

> Seqinfo(genome = "Bra_napus_v2.0")
Error in .make_Seqinfo_from_genome(genome) : 
  "Bra_napus_v2.0" is not a registered NCBI assembly or UCSC genome (use
  registered_NCBI_assemblies() or registered_UCSC_genomes() to list the
  NCBI or UCSC assemblies/genomes currently registered in the
  GenomeInfoDb package)

Which you already saw...