Hello, I am working with a lesser-known fungal species that has a GenBank reference assembly, but it does not have RefSeq information or UCSC annotations. I can generate a txdb object from the .GFF/.GTF. However, I've unsuccessfully tried to generate orgdb & BSgenome objects. Ideally, I would like to access/extract .fasta sequence matched to GRanges for a variety of annotations such as gene/exon. Is it possible to generate orgdb and BSgenome objects without refseq for this organism?
chrom<-getChromInfoFromNCBI("GCA_007735645.1")
seqnames<-chrom$GenBankAccn
seqlengths<-chrom$SequenceLength
iscircular<-chrom$circular
s<-Seqinfo(seqnames=seqnames, seqlengths=seqlengths, isCircular=iscircular, genome="GCA_007735645.1")
tx<-makeTxDbFromGFF("C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\V-eff\\GCA_007735645.1\\genomic.gff",
format="auto",chrominfo=s,
dataSource="GCA_007735645.1",
organism="Venturia.effusa",
taxonomyId=50376,dbxrefTag="GENEID")
> tx
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: GCA_007735645.1
# Organism: Venturia.effusa
# Taxonomy ID: 50376
# miRBase build ID: NA
# Genome: GCA_007735645.1
# Nb of transcripts: 10897
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2023-10-23 08:29:29 -0400 (Mon, 23 Oct 2023)
# GenomicFeatures version at creation time: 1.52.2
# RSQLite version at creation time: 2.3.1
# DBSCHEMAVERSION: 1.2
setwd("C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\db1\\src_seqdir")
seed<- read.dcf("C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\db1\\src_seqdir\\seed_file.txt")
dcf<-write.dcf(seed,file="C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\db1\\src_seqdir\\seed.dcf")
library(BSgenome)
unlink(c("BSgenome.Veffusa.NCBI"),recursive=TRUE,force=TRUE)
BSG<-forgeBSgenomeDataPkg("seed.dcf")
Error in .make_Seqinfo_from_genome(genome) :
"GCA_007735645.1" is not a registered NCBI assembly or UCSC genome (use
registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC
assemblies/genomes currently registered in the GenomeInfoDb package)
makeOrgPackageFromNCBI(version = "0.1",
+ author = "M T <bnetscrname@aol.com>",
+ maintainer = "Bnet <netname@aol.com>",
+ outputDir = "C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\db1",
+ tax_id = 50376,genus = "Venturia",species = "effusa")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
Error in prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache, :
no information found for species with tax id 50376
sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] grid stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] AnnotationForge_1.42.2 karyoploteR_1.26.0 regioneR_1.32.0
[4] gplots_3.1.3 biovizBase_1.48.0 Gviz_1.44.2
[7] AnnotationHub_3.8.0 BiocFileCache_2.8.0 dbplyr_2.3.4
[10] GenomicFeatures_1.52.2 AnnotationDbi_1.62.2 GenomicAlignments_1.36.0
[13] SummarizedExperiment_1.30.2 Biobase_2.60.0 MatrixGenerics_1.12.3
[16] matrixStats_1.0.0 ggbio_1.48.0 Rsamtools_2.16.0
[19] Biostrings_2.68.1 XVector_0.40.0 ggplot2_3.4.2
[22] rtracklayer_1.60.1 GenomicRanges_1.52.0 GenomeInfoDb_1.36.4
[25] IRanges_2.34.1 S4Vectors_0.38.2 BiocGenerics_0.46.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.15.0 magrittr_2.0.3
[4] rmarkdown_2.25 BiocIO_1.10.0 zlibbioc_1.46.0
[7] vctrs_0.6.3 memoise_2.0.1 RCurl_1.98-1.12
[10] base64enc_0.1-3 htmltools_0.5.6.1 S4Arrays_1.0.6
[13] progress_1.2.2 curl_5.1.0 Formula_1.2-5
[16] KernSmooth_2.23-21 htmlwidgets_1.6.2 plyr_1.8.9
[19] cachem_1.0.8 mime_0.12 lifecycle_1.0.3
[22] pkgconfig_2.0.3 Matrix_1.5-4.1 R6_2.5.1
[25] fastmap_1.1.1 GenomeInfoDbData_1.2.10 shiny_1.7.5
[28] digest_0.6.33 colorspace_2.1-0 GGally_2.1.2
[31] reshape_0.8.9 OrganismDbi_1.42.0 bezier_1.1.2
[34] Hmisc_5.1-1 RSQLite_2.3.1 filelock_1.0.2
[37] fansi_1.0.4 httr_1.4.7 abind_1.4-5
[40] compiler_4.3.1 bit64_4.0.5 withr_2.5.0
[43] htmlTable_2.4.1 backports_1.4.1 BiocParallel_1.34.2
[46] DBI_1.1.3 biomaRt_2.56.1 rappdirs_0.3.3
[49] DelayedArray_0.26.7 rjson_0.2.21 caTools_1.18.2
[52] gtools_3.9.4 tools_4.3.1 foreign_0.8-84
[55] interactiveDisplayBase_1.38.0 httpuv_1.6.11 nnet_7.3-19
[58] glue_1.6.2 restfulr_0.0.15 promises_1.2.1
[61] checkmate_2.2.0 cluster_2.1.4 reshape2_1.4.4
[64] generics_0.1.3 gtable_0.3.3 BSgenome_1.68.0
[67] ensembldb_2.24.1 data.table_1.14.8 hms_1.1.3
[70] xml2_1.3.5 utf8_1.2.3 BiocVersion_3.17.1
[73] pillar_1.9.0 stringr_1.5.0 later_1.3.1
[76] dplyr_1.1.3 lattice_0.21-8 deldir_1.0-9
[79] bit_4.0.5 tidyselect_1.2.0 RBGL_1.76.0
[82] knitr_1.44 gridExtra_2.3 ProtGenerics_1.32.0
[85] xfun_0.40 stringi_1.7.12 lazyeval_0.2.2
[88] yaml_2.3.7 evaluate_0.22 codetools_0.2-19
[91] interp_1.1-4 tibble_3.2.1 BiocManager_1.30.21
[94] graph_1.78.0 cli_3.6.1 rpart_4.1.19
[97] xtable_1.8-4 munsell_0.5.0 dichromat_2.0-0.1
[100] Rcpp_1.0.11 png_0.1-8 XML_3.99-0.14
[103] parallel_4.3.1 ellipsis_0.3.2 blob_1.2.4
[106] prettyunits_1.2.0 jpeg_0.1-10 latticeExtra_0.6-30
[109] AnnotationFilter_1.24.0 bitops_1.0-7 VariantAnnotation_1.46.0
[112] scales_1.2.1 crayon_1.5.2 bamsignals_1.32.0
[115] rlang_1.1.1 KEGGREST_1.40.1
@mat149 Just to let you know that you can replace:
with:
This will give you the standard seqnames which will be compatible with the seqnames of the BSgenome.Veffusa.NCBI.ASM773564v1 package obtained by using Jim's code above.
Best,
H.
In revisiting the OrgDb construction, do genes have to be in ENTREZ format for this to be operable? Given the complete absence of RefSeq data, there are no Entrez gene identifiers for this organism. However, I have locus_tag identifiers matched to GO id's that I curated from alternate databases, but I am unsure 'how to' & 'if I can' proceed given the error:
See the first argument for
makeOrgPackage
, which is the ellipsis argument (pointed out in the error you got). Also see the examples.