Question

Need help generating BSgenome and orgdb

0

Entering edit mode

mat149 ▴ 80

@mat149-11450

Last seen 3 months ago

United States

Hello, I am working with a lesser-known fungal species that has a GenBank reference assembly, but it does not have RefSeq information or UCSC annotations. I can generate a txdb object from the .GFF/.GTF. However, I've unsuccessfully tried to generate orgdb & BSgenome objects. Ideally, I would like to access/extract .fasta sequence matched to GRanges for a variety of annotations such as gene/exon. Is it possible to generate orgdb and BSgenome objects without refseq for this organism?

chrom<-getChromInfoFromNCBI("GCA_007735645.1")
seqnames<-chrom$GenBankAccn
seqlengths<-chrom$SequenceLength
iscircular<-chrom$circular
s<-Seqinfo(seqnames=seqnames, seqlengths=seqlengths, isCircular=iscircular, genome="GCA_007735645.1")

tx<-makeTxDbFromGFF("C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\V-eff\\GCA_007735645.1\\genomic.gff",
    format="auto",chrominfo=s,
    dataSource="GCA_007735645.1",
    organism="Venturia.effusa",
    taxonomyId=50376,dbxrefTag="GENEID")
> tx
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: GCA_007735645.1
# Organism: Venturia.effusa
# Taxonomy ID: 50376
# miRBase build ID: NA
# Genome: GCA_007735645.1
# Nb of transcripts: 10897
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2023-10-23 08:29:29 -0400 (Mon, 23 Oct 2023)
# GenomicFeatures version at creation time: 1.52.2
# RSQLite version at creation time: 2.3.1
# DBSCHEMAVERSION: 1.2

setwd("C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\db1\\src_seqdir")

seed<- read.dcf("C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\db1\\src_seqdir\\seed_file.txt")

dcf<-write.dcf(seed,file="C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\db1\\src_seqdir\\seed.dcf")

library(BSgenome)
unlink(c("BSgenome.Veffusa.NCBI"),recursive=TRUE,force=TRUE)
BSG<-forgeBSgenomeDataPkg("seed.dcf")

Error in .make_Seqinfo_from_genome(genome) : 
  "GCA_007735645.1" is not a registered NCBI assembly or UCSC genome (use
  registered_NCBI_assemblies() or registered_UCSC_genomes() to list the NCBI or UCSC
  assemblies/genomes currently registered in the GenomeInfoDb package)


makeOrgPackageFromNCBI(version = "0.1",
+ author = "M T <bnetscrname@aol.com>",
+ maintainer = "Bnet <netname@aol.com>",
+ outputDir = "C:\\Users\\mat14\\OneDrive\\Desktop\\wd\\db1",
+ tax_id = 50376,genus = "Venturia",species = "effusa")

If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description

Error in prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache,  : 
  no information found for species with tax id 50376


sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationForge_1.42.2      karyoploteR_1.26.0          regioneR_1.32.0            
 [4] gplots_3.1.3                biovizBase_1.48.0           Gviz_1.44.2                
 [7] AnnotationHub_3.8.0         BiocFileCache_2.8.0         dbplyr_2.3.4               
[10] GenomicFeatures_1.52.2      AnnotationDbi_1.62.2        GenomicAlignments_1.36.0   
[13] SummarizedExperiment_1.30.2 Biobase_2.60.0              MatrixGenerics_1.12.3      
[16] matrixStats_1.0.0           ggbio_1.48.0                Rsamtools_2.16.0           
[19] Biostrings_2.68.1           XVector_0.40.0              ggplot2_3.4.2              
[22] rtracklayer_1.60.1          GenomicRanges_1.52.0        GenomeInfoDb_1.36.4        
[25] IRanges_2.34.1              S4Vectors_0.38.2            BiocGenerics_0.46.0        

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3            rstudioapi_0.15.0             magrittr_2.0.3               
  [4] rmarkdown_2.25                BiocIO_1.10.0                 zlibbioc_1.46.0              
  [7] vctrs_0.6.3                   memoise_2.0.1                 RCurl_1.98-1.12              
 [10] base64enc_0.1-3               htmltools_0.5.6.1             S4Arrays_1.0.6               
 [13] progress_1.2.2                curl_5.1.0                    Formula_1.2-5                
 [16] KernSmooth_2.23-21            htmlwidgets_1.6.2             plyr_1.8.9                   
 [19] cachem_1.0.8                  mime_0.12                     lifecycle_1.0.3              
 [22] pkgconfig_2.0.3               Matrix_1.5-4.1                R6_2.5.1                     
 [25] fastmap_1.1.1                 GenomeInfoDbData_1.2.10       shiny_1.7.5                  
 [28] digest_0.6.33                 colorspace_2.1-0              GGally_2.1.2                 
 [31] reshape_0.8.9                 OrganismDbi_1.42.0            bezier_1.1.2                 
 [34] Hmisc_5.1-1                   RSQLite_2.3.1                 filelock_1.0.2               
 [37] fansi_1.0.4                   httr_1.4.7                    abind_1.4-5                  
 [40] compiler_4.3.1                bit64_4.0.5                   withr_2.5.0                  
 [43] htmlTable_2.4.1               backports_1.4.1               BiocParallel_1.34.2          
 [46] DBI_1.1.3                     biomaRt_2.56.1                rappdirs_0.3.3               
 [49] DelayedArray_0.26.7           rjson_0.2.21                  caTools_1.18.2               
 [52] gtools_3.9.4                  tools_4.3.1                   foreign_0.8-84               
 [55] interactiveDisplayBase_1.38.0 httpuv_1.6.11                 nnet_7.3-19                  
 [58] glue_1.6.2                    restfulr_0.0.15               promises_1.2.1               
 [61] checkmate_2.2.0               cluster_2.1.4                 reshape2_1.4.4               
 [64] generics_0.1.3                gtable_0.3.3                  BSgenome_1.68.0              
 [67] ensembldb_2.24.1              data.table_1.14.8             hms_1.1.3                    
 [70] xml2_1.3.5                    utf8_1.2.3                    BiocVersion_3.17.1           
 [73] pillar_1.9.0                  stringr_1.5.0                 later_1.3.1                  
 [76] dplyr_1.1.3                   lattice_0.21-8                deldir_1.0-9                 
 [79] bit_4.0.5                     tidyselect_1.2.0              RBGL_1.76.0                  
 [82] knitr_1.44                    gridExtra_2.3                 ProtGenerics_1.32.0          
 [85] xfun_0.40                     stringi_1.7.12                lazyeval_0.2.2               
 [88] yaml_2.3.7                    evaluate_0.22                 codetools_0.2-19             
 [91] interp_1.1-4                  tibble_3.2.1                  BiocManager_1.30.21          
 [94] graph_1.78.0                  cli_3.6.1                     rpart_4.1.19                 
 [97] xtable_1.8-4                  munsell_0.5.0                 dichromat_2.0-0.1            
[100] Rcpp_1.0.11                   png_0.1-8                     XML_3.99-0.14                
[103] parallel_4.3.1                ellipsis_0.3.2                blob_1.2.4                   
[106] prettyunits_1.2.0             jpeg_0.1-10                   latticeExtra_0.6-30          
[109] AnnotationFilter_1.24.0       bitops_1.0-7                  VariantAnnotation_1.46.0     
[112] scales_1.2.1                  crayon_1.5.2                  bamsignals_1.32.0            
[115] rlang_1.1.1                   KEGGREST_1.40.1

OrgDb BSgenome • 1.5k views

ADD COMMENT • link 2.2 years ago • updated 23 months ago mat149 ▴ 80

score 1 · Answer 1 · 2023-10-23

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 14 days ago

United States

The first is easy enough.

> library(BSgenomeForge)

> forgeBSgenomeDataPkgFromNCBI("GCA_007735645.1", "me <me@mine.org>", organism = "Venturia effusa", circ_seqs = character(0))
trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/007/735/645/GCA_007735645.1_ASM773564v1/GCA_007735645.1_ASM773564v1_genomic.fna.gz'
Content type 'application/x-gzip' length 13929394 bytes (13.3 MB)
downloaded 13.3 MB

Creating package in ./BSgenome.Veffusa.NCBI.ASM773564v1 
> install.packages("BSgenome.Veffusa.NCBI.ASM773564v1/", repos = NULL, type = "source")
<snip>
> Veffusa
| BSgenome object
| - organism: Venturia effusa
| - provider: NCBI
| - genome: ASM773564v1
| - 21 sequence(s):
|     1   2   3   4   5   6   7   8  
|     9   10  11  12  13  14  15  16 
|     17  18  19  20  MT             
| 
| Tips: call 'seqnames()' on the
| object to get all the sequence
| names, call 'seqinfo()' to get the
| full sequence info, use the '$' or
| '[[' operator to access a given
| sequence, see '?BSgenome' for more
| information.
> seqinfo(Veffusa)
Seqinfo object with 21 sequences from ASM773564v1 genome:
  seqnames seqlengths isCircular
  1           4119637      FALSE
  2           2929485      FALSE
  3           2900960      FALSE
  4           2873850      FALSE
  5           2784243      FALSE
  ...             ...        ...
  17          1626195      FALSE
  18          1476534      FALSE
  19          1241440      FALSE
  20           584679      FALSE
  MT           139035      FALSE
                genome
  1        ASM773564v1
  2        ASM773564v1
  3        ASM773564v1
  4        ASM773564v1
  5        ASM773564v1
  ...              ...
  17       ASM773564v1
  18       ASM773564v1
  19       ASM773564v1
  20       ASM773564v1
  MT       ASM773564v1

The OrgDb is going to be harder, as there apparently are no genes at either NCBI nor at Ensembl. You might be able to get something from JGI, but that's beyond the scope of this support site.

ADD COMMENT • link 2.2 years ago James W. MacDonald 68k

1

Entering edit mode

@mat149 Just to let you know that you can replace:

chrom<-getChromInfoFromNCBI("GCA_007735645.1")
seqnames<-chrom$GenBankAccn
seqlengths<-chrom$SequenceLength
iscircular<-chrom$circular
s<-Seqinfo(seqnames=seqnames, seqlengths=seqlengths, isCircular=iscircular, genome="GCA_007735645.1")

with:

s <- getChromInfoFromNCBI("GCA_007735645.1", as.Seqinfo=TRUE)

This will give you the standard seqnames which will be compatible with the seqnames of the BSgenome.Veffusa.NCBI.ASM773564v1 package obtained by using Jim's code above.

Best,

H.

ADD REPLY • link 2.2 years ago Hervé Pagès 16k

0

Entering edit mode

In revisiting the OrgDb construction, do genes have to be in ENTREZ format for this to be operable? Given the complete absence of RefSeq data, there are no Entrez gene identifiers for this organism. However, I have locus_tag identifiers matched to GO id's that I curated from alternate databases, but I am unsure 'how to' & 'if I can' proceed given the error:

go<-read.delim("C:\\Users\\MT\\Desktop\\wd\\dump\\veff-GO.txt",header=TRUE)
head(go)

           GID
1 FKW77_009824
2 FKW77_010604
3 FKW77_006092
4 FKW77_000504
5 FKW77_002416
6 FKW77_003967
                                                                                                                                              GO
1 GO:0003855; GO:0003856; GO:0003866; GO:0004764; GO:0004765; GO:0005524; GO:0005737; GO:0008652; GO:0009073; GO:0009423; GO:0016310; GO:0046872
2                                                 GO:0000103; GO:0004020; GO:0004781; GO:0005524; GO:0005737; GO:0009086; GO:0019344; GO:0070814
3                                     GO:0000166; GO:0005634; GO:0005737; GO:0009117; GO:0009204; GO:0035870; GO:0036220; GO:0036222; GO:0046872
4                                     GO:0002143; GO:0004792; GO:0005524; GO:0005829; GO:0006777; GO:0008641; GO:0046872; GO:0061604; GO:0061605
5                                     GO:0005737; GO:0006569; GO:0019805; GO:0030170; GO:0030429; GO:0034354; GO:0043420; GO:0061981; GO:0097053
6                                                             GO:0000139; GO:0004609; GO:0005509; GO:0005795; GO:0006646; GO:0010008; GO:0016540
                evidence
1 Inferred from homology
2 Inferred from homology
3 Inferred from homology
4 Inferred from homology
5 Inferred from homology
6 Inferred from homology

con <- dbConnect(SQLite(), "org.Veffusa.eg.sqlite")
con
<SQLiteConnection>
  Path: C:\Users\MT\Desktop\wd\dump\org.Veffusa.eg.sqlite
  Extensions: TRUE

dbGetQuery(con, "select * from gene_info where tax_id='50376';")
Error: no such table: gene_info

makeOrgPackage(version="1.0",maintainer = "Bnet <bnetscrname@aol.com>",author = "M T <bnetscrname@aol.com>",
                 outputDir="C:\\Users\\MT\\Desktop\\wd\\dump",
                 tax_id = "50376",genus="Venturia", species = "effusa",goTable='go')
Error in .makeOrgPackage(data, version = version, maintainer = maintainer,  : 
  'goTable' must be a name from the data.frames passed in '...'

ADD REPLY • link 23 months ago mat149 ▴ 80

0

Entering edit mode

See the first argument for makeOrgPackage, which is the ellipsis argument (pointed out in the error you got). Also see the examples.

ADD REPLY • link 23 months ago James W. MacDonald 68k