Question

Error - BSgenomeForge function

0

Entering edit mode

Vijay Lakhujani • 0

@vijay-lakhujani-11556

Last seen 5.5 years ago

India

Hi,

I am using the MEDIPS package for MeDIP-seq analysis where I am trying to create a custom genome for Vigna radiata (not already present in the package) and I am referring to below manual for guidance:

https://www.bioconductor.org/packages/devel/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf

I have:

1. Genome file (gzipped) of V.radiata.

2. Seed file called BSgenome.Vradiata.UCSC.vr1-seed created referring to the manual (above link) section 2.2 Prepare the BSgenome data package seed file at page#4.

Now, I ran following commands:

> library(BSgenome)
> forgeBSgenomeDataPkg("path/to/my/seed")

where path/to/my/seed = my actual path to seed file.

I encountered following error:

Creating package in ./BSgenome.Vradiata.UCSC.vr1
Error in fetchSequenceInfo(genome) : genome "vr1" is not supported

Here is the session information:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] MEDIPS_1.22.0        Rsamtools_1.24.0     BSgenome_1.40.1     
 [4] rtracklayer_1.32.2   Biostrings_2.40.2    XVector_0.12.1      
 [7] GenomicRanges_1.24.3 GenomeInfoDb_1.8.7   IRanges_2.6.1       
[10] S4Vectors_0.10.3     BiocGenerics_0.18.0

loaded via a namespace (and not attached):
 [1] DNAcopy_1.46.0             AnnotationDbi_1.34.4      
 [3] zlibbioc_1.18.0            GenomicAlignments_1.8.4   
 [5] BiocParallel_1.6.6         tools_3.3.1               
 [7] SummarizedExperiment_1.2.3 Biobase_2.32.0            
 [9] DBI_0.5-1                  gtools_3.5.0              
[11] preprocessCore_1.34.0      bitops_1.0-6              
[13] RCurl_1.95-4.8             biomaRt_2.28.0            
[15] RSQLite_1.0.0              XML_3.98-1.4

I am clueless. Please help.

PS: I can provide the seed file if required; please let me know.

BSgenomeForge • 1.1k views

ADD COMMENT • link 7.6 years ago • updated 7.5 years ago Vijay Lakhujani • 0

score 0 · Answer 1 · 2016-10-10

Hi,

I suspect you're calling something like Seqinfo(genome="vr1") somewhere in your seed file. Maybe in the seqnames field? Something like:

seqnames: seqnames(Seqinfo(genome="vr1"))

The error message you get comes from this call to Seqinfo(genome="vr1"), and is just saying that the Seqinfo() function doesn't support the vr1 genome.

According to the BSgenomeForge vignette:

seqnames: [OPTIONAL] Needed only if you are using a collection of sequence data files. In that case seqnames must be an R expression returning the names of the single sequences to forge (in a character vector). E.g. paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"), "\_random", sep="")), sep="").

Do you have a collection of sequence data files? You say you have "Genome file (gzipped) of V.radiata" so it seems that you only have 1 single file with all the sequences in it. In that case you don't need the seqnames field.

Note that you can still put that field in your seed file, but that's useful only if you want to pick-up a subset of sequences to go in the BSgenome package, and/or if you want to control the order of the sequences that go in the BSgenome package. In that case, you must make sure that the R expression you put in the field is valid and returns the chromosome names of the vr1 genome. As you can see, seqnames(Seqinfo(genome="vr1")) is not a valid expression:

> seqnames(Seqinfo(genome="vr1"))
Error in fetchSequenceInfo(genome) : genome "vr1" is not supported

See ?Seqinfo for the list of genomes currently supported by Seqinfo(). It's a very small list at the moment.

Usually the expression used for the seqnames field explicitly creates and returns the vector of sequence names that must go in the BSgenome package, like in the example given in the BSgenomeForge vignette. If your assembly has thousands of scaffolds, and if you want to include them in the BSgenome package, coming up with this expression can be very tedious. So in that case, it's probably easier to not provide the seqnames field.

Hope this helps,

H.

score 0 · Answer 2 · 2016-11-09

Dear Hervé,

This was indeed helpful. Yes, I was using a single file with multiple fasta sequences as my custom genome.

Splitting the file into multiple fasta files (chromosome wise) resolved the issue. Thank you very much!

While I understand that answering a question as soon as it is posted on the forum is not possible, however, as an end user, it really becomes difficult to wait for an answer for more than a week. Sometimes, the problem (which could be a silly mistake from the end user's side) proves to be dead lock. Can there be a possible solution to this?

What should be a probable step as an end user to take in such cases. In my case, I posted the same question on biostars.org and many other bioinformatics forum but did not hear back anything.

This is strange for a package so widely used and so well documented. I agree that sometimes the answer is there in the manual but somehow it remains unnoticed. Of course, going through the manual a couple of times might help, but what could be the other workaround?

Kindly share your opinion. I am sure there are others going through the same problem.

Many thanks