Error while attempting to forge BSgenome data package
pterry • 0
@pterry-6902
United States

Motivation and goal: To prepare a BSgenome data package for the model organism Setaria italica. One appears to be needed, according to section 2.3 of the ggbio vignette dated August 26, 2014, to add a reference track that will eventually be used in an 'overview plot' (chapter 4 of the vignette).

Problem: I received the following error message while attempting to create a new BSgenome package following the vignette 'How to forge a BSgenome data package' dated Oct. 13, 2014. I previously upgraded to Bioconductor 3.0, though I don't know how to verify that the upgrade was successful. Thanks for any comments.


library(BSgenome)
> forgeBSgenomeDataPkg("/Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/BSgenome.Sitalica.Ensembl.22-seed")
Creating package in ./BSgenome.Sitalica.Ensembl.22
Error in getSeqSrcpaths(seqnames, prefix = prefix, suffix = suffix, seqs_srcdir = seqs_srcdir) :
  /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr1.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr2.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr3.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr4.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr5.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr6.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr7.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr8.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr9.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr1_random.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr2_random.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr3_random.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr4_random.fa,
>
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biobase_2.26.0       BSgenome_1.34.0      rtracklayer_1.26.1   Biostrings_2.34.0   
 [5] XVector_0.6.0        GenomicRanges_1.18.1 GenomeInfoDb_1.2.0   IRanges_2.0.0       
 [9] S4Vectors_0.4.0      BiocGenerics_0.12.0

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.4           BBmisc_1.7             
 [4] BiocParallel_1.0.0      bitops_1.0-6            brew_1.0-6             
 [7] checkmate_1.5.0         codetools_0.2-9         DBI_0.3.1              
[10] digest_0.6.4            fail_1.2                foreach_1.4.2          
[13] GenomicAlignments_1.2.0 iterators_1.0.7         RCurl_1.95-4.3         
[16] Rsamtools_1.18.0        RSQLite_0.11.4          sendmailR_1.2-1        
[19] stringr_0.6.2           tools_3.1.1             XML_3.98-1.1           
[22] zlibbioc_1.12.0        

 

bsgenome ggbio • 2.5k views
@herve-pages-1542
Seattle, WA, United States

Hi Philip,

Mmmh... unfortunately, the error message is truncated. It should end with "file(s) not found", which is self-explanatory. I just modified the code in the BSgenome forge so next time this happens you will get something like:

Error in getSeqSrcpaths(seqnames, prefix = prefix, suffix = suffix, seqs_srcdir = seqs_srcdir) :
  file(s) not found: /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr1.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr2.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr3.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/chr4.fa, 

So please check that your /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/ folder indeed contains FASTA files chr1.fa, chr2.fa, etc...

According to the "Obtain and prepare the sequence data" section of the BSgenome forge vignette:

    The sequence data must be in a single twoBit file (e.g. musFur1.2bit)
    or in a collection of FASTA files (possibly gzip-compressed).

    If the latter, then you need 1 FASTA file per sequence that you want to put
    in the target package. In that case the name of each FASTA file must be
    of the form <prefix><seqname><suffix> where <seqname> is the name of
    the sequence in it and <prefix> and <suffix> are a prefix and a suffix
    (possibly empty) that are the same for all the FASTA files.
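
The naming rule quoted above can be checked before calling forgeBSgenomeDataPkg(). The following is a minimal base R sketch, not code from BSgenome: the helper name missing_seq_files() and its arguments are made up for illustration. It builds <prefix><seqname><suffix> under a source directory and returns the paths that do not exist.

```r
# Hypothetical helper (not part of BSgenome): list the FASTA files that the
# <prefix><seqname><suffix> rule would expect but that are absent on disk.
missing_seq_files <- function(seqs_srcdir, seqnames, prefix = "", suffix = ".fa") {
  expected <- file.path(seqs_srcdir, paste0(prefix, seqnames, suffix))
  expected[!file.exists(expected)]
}

# Example with a throwaway directory that holds chr1.fa only.
demo_dir <- file.path(tempdir(), "seqs_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(">chr1", "ACGT"), file.path(demo_dir, "chr1.fa"))
missing <- missing_seq_files(demo_dir, c("chr1", "chr2"), prefix = "", suffix = ".fa")
print(basename(missing))
```

An empty result means every expected file is present under the given prefix and suffix.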
 

So if you have all the sequences in a single big FASTA file, you first need to split that file into one FASTA file per genomic sequence and place the resulting files in your /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/ folder. The splitting can be performed with a few lines of code. See the splitbigfasta.R scripts in the various inst/extdata/GentlemanLab/*-tools/ folders in the BSgenome package for examples of how to do this (you will need to adapt the script to your specific situation). I'm aware this splitting process is a little tedious, and I have it on my list to modify forgeBSgenomeDataPkg() so that it takes care of it for you. That won't happen this week though.
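
For illustration, the splitting step might look like the base R sketch below. This is not the splitbigfasta.R script shipped with BSgenome; split_fasta() is a made-up name, and the sketch assumes the whole file fits in memory and that the sequence name is the first whitespace-delimited word of each ">" header line.

```r
# Hypothetical helper: split a multi-record FASTA file into one file per record.
# Assumes the file fits in memory and that the sequence name is the first
# whitespace-delimited word of each ">" header line.
split_fasta <- function(big_fasta, out_dir, prefix = "", suffix = ".fa") {
  lines <- readLines(big_fasta)
  is_header <- startsWith(lines, ">")
  if (!any(is_header)) stop("no FASTA header lines found in ", big_fasta)
  seqnames <- sub("^>(\\S+).*$", "\\1", lines[is_header])
  record_id <- cumsum(is_header)   # 0 for any lines before the first header
  keep <- record_id > 0
  records <- split(lines[keep], record_id[keep])
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  out_paths <- file.path(out_dir, paste0(prefix, seqnames, suffix))
  for (i in seq_along(records)) {
    writeLines(records[[i]], out_paths[i])
  }
  invisible(out_paths)
}

# Small demonstration on a temporary two-record file.
big <- tempfile(fileext = ".fa")
writeLines(c(">1 dna:chromosome", "ACGT", "TTGA", ">2 dna:chromosome", "GGCC"), big)
out_dir <- file.path(tempdir(), "split_demo")
paths <- split_fasta(big, out_dir, prefix = "chr", suffix = ".fa")
print(basename(paths))
```

The prefix and suffix passed here would then have to match what the seed file declares, so that each output file name is <prefix><seqname><suffix>.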

Alternatively, if all you need are the sequences of the 9 chromosomes and you don't need the scaffold sequences, you could download the 9 Setaria_italica.JGIv2.0.23.dna.chromosome.*.fa.gz files from ftp://ftp.ensemblgenomes.org/pub/plants/release-23/fasta/setaria_italica/dna/ so the data is already split in 1 file per chromosome.

Don't hesitate to come back here again if you have more questions or need more help with this.

Cheers,

H.

pterry • 0
@pterry-6902
United States

Followup question about forging a BSgenome data package

Dear Herve,

Back to this project: in attempting to address your response to my initial post, I am still getting an error message when trying to run the 'forgeBSgenomeDataPkg' function.

I did the following:

--took your suggestion,
downloaded the 9 Setaria_italica.JGIv2.0.22.dna.chromosome.*.fa.gz files

--set the 'seqs_srcdir:' parameter in the BSgenome.Sitalica.Ensembl.22-seed file to the absolute path of the dir. containing the 9 sequence data files.

--hopefully the file names I am using are acceptable, where '*' above is one of the numbers in the range 1:9.

--the seed file contains:

Package: BSgenome.Sitalica.Ensembl.22
Title: Full genome sequences for Setaria italica (Ensembl, ver 2.0.22)
Version: 0.99.0
organism: Setaria italica
species: italica
provider: Ensembl
provider_version: 2.0.22
release_date: before 23 August 2014
release_name: JGI
source_url: ftp://ftp.ensemblgenomes.org/pub/plants/release-23/fasta/setaria_italica/dna/
organism_biocview: Setaria_italica
BSgenomeObjname: Sitalica
seqnames: paste("Setaria_italica.JGIv2.0.22.dna.chromosome.", c(1:9, paste(c(1:9), "_random", sep="")), sep="")
circ_seqs: NULL
mseqnames: NULL
SrcDataFiles: Setaria_italica.JGIv2.0.22.dna.genome.fa.gz from ftp://ftp.ensemblgenomes.org/pub/plants/release-22/fasta/setaria_italica/dna/
seqs_srcdir: /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir
Author: P. Terry
Maintainer: P. Terry <pterry@huskers.unl.edu>
Description: Full genome sequences for Setaria italica (foxtail millet) as provided by Ensembl (2.0.22, 2014)
License: Artistic-2.0

So, ran the following:

> library(BSgenome)
> forgeBSgenomeDataPkg("/Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/BSgenome.Sitalica.Ensembl.22-seed")
Creating package in ./BSgenome.Sitalica.Ensembl.22
Error in getSeqSrcpaths(seqnames, prefix = prefix, suffix = suffix, seqs_srcdir = seqs_srcdir) :
  /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir/Setaria_italica.JGIv2.0.22.dna.chromosome.1.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir/Setaria_italica.JGIv2.0.22.dna.chromosome.2.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir/Setaria_italica.JGIv2.0.22.dna.chromosome.3.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir/Setaria_italica.JGIv2.0.22.dna.chromosome.4.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir/Setaria_italica.JGIv2.0.22.dna.chromosome.5.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir/Setaria_italica.JGIv2.0.22.dna.chromosome.6.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir/Setaria_italica.JGIv2.0.22.dna.chromosome.7.fa, /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/seqs_srcdir/Setaria_italica.JGIv2.0.22.dna.chromos
>

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biobase_2.26.0       BSgenome_1.34.0      rtracklayer_1.26.1   Biostrings_2.34.0   
 [5] XVector_0.6.0        GenomicRanges_1.18.1 GenomeInfoDb_1.2.0   IRanges_2.0.0       
 [9] S4Vectors_0.4.0      BiocGenerics_0.12.0

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.4           BBmisc_1.7             
 [4] BiocParallel_1.0.0      bitops_1.0-6            brew_1.0-6             
 [7] checkmate_1.5.0         codetools_0.2-9         DBI_0.3.1              
[10] digest_0.6.4            fail_1.2                foreach_1.4.2          
[13] GenomicAlignments_1.2.0 iterators_1.0.7         RCurl_1.95-4.3         
[16] Rsamtools_1.18.0        RSQLite_0.11.4          sendmailR_1.2-1        
[19] stringr_0.6.2           tools_3.1.1             XML_3.98-1.1           
[22] zlibbioc_1.12.0        
>


Can you suggest what to try to get this to work?


Thanks,
Philip Terry
Univ. Nebraska-Lincoln
pterry@huskers.unl.edu

 

@herve-pages-1542
Seattle, WA, United States

Hi Philip,

Sorry for the delay. Your seqnames field contains:

paste("Setaria_italica.JGIv2.0.22.dna.chromosome.", c(1:9, paste(c(1:9), "_random", sep="")), sep="")

which evaluates to:

 [1] "Setaria_italica.JGIv2.0.22.dna.chromosome.1"       
 [2] "Setaria_italica.JGIv2.0.22.dna.chromosome.2"       
 [3] "Setaria_italica.JGIv2.0.22.dna.chromosome.3"       
 [4] "Setaria_italica.JGIv2.0.22.dna.chromosome.4"       
 [5] "Setaria_italica.JGIv2.0.22.dna.chromosome.5"       
 [6] "Setaria_italica.JGIv2.0.22.dna.chromosome.6"       
 [7] "Setaria_italica.JGIv2.0.22.dna.chromosome.7"       
 [8] "Setaria_italica.JGIv2.0.22.dna.chromosome.8"       
 [9] "Setaria_italica.JGIv2.0.22.dna.chromosome.9"       
[10] "Setaria_italica.JGIv2.0.22.dna.chromosome.1_random"
[11] "Setaria_italica.JGIv2.0.22.dna.chromosome.2_random"
[12] "Setaria_italica.JGIv2.0.22.dna.chromosome.3_random"
[13] "Setaria_italica.JGIv2.0.22.dna.chromosome.4_random"
[14] "Setaria_italica.JGIv2.0.22.dna.chromosome.5_random"
[15] "Setaria_italica.JGIv2.0.22.dna.chromosome.6_random"
[16] "Setaria_italica.JGIv2.0.22.dna.chromosome.7_random"
[17] "Setaria_italica.JGIv2.0.22.dna.chromosome.8_random"
[18] "Setaria_italica.JGIv2.0.22.dna.chromosome.9_random"

This doesn't work because you don't have FASTA files corresponding to the "random" sequences (you said you downloaded the 9 Setaria_italica.JGIv2.0.22.dna.chromosome.*.fa.gz files). Remember that you must have 1 FASTA file per sequence name that you put in the seqnames field. Also you need to specify the suffix that needs to be added to the sequence names in order to obtain the corresponding file names, .fa.gz in your case. I would also suggest that you set the seqfiles_prefix field to Setaria_italica.JGIv2.0.22.dna.chromosome. so you keep the sequence names short and clean for the user.

So in your seed file:

seqnames: 1:9
seqfiles_prefix: Setaria_italica.JGIv2.0.22.dna.chromosome.
seqfiles_suffix: .fa.gz
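
As a sanity check on these three fields, the sequence names can also be derived from the files that are actually present, by stripping the prefix and suffix from each file name. The base R sketch below is only an illustration: infer_seqnames() is a made-up helper, and the demo uses empty placeholder files rather than real downloads.

```r
# Hypothetical helper: derive sequence names from the files actually present,
# by stripping a fixed prefix and suffix from each matching file name.
infer_seqnames <- function(seqs_srcdir, prefix, suffix) {
  files <- list.files(seqs_srcdir)
  hit <- startsWith(files, prefix) & endsWith(files, suffix) &
    nchar(files) > nchar(prefix) + nchar(suffix)
  files <- files[hit]
  substr(files, nchar(prefix) + 1L, nchar(files) - nchar(suffix))
}

demo_dir <- file.path(tempdir(), "infer_demo")
dir.create(demo_dir, showWarnings = FALSE)
prefix <- "Setaria_italica.JGIv2.0.22.dna.chromosome."
suffix <- ".fa.gz"
# Empty placeholder files standing in for downloaded chromosomes 1 to 3.
invisible(file.create(file.path(demo_dir, paste0(prefix, 1:3, suffix))))
seqnames <- infer_seqnames(demo_dir, prefix, suffix)
print(seqnames)
```

If the names returned differ from what the seqnames field evaluates to, the forge step will look for files that are not there.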

Let me know how it goes.

H.

pterry • 0
@pterry-6902
United States

Dear Herve,

A 3rd question on this thread.

I apparently succeeded in forging, building, checking & installing a 'bare' bones BSgenome data package, BSgenome.Sitalica.Ensembl.22.

Then ran into a problem trying to forge a BSgenome package with 'masked' seqs.

My seed file for this '2nd target package' is: BSgenome.Sitalica.Ensembl.22.masked-seed

Package: BSgenome.Sitalica.Ensembl.22.masked
Title: Full masked genome sequences for Setaria italica (Ensembl, ver 2.0.22)
Version: 0.99.0
RefPkgname: BSgenome.Sitalica.Ensembl.22
source_url: ftp://ftp.ensemblgenomes.org/pub/plants/release-23/fasta/setaria_italica/dna/
organism_biocview: Setaria_italica
mask_per_seq: 1
SrcDataFiles: RM masks: Setaria_italica.JGIv2.0.22.dna_rm.genome.fa.gz from ftp://ftp.ensemblgenomes.org/pub/plants/release-22/fasta/setaria_italica/dna/
masks_srcdir: /Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/masks_srcdir
RMfiles_name: paste(c(1:9), sep="")
RMfiles_prefix: Setaria_italica.JGIv2.0.22.dna_rm.chromosome.
RMfiles_suffix: .fa.gz
Author: P. Terry
Maintainer: P. Terry <pterry@huskers.unl.edu>
Description: Full genome sequences for Setaria italica (foxtail millet) as provided by Ensembl (2.0.22, 2014)
License: Artistic-2.0

So when I run the forge command:

> library(BSgenome)
> library(BSgenome.Sitalica.Ensembl.22)
> forgeMaskedBSgenomeDataPkg("/Users/bterry/macbookpro2014/keenanres/Sitalica/packs/sitalica22/BSgenome.Sitalica.Ensembl.22.masked-seed")
Error in makeS4FromList("MaskedBSgenomeDataPkgSeed", x) :
  some names in 'x' are not valid MaskedBSgenomeDataPkgSeed slots (mask_per_seq)
>
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] BSgenome.Sitalica.Ensembl.22_0.99.0 BSgenome_1.34.0                    
 [3] rtracklayer_1.26.2                  Biostrings_2.34.0                  
 [5] XVector_0.6.0                       GenomicRanges_1.18.3               
 [7] GenomeInfoDb_1.2.3                  IRanges_2.0.0                      
 [9] S4Vectors_0.4.0                     BiocGenerics_0.12.1                

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8             
 [4] BiocParallel_1.0.0      bitops_1.0-6            brew_1.0-6             
 [7] checkmate_1.5.0         codetools_0.2-9         DBI_0.3.1              
[10] digest_0.6.6            fail_1.2                foreach_1.4.2          
[13] GenomicAlignments_1.2.1 iterators_1.0.7         RCurl_1.95-4.5         
[16] Rsamtools_1.18.2        RSQLite_1.0.0           sendmailR_1.2-1        
[19] stringr_0.6.2           tools_3.1.2             XML_3.98-1.1           
[22] zlibbioc_1.12.0        
>

So I suspect a problem within these 3 fields,
RMfiles_name: paste(c(1:9), sep="")
RMfiles_prefix: Setaria_italica.JGIv2.0.22.dna_rm.chromosome.
RMfiles_suffix: .fa.gz
or within 'SrcDataFiles:'.
Sec. 3.1.2 of the vignette (RM masks) expects 'chromOut.tar.gz' input, but I have FASTA files: the 9 Setaria_italica.JGIv2.0.22.dna_rm.chromosome.*.fa.gz files from ftp://ftp.ensemblgenomes.org/pub/plants/release-22/fasta/setaria_italica/dna/


Thanks for comments,
Philip Terry
Univ. Nebraska-Lincoln

 


Hi Philip,

Are you sure you need the masks? Forging a BSgenome data package with masked sequences can be tricky. The good news is that most of the time it's not needed, and using a BSgenome data package with bare sequences is enough. There are only very few use cases where using a BSgenome package with masks offers some (generally minor) advantage. So I would strongly recommend that you use the BSgenome package you forged (BSgenome.Sitalica.Ensembl.22) unless you have a good reason for forging and using BSgenome.Sitalica.Ensembl.22.masked.

Anyway the error you got says that mask_per_seq is not a valid field for your seed file. The correct field is nmask_per_seq (as you can see in the BSgenomeForge vignette). Also I should probably clarify this in the vignette but you cannot have the RM masks without having the AGAPS and AMB masks. More precisely, if you want the RM masks, you need to have the following masks in that order: (1) AGAPS, (2) AMB, (3) RM. You'll also need to set nmask_per_seq to 3.
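
A misspelled field name like this can be caught before forging by comparing the field names in the seed file against a list of accepted names. The base R sketch below is an illustration only: unknown_seed_fields() is a made-up helper, it relies on the seed file having the DESCRIPTION-like "Field: value" layout shown in this thread (which read.dcf() can parse), and the allowed set in the demo is deliberately short and is not the real list of slots. With BSgenome loaded, slotNames("MaskedBSgenomeDataPkgSeed") may supply the real list, but that is an assumption not verified here.

```r
# Hypothetical helper: report seed-file fields that are not in an allowed set.
# 'allowed' must be supplied by the caller; this sketch does not know the
# real list of slots accepted by BSgenome.
unknown_seed_fields <- function(seed_file, allowed) {
  seed <- read.dcf(seed_file)   # parses "Field: value" lines into a matrix
  setdiff(colnames(seed), allowed)
}

seed_file <- tempfile()
writeLines(c("Package: BSgenome.Sitalica.Ensembl.22.masked",
             "RefPkgname: BSgenome.Sitalica.Ensembl.22",
             "mask_per_seq: 1"), seed_file)
# Deliberately short allowed set, for illustration only.
allowed <- c("Package", "RefPkgname", "nmask_per_seq")
bad <- unknown_seed_fields(seed_file, allowed)
print(bad)
```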

Let me know how it goes.

H. 

@serfrazsaad-22250

Hi Herve,

I would like to develop a genomic package for a small contig of size 492 kb. I got an error at the last stage of package development. When I run

forgeBSgenomeDataPkg("../seed/SMELseed")
Creating package in ./BSgenome.SMEL05585.UCSC
Copying '/homedir/serfraz/work/Solanumsouthhampton/melongena/seqssrcdir//SMEL3Ch00.05585.fa' to './BSgenome.SMEL05585.UCSC/inst/extdata/singlesequences.2bit' ... DONE

then the build step gives an error:

R CMD build BSgenome.SMEL05585.UCSC
* checking for file ‘BSgenome.SMEL05585.UCSC/DESCRIPTION’ ... OK
* preparing ‘BSgenome.SMEL05585.UCSC’:
* checking DESCRIPTION meta-information ... ERROR
Malformed package version.

I modified the seed, for example by adding information about author, depends, etc., but the modified seed gives the error below. I am a novice in R and would appreciate your help.

forgeBSgenomeDataPkg("../seed/SMEL_seed")
Error in makeS4FromList("BSgenomeDataPkgSeed", x) :
  some names in 'x' are not valid BSgenomeDataPkgSeed slots (Date, Depends)

This is my seed:

Package: BSgenome.SMEL05585.UCSC
Version: 1
Title: SMEL05585 contig
Author: Saad Serfraz
Maintainer: Saad Serfraz serfraz.saad@gmail.com
Description: Ch0partial sequence
organism: Solanum melongena
common_name: Eggplant
provider: UCSC
provider_version: SMEL3
release_date: Apr. 2019
release_name: SMEL consortium
source_url: https://solgenomics.net/organism/Solanummelongena/genome
organism_biocview: Solanummelongena
BSgenomeObjname: SMEL
SrcDataFiles: SMEL3Ch00.05585.fa from ftp://ftp.solgenomics.net/genomes/Solanummelongenaconsortium/assembly/V3/EggplantV3CH0.fa
seqs_srcdir: /homedir/serfraz/work/Solanumsouthhampton/melongena/seqssrcdir/
seqfile_name: SMEL3Ch00.05585.fa
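
On the "Malformed package version" message: according to the Writing R Extensions manual, a package Version is a sequence of at least two non-negative integers separated by single '.' or '-' characters. A Version of 1 on its own would not fit that rule, whereas 0.99.0 (used in the seed files earlier in this thread) would. The base R sketch below approximates the rule with a regular expression; is_valid_pkg_version() is a made-up name, and the pattern is an approximation rather than the check that R CMD build itself runs.

```r
# Hypothetical helper: TRUE when a version string has at least two integer
# components separated by single '.' or '-' characters.
is_valid_pkg_version <- function(v) grepl("^[0-9]+([.-][0-9]+)+$", v)

versions <- c("1", "1.0", "0.99.0")
ok <- is_valid_pkg_version(versions)
print(setNames(ok, versions))
```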
