Can I use BSgenome for in-house assemblies?
@f23c9071

Hello,

I have an in-house assembly for a non-model organism. I am planning to forge a BSgenome data package. I have 2 questions regarding that:

1. As the assembly is not yet published, it is not on any website, so I don't have a "source_url" for the description file. Is there any workaround for that?
2. I couldn't work out whether the built data package will be publicly available (for instance, show up when available.genomes() is called) or stay local.

Best,

Aybuge

@james-w-macdonald-5106
1. I don't think the source_url has to be real - it's just there for documentation purposes. It also might not be required, but if it is, you could just use a placeholder like https://www.fakeurl.com or whatever.
2. No, forging the package doesn't automatically make it publicly available.
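
As an illustration of the placeholder approach, here is a minimal sketch of a seed file for an in-house 2bit assembly, written and read back with base R only. Every value in it (package name, organism, paths, dates) is a hypothetical placeholder, and the field names are an assumption based on the BSgenome 1.58-era forging vignette, so check them against the vignette shipped with your installed version.

```r
# All values below are hypothetical placeholders for an in-house assembly.
seed_lines <- c(
  "Package: BSgenome.Xxx.InHouse.v1",
  "Title: Full genome sequences for Xxx (in-house assembly v1)",
  "Description: Full genome sequences for Xxx (in-house assembly v1), stored in Biostrings objects.",
  "Version: 0.1.0",
  "organism: Genus species",
  "common_name: Common name",
  "genome: v1",
  "provider: InHouse",
  "release_date: 2020/11",
  "source_url: https://www.fakeurl.com",   # placeholder, documentation only
  "organism_biocview: Genus_species",
  "BSgenomeObjname: Xxx",
  "circ_seqs: character(0)",               # no circular sequences
  "seqs_srcdir: /path/to/dir/with/2bit",   # hypothetical directory
  "seqfile_name: assembly.2bit"            # hypothetical file name
)

seed_path <- file.path(tempdir(), "BSgenome.Xxx.InHouse.v1-seed")
writeLines(seed_lines, seed_path)

# Read the file back as DCF and make sure no field is missing or empty
seed <- read.dcf(seed_path)
stopifnot(!anyNA(seed), all(nzchar(seed)))

# With a real 2bit file in place, forging would then look roughly like:
# BSgenome::forgeBSgenomeDataPkg(seed_path, destdir = "some/output/dir")
```

The resulting package directory stays on your machine; it only becomes available to others if you distribute it yourself.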

Hi James,

Thanks a lot for your reply! I gave it a try with a fake source_url and it seems to be fine. However, I am having another problem, which I believe originates from the "provider" field of the description file. The error is below:

Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, : values for symbols PROVIDERVERSION, RELEASENAME are not single strings


As I mentioned, the data is not from any of the conventional providers (like UCSC or NCBI) but in-house. Also, the .fasta file is organized by scaffolds rather than by chromosomes. Could you please help me with that too?

Thanks,

Aybuge


Hi,

Also NA values don't qualify as "single strings" so it could be that this is what the error message is trying to tell you, admittedly in some sort of cryptic way.
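
A quick way to see this for yourself is the sketch below. It assumes the S4Vectors package is installed and that the forge code applies a check along the lines of its isSingleString() helper, which is a guess rather than something confirmed in this thread.

```r
library(S4Vectors)  # provides isSingleString()

isSingleString("InHouse")       # TRUE: one non-NA character value
isSingleString(NA)              # FALSE: a bare NA is logical, not character
isSingleString(NA_character_)   # FALSE: character, but missing
isSingleString(c("a", "b"))     # FALSE: length 2
```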

Anyway I'm surprised that you would get an error about missing PROVIDERVERSION or RELEASENAME. These fields have been removed (or renamed) for a while and are no longer supported. What version of BSgenome are you using? The latest release version is 1.58.0. Note that BSgenome 1.58.0 belongs to Bioconductor 3.12 which requires R 4.0.
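
For reference, a small sketch of how one might check the installed versions. The helper name is_at_least_1_58 is made up for this example; the rest is base R.

```r
# Show which R is running
cat(R.version.string, "\n")

# Hypothetical helper: TRUE if a version string is at least 1.58.0,
# the BSgenome release named above
is_at_least_1_58 <- function(v) package_version(v) >= "1.58.0"

if (requireNamespace("BSgenome", quietly = TRUE)) {
  v <- as.character(packageVersion("BSgenome"))
  cat("BSgenome", v, "is at least 1.58.0:", is_at_least_1_58(v), "\n")
} else {
  cat("BSgenome is not installed\n")
}
```

If the installed version turns out to be older, moving to the matching Bioconductor release would typically go through BiocManager::install(version = "3.12"), which needs R 4.0 as noted.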

Cheers,

H.


Hi Hervé,

Thanks a lot for your reply! Indeed, I was using BSgenome 1.56.0, and as the Bioconductor vignette I was following is based on BSgenome 1.58.0, it did not include PROVIDERVERSION or RELEASENAME - so I ended up giving them as NA, which led to the "single strings" error.

After updating BSgenome to version 1.58.0, forgeBSgenomeDataPkg worked fine!

Just a small suggestion: it might be worth explicitly including a circ_seqs: character(0) line in the 2bit file example in the vignette.

I am aware that this is drifting away from the original question, but I am now having a problem when building the package - specifically at the R CMD check <tarball> step:

Error: package or namespace load failed for ‘BSgenome.Cperspicillata.X’:
call: validObject(.Object)
error: invalid class “TwoBitFile” object: undefined class for slot "resource" ("characterORconnection")
Execution halted


I have converted my .fasta file to 2bit using UCSC's faToTwoBit before creating the seed file, so I think it should be in the correct format. I am not sure whether it is due to my system. Any suggestions are much appreciated!

Best,

Aybuge


I've never used faToTwoBit() so don't know what could have gone wrong. FWIW I prefer to use Biostrings::readDNAStringSet() to load the FASTA file in R and then write the sequences back to disk in 2bit format. This allows more control like reordering the sequences. See this discussion for the topic of converting your FASTA file to the 2bit format, including pointers to scripts located in the BSgenome package that do this type of conversion.
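
A minimal sketch of that route, assuming Biostrings and rtracklayer are installed. The toy FASTA, the scaffold names, and the file names are made up for the example; a real run would point fasta_path at the actual assembly.

```r
library(Biostrings)
library(rtracklayer)

# Hypothetical toy FASTA standing in for the real scaffold-level assembly
fasta_path <- file.path(tempdir(), "assembly.fa")
writeLines(
  c(">scaffold_2", "ACGTACGTAC",
    ">scaffold_1", "GGGTTTAAACCC"),
  fasta_path
)

# Load the sequences into a DNAStringSet
dna <- readDNAStringSet(fasta_path)

# Reorder the sequences, here simply by name
dna <- dna[order(names(dna))]

# Write the reordered sequences back to disk in 2bit format
twobit_path <- file.path(tempdir(), "assembly.2bit")
export.2bit(dna, twobit_path)
```

The seed file would then name this file in seqfile_name and its directory in seqs_srcdir.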

H.


The error says what the error is! The values you have for the provider and provider_version are not single strings. If you are unsure what a 'single string' is, here are some examples:

this is not a single string
this_is_a_single_string
Neither is this
ButThisIs


Hi James,

Of course, I have checked all the seed file fields via isSingleString() before posting the question here, but I wasn't aware that NA values don't qualify as "single strings". The issue was related to my BSgenome version (see the comment below) and I solved it with Hervé's reply.

Thanks anyways!