Hello,
Here is the full answer to this question.
During the interchange of comments, we realized two things. First, `summary()` was being used on the first row of one sample (which will frequently be all 0s) instead of all the bases in a chromosome from one sample. Second, `derfinder` relies on `GenomeInfoDb` for using the correct sequence naming style. The interaction with `GenomeInfoDb` was too rigid and `derfinder` broke when analyzing data from an organism not present in `GenomeInfoDb`.
After quite a bit of work, version 1.1.5 of `derfinder` (Bioc-devel) and 1.0.4 (Bioc-release) have fixed this issue as can be seen below.
First we load `derfinder`, specify the BAM file, and load the coverage data for all _Canis familiaris_ chromosomes.
> library('derfinder')
## BAM file: http://arborius.net/~jphekman/10_130-chr11.bam
## BAI file: http://arborius.net/~jphekman/10_130-chr11.bam.bai
> bam <- '10_130-chr11.bam'
> fullCov <- fullCoverage(bam, chrs = c(1:38, 'MT', 'X'),
+ species = 'canis_familiaris')
## Lots of messages deleted
## species = 'canis_familiaris' is at this moment not supported by GenomeInfoDb but should be soon
## given https://support.bioconductor.org/p/62136/#62180
## In any case, derfinder doesn't crash.
Next we can verify that there is only coverage information for chromosome 11 because that is the way the BAM file was constructed.
## Check that there is some coverage info for the first sample
## Given how the BAM file was constructed, there will only be data for chr 11.
> sapply(fullCov, function(x) { sum(x[[1]])})
1 2 3 4 5
0 0 0 0 0
6 7 8 9 10
0 0 0 0 0
11 12 13 14 15
39961057 0 0 0 0
16 17 18 19 20
0 0 0 0 0
21 22 23 24 25
0 0 0 0 0
26 27 28 29 30
0 0 0 0 0
31 32 33 34 35
0 0 0 0 0
36 37 38 MT X
0 0 0 0 0
Below is the correct `summary()` call and an illustration of the dimensions of the data.
## Dimension per chr is number of base pairs in chr times number of samples
> dim(fullCov[[1]])
[1] 122678785 1
## Correct summary() call for the first sample in chr11
> summary(fullCov[['11']][[1]])
Min. 1st Qu. Median Mean 3rd Qu.
0.000 0.000 0.000 0.537 0.000
Max.
7799.000
Reproducibility information.
## Versions used
> packageVersion('derfinder')
[1] ‘1.1.5’
> packageVersion('GenomeInfoDb')
[1] ‘1.2.0’
If you like this answer, please select it as the answer to close this question (only possible if you asked the question) or upvote it (available for anyone).
Best,
Leo
PS Details here.
Hi Jessica,
I need more information to check your case. But before jumping into details, please try to follow the posting guide: http://www.bioconductor.org/help/support/posting-guide/ Even if I can guess the current version from the date of your question, it's best to show the actual output of the session information. At least for the package you are asking about: in this case `derfinder`.
Now, in your code by specifying `chrs = "1"`, `derfinder` assumes that in your BAM file you chromosomes are labeled using the NCBI naming style. Maybe your BAM files have a different naming style given your error with chromosome '2'.
It this the case? You can verify this using the equivalent of:
Next, can you be a bit more explicit about using `summary()` to check? For example, did you do the equivalent of:
Finally, the output of `traceback()` run immediately after an error can be useful to debug the error.
Cheers,
Leo
Thanks, Leo. I believe R and all relevant packages are up to date:
So far as I can tell I'm using the correct chromosome naming scheme -- how's it look to you? :
Here is how I called summary():
And here is my call to fullCoverage() with a call to traceback() immediately afterwards -- I was unable to glean any more information from that call, sadly:
Thanks again.
Jessica
Hi Jessica,
When you ran:
You got the summary for the coverage across all samples (just 1 in your case because you only are supplying 1 BAM file) for the first base of chromosome 1. This should be more obvious if you run:
The appropriate summary call should be:
Next, you mixed objects when you ran:
I was interested in the equivalent in your case. That is, run:
As for the `traceback()` call, it doesn't help much in this case, but it does in others. Anyhow, lets see your `scanBamHeader()$targets` output first.
Finally, regarding the `sessionInfo()`, it's important to post it so others in the future can follow the thread. Otherwise they won't know what was "the latest Bioconductor version" or "latest package version". Plus, in other bug reports, the bug might have been fixed in another version, or it can help the package developer trace the problem.
Cheers,
Leo
Sorry, I did mess up my code there, and it looks like indeed we ARE getting some results once I query the result of fullCoverage() correctly:
That's a relief!
As for the chromosome naming conventions:
I apologize for the sloppy code!
So it looks like derfinder is behaving as expected for chromosome 1 (thanks for catching that), but I am still perplexed why I'm getting that error with other chromosomes.
(Noted about reporting versions every time; will do.)
Jessica
Ok, so summary() is working. And you are indeed working with NCBI chromosome names, although you have more chromosomes than in human.
Hopefully `scanBamHeaders()$targets` will reveal more information. Otherwise I'm going to need a small reproducible example to debug the error.
OK, I have a 28M BAM which just contains reads from chr11, and its index (or you could create the index file yourself if that's easier for you). fullCoverage() produces the same error on this file as on the full one.
http://arborius.net/~jphekman/10_130-chr11.bam
http://arborius.net/~jphekman/10_130-chr11.bam.bai
(I may not be able to post again immediately -- I seem to have reached my limit as a new user and have to wait an hour.)
Jessica
Thanks, I'll look into it!
While I expand `derfinder` and make it more flexible, if you are indeed working only with chromosomes 1 to 22 and X, this will currently work:
Also, if the genome that you are using is not on GenomeInfoDb, consider submitting the information required so it's added. See more at http://www.bioconductor.org/packages/release/bioc/vignettes/GenomeInfoDb/inst/doc/Accept-organism-for-GenomeInfoDb.pdf
I am using Canis familiaris, so that explains it, I guess -- I should have mentioned that at the outset! I'll look into adding it to Bioconductor, but I think it's already there.
CanFam has 38 chromosomes, but I can at least start working with the first 22 for now. Your code worked great. What is "sample1" -- where should I go to learn more about how/why setting names(bam) to sample1 helped?
Thanks,
Jessica
Hi Jessica,
I don't see Canis familiaris on the list of currently supported genomes by `GenomeInfoDb`:
In any case, I'm working on this issue.
As for
That was just to make the printed output shorter (for example, run `fullCov[[1]]`). It had nothing to do with getting this to work.
Cheers,
Leo
I've now written up a file for Canis familiaris to submit to GenomeInfoDb, but though the PDF documentation is very clear about how to format the file, it isn't clear about where to send it once it's complete -- do you know? Otherwise I'll just post a new query here (I searched this forum and also didn't find an answer).
Jessica
Hi Jessica,
I just asked this question in the Bioc-devel mailing list (see https://stat.ethz.ch/pipermail/bioc-devel/2014-October/006520.html). We'll hear shortly as to whom to contact with the information you built.
Cheers,
Leo
Hi Jessica,
Please contact Sonali Arora for adding Canis familiaris to GenomeInfoDb. See https://stat.ethz.ch/pipermail/bioc-devel/2014-October/006523.html
Cheers,
Leo
I emailed it -- thanks! I'm guessing that just having this file will not allow me to access the rest of the dog chromosomes with derfinder, right? I still need to wait for your updates?
Thanks again for all your help; it's really appreciated.
Jessica
I think it didn't add more information -- I believe the chromosome naming scheme is correct, right? So you'll want an example?
The BAM file is 1.1G, so I will see if I can get something smaller for you.
Jessica
If you just include a few reads, say 1000, that' be great. In the meantime, I have an idea of what could be going wrong.
See:
`derfinder` uses `GenomeInfoDb` and your use case releaved that you get a non-human chromosome when mapping '2' or '11' to the 'NCBI' naming style. When you also map '1' along with '2' or '11', you only get human chromosome names as intended. Furthermore, in your use case you have non-standard chromosomes (if you are indeed working with Homo sapiens data).
In short, I'll see what I can do to make `derfinder` more flexible and species-aware.