Search
Question: derfinder: fullCoverage() problems
0
4.2 years ago by
United States
jessica.hekman40 wrote:

Thanks for the previous help in using derfinder! At this point my code looks like this:

source("http://bioconductor.org/biocLite.R")
biocLite("derfinder")
library(derfinder)
fullCov1 <- fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs="1")



However, the object returned by fullCoverage shows 0 hits at every nucleotide position (I used summary() to check). I also tried loading a different chromosome instead of 1, and got an error for the 3 of them that I checked:

> fullCov2 <- fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs="2")
2014-10-20 16:02:37 fullCoverage: processing chromosome 2
2014-10-20 16:02:38 loadCoverage: finding chromosome lengths
Error: 1 errors; first error:
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges"): solving row 2: range cannot be determined from the supplied arguments (too many NAs)



My BAM files do have alignments in them, so I'm not sure what's going on or how to debug this problem. I'd appreciate insights!

Thanks,

Jessica

modified 4.2 years ago by Leonardo Collado Torres610 • written 4.2 years ago by jessica.hekman40

Hi Jessica,

I need more information to check your case. But before jumping into details, please try to follow the posting guide: http://www.bioconductor.org/help/support/posting-guide/ Even if I can guess the current version from the date of your question, it's best to show the actual output of the session information. At least for the package you are asking about: in this case derfinder.

Now, in your code by specifying chrs = "1", derfinder assumes that in your BAM file you chromosomes are labeled using the NCBI naming style. Maybe your BAM files have a different naming style given your error with chromosome '2'.

> library(GenomeInfoDb)
> chr <- '1'
> seqlevelsStyle(chr)
[1] "NCBI"
> packageVersion('GenomeInfoDb')
[1] ‘1.2.0’

It this the case? You can verify this using the equivalent of:

> library(Rsamtools)
> ?BamFile
starting httpd help server ... done
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
+                   mustWork=TRUE)
> bf <- BamFile(fl)
> scanBamHeader(bf)$targets seq1 seq2 1575 1584 > packageVersion('Rsamtools') [1] ‘1.18.0’ ## Or use data in derfinder > library(derfinder) > datadir <- system.file('extdata', 'genomeData', package='derfinder') > files <- rawFiles(datadir = datadir, samplepatt = '*accepted_hits.bam$',
+     fileterm = NULL)
> scanBamHeader(BamFile(files[1]))$targets 1 10 11 12 13 14 15 249250621 135534747 135006516 133851895 115169878 107349540 102531392 16 17 18 19 2 20 21 90354753 81195210 78077248 59128983 243199373 63025520 48129895 22 3 4 5 6 7 8 51304566 198022430 191154276 180915260 171115067 159138663 146364022 9 MT X Y 141213431 16569 155270560 59373566  Next, can you be a bit more explicit about using summary() to check? For example, did you do the equivalent of: > fullCov <- fullCoverage(files[1], chrs = '21', verbose = FALSE) > fullCov[[1]][[1]] integer-Rle of length 48129895 with 29 runs Lengths: 47410302 34 350 ... 358 33 711828 Values : 0 1 0 ... 0 1 0 > summary(fullCov[[1]][[1]]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0e+00 0.0e+00 0.0e+00 7.5e-06 0.0e+00 2.0e+00 > max(fullCov[[1]][[1]]) [1] 2 Finally, the output of traceback() run immediately after an error can be useful to debug the error. Cheers, Leo ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Leonardo Collado Torres610 Thanks, Leo. I believe R and all relevant packages are up to date: R version 3.1.1 (2014-07-10) Platform: x86_64-redhat-linux-gnu (64-bit) Bioconductor: 3.0 derfinder: 1.0.3 So far as I can tell I'm using the correct chromosome naming scheme -- how's it look to you? : > library(Rsamtools) > library(derfinder) > f1 <- BamFile("tophat/tophat-juncs-10_130/accepted_hits.bam") > scanBamHeader(BamFile(files[1]))$targets
1        10        11        12        13        14        15        16
249250621 135534747 135006516 133851895 115169878 107349540 102531392  90354753
17        18        19         2        20        21        22         3
81195210  78077248  59128983 243199373  63025520  48129895  51304566 198022430
4         5         6         7         8         9        MT         X
191154276 180915260 171115067 159138663 146364022 141213431     16569 155270560
Y
59373566

Here is how I called summary():

> fullCov1 <- fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs="1")
> summary(fullCov1$chr1[1,]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 0 0 0 0 0 And here is my call to fullCoverage() with a call to traceback() immediately afterwards -- I was unable to glean any more information from that call, sadly: > fullCov11 <- fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs="11") 2014-10-20 19:42:47 fullCoverage: processing chromosome 11 2014-10-20 19:42:47 loadCoverage: finding chromosome lengths Error: 1 errors; first error: Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges"): solving row 2: range cannot be determined from the supplied arguments (too many NAs) For more information, use bplasterror(). To resume calculation, re-call the function and set the argument 'BPRESUME' to TRUE or wrap the previous call in bpresume(). First traceback: 20: fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs = "11") 19: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs, bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff, mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM) 18: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs, bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff, mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM) 17: lapply(X, FUN, ...) 16: lapply(X, FUN, ...) 15: FUN(1L[[1L]], ...) 14: .try > traceback() 5: stop(paste(msg, collapse = "\n\n"), call. = FALSE) 4: LastError$store(results = results, is.error = is.error, throw.error = TRUE)
3: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM)
2: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM)
1: fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"),
chrs = "11")

Thanks again.

Jessica

Hi Jessica,

When you ran:

> summary(fullCov1$chr1[1,]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 0 0 0 0 0 You got the summary for the coverage across all samples (just 1 in your case because you only are supplying 1 BAM file) for the first base of chromosome 1. This should be more obvious if you run: dim(fullCov1$chr1)


The appropriate summary call should be:

summary(fullCov1$chr1[[1]]) Next, you mixed objects when you ran: > f1 <- BamFile("tophat/tophat-juncs-10_130/accepted_hits.bam") > scanBamHeader(BamFile(files[1]))$targets

I was interested in the equivalent in your case. That is, run:

scanBamHeader(BamFile(f1))$targets As for the traceback() call, it doesn't help much in this case, but it does in others. Anyhow, lets see your scanBamHeader()$targets output first.

Finally, regarding the sessionInfo(), it's important to post it so others in the future can follow the thread. Otherwise they won't know what was "the latest Bioconductor version" or "latest package version". Plus, in other bug reports, the bug might have been fixed in another version, or it can help the package developer trace the problem.

Cheers,

Leo

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Leonardo Collado Torres610

Sorry, I did mess up my code there, and it looks like indeed we ARE getting some results once I query the result of fullCoverage() correctly:

> summary(fullCov1$chr1[[1]]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000 0.000 0.000 0.788 0.000 21150.000 That's a relief! As for the chromosome naming conventions: > scanBamHeader(f1)$targets
1        10        11        12        13        14        15        16
122678785  69331447  74389097  72498081  63241923  60966679  64190966  59632846
17        18        19         2        20        21        22        23
64289059  55844845  53741614  85426708  58134056  50858623  61439934  52294480
24        25        26        27        28        29         3        30
47698779  51628933  38964690  45876710  41182112  41845238  91889043  40214260
31        32        33        34        35        36        37        38
39895921  38810281  31377067  42124431  26524999  30810995  30902991  23914537
4         5         6         7         8         9        MT         X
88276631  88915250  77573801  80974532  74330416  61074082     16727 123869142



I apologize for the sloppy code!

So it looks like derfinder is behaving as expected for chromosome 1 (thanks for catching that), but I am still perplexed why I'm getting that error with other chromosomes.

(Noted about reporting versions every time; will do.)

Jessica

Ok, so summary() is working. And you are indeed working with NCBI chromosome names, although you have more chromosomes than in human.

Hopefully scanBamHeaders()$targets will reveal more information. Otherwise I'm going to need a small reproducible example to debug the error. ADD REPLYlink written 4.2 years ago by Leonardo Collado Torres610 1 OK, I have a 28M BAM which just contains reads from chr11, and its index (or you could create the index file yourself if that's easier for you). fullCoverage() produces the same error on this file as on the full one. http://arborius.net/~jphekman/10_130-chr11.bam http://arborius.net/~jphekman/10_130-chr11.bam.bai small11 <- fullCoverage(c("10_130-chr11.bam"), chrs="11") (I may not be able to post again immediately -- I seem to have reached my limit as a new user and have to wait an hour.) Jessica ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by jessica.hekman40 Thanks, I'll look into it! ADD REPLYlink written 4.2 years ago by Leonardo Collado Torres610 While I expand derfinder and make it more flexible, if you are indeed working only with chromosomes 1 to 22 and X, this will currently work: > library('derfinder') > bam <- '10_130-chr11.bam' > names(bam) <- 'sample1' > fullCov <- fullCoverage(bam, chrs = paste0('chr', c(1:22, 'X')), fileStyle='NCBI') ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Leonardo Collado Torres610 Also, if the genome that you are using is not on GenomeInfoDb, consider submitting the information required so it's added. See more at http://www.bioconductor.org/packages/release/bioc/vignettes/GenomeInfoDb/inst/doc/Accept-organism-for-GenomeInfoDb.pdf ADD REPLYlink written 4.2 years ago by Leonardo Collado Torres610 I am using Canis familiaris, so that explains it, I guess -- I should have mentioned that at the outset! I'll look into adding it to Bioconductor, but I think it's already there. CanFam has 38 chromosomes, but I can at least start working with the first 22 for now. Your code worked great. What is "sample1" -- where should I go to learn more about how/why setting names(bam) to sample1 helped? Thanks, Jessica ADD REPLYlink written 4.2 years ago by jessica.hekman40 Hi Jessica, I don't see Canis familiaris on the list of currently supported genomes by GenomeInfoDb: > names(genomeStyles()) [1] "Arabidopsis_thaliana" "Caenorhabditis_elegans" "Cyanidioschyzon_merolae" "Drosophila_melanogaster" "Homo_sapiens" [6] "Mus_musculus" "Oryza_sativa" "Populus_trichocarpa" "Saccharomyces_cerevisiae" "Zea_mays" > packageVersion('GenomeInfoDb') [1] ‘1.2.0’ In any case, I'm working on this issue. As for names(bam) <- 'sample1' That was just to make the printed output shorter (for example, run fullCov[[1]]). It had nothing to do with getting this to work. Cheers, Leo ADD REPLYlink written 4.2 years ago by Leonardo Collado Torres610 I've now written up a file for Canis familiaris to submit to GenomeInfoDb, but though the PDF documentation is very clear about how to format the file, it isn't clear about where to send it once it's complete -- do you know? Otherwise I'll just post a new query here (I searched this forum and also didn't find an answer). Jessica ADD REPLYlink written 4.2 years ago by jessica.hekman40 Hi Jessica, I just asked this question in the Bioc-devel mailing list (see https://stat.ethz.ch/pipermail/bioc-devel/2014-October/006520.html). We'll hear shortly as to whom to contact with the information you built. Cheers, Leo ADD REPLYlink written 4.2 years ago by Leonardo Collado Torres610 Hi Jessica, Please contact Sonali Arora for adding Canis familiaris to GenomeInfoDb. See https://stat.ethz.ch/pipermail/bioc-devel/2014-October/006523.html Cheers, Leo ADD REPLYlink written 4.1 years ago by Leonardo Collado Torres610 I emailed it -- thanks! I'm guessing that just having this file will not allow me to access the rest of the dog chromosomes with derfinder, right? I still need to wait for your updates? Thanks again for all your help; it's really appreciated. Jessica ADD REPLYlink written 4.1 years ago by jessica.hekman40 I think it didn't add more information -- I believe the chromosome naming scheme is correct, right? So you'll want an example? The BAM file is 1.1G, so I will see if I can get something smaller for you. Jessica ADD REPLYlink written 4.2 years ago by jessica.hekman40 If you just include a few reads, say 1000, that' be great. In the meantime, I have an idea of what could be going wrong. See: > library(GenomeInfoDb) > mapSeqlevels('2', 'NCBI') 2 [1,] "2" [2,] "LGII" > mapSeqlevels('11', 'NCBI') 11 [1,] "11" [2,] "LGXI" > mapSeqlevels(c('1', '2'), 'NCBI') 1 2 "1" "2" > genomeStyles()$Populus_trichocarpa[c(2, 11), ]
circular auto   sex NCBI JGI2.F
2     FALSE TRUE FALSE LGII      2
11    FALSE TRUE FALSE LGXI     11

derfinder uses GenomeInfoDb and your use case releaved that you get a non-human chromosome when mapping '2' or '11' to the 'NCBI' naming style. When you also map '1' along with '2' or '11', you only get human chromosome names as intended. Furthermore, in your use case you have non-standard chromosomes (if you are indeed working with Homo sapiens data).

In short, I'll see what I can do to make derfinder more flexible and species-aware.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Leonardo Collado Torres610
4
4.2 years ago by
United States
Leonardo Collado Torres610 wrote:

Hello,

Here is the full answer to this question.

During the interchange of comments, we realized two things. First, summary() was being used on the first row of one sample (which will frequently be all 0s) instead of all the bases in a chromosome from one sample. Second, derfinder relies on GenomeInfoDb for using the correct sequence naming style. The interaction with GenomeInfoDb was too rigid and derfinder broke when analyzing data from an organism not present in GenomeInfoDb.

After quite a bit of work, version 1.1.5 of derfinder (Bioc-devel) and 1.0.4 (Bioc-release) have fixed this issue as can be seen below.

First we load derfinder, specify the BAM file, and load the coverage data for all _Canis familiaris_ chromosomes.

> library('derfinder')
## BAM file: http://arborius.net/~jphekman/10_130-chr11.bam
## BAI file: http://arborius.net/~jphekman/10_130-chr11.bam.bai
> bam <- '10_130-chr11.bam'
> fullCov <- fullCoverage(bam, chrs = c(1:38, 'MT', 'X'),
+     species = 'canis_familiaris')
## Lots of messages deleted
## species = 'canis_familiaris' is at this moment not supported by GenomeInfoDb but should be soon
## given https://support.bioconductor.org/p/62136/#62180
## In any case, derfinder doesn't crash.

Next we can verify that there is only coverage information for chromosome 11 because that is the way the BAM file was constructed.

## Check that there is some coverage info for the first sample
## Given how the BAM file was constructed, there will only be data for chr 11.
> sapply(fullCov, function(x) { sum(x[[1]])})
1        2        3        4        5
0        0        0        0        0
6        7        8        9       10
0        0        0        0        0
11       12       13       14       15
39961057        0        0        0        0
16       17       18       19       20
0        0        0        0        0
21       22       23       24       25
0        0        0        0        0
26       27       28       29       30
0        0        0        0        0
31       32       33       34       35
0        0        0        0        0
36       37       38       MT        X
0        0        0        0        0

Below is the correct summary() call and an illustration of the dimensions of the data.

## Dimension per chr is number of base pairs in chr times number of samples
> dim(fullCov[[1]])
[1] 122678785         1

## Correct summary() call for the first sample in chr11
> summary(fullCov[['11']][[1]])
Min.  1st Qu.   Median     Mean  3rd Qu.
0.000    0.000    0.000    0.537    0.000
Max.
7799.000

Reproducibility information.

## Versions used
> packageVersion('derfinder')
[1] ‘1.1.5’
> packageVersion('GenomeInfoDb')
[1] ‘1.2.0’

If you like this answer, please select it as the answer to close this question (only possible if you asked the question) or upvote it (available for anyone).

Best,
Leo

PS Details here.

Incredibly fast update. I will be testing it out today or tomorrow. Thanks so much.

Jessica