Question

derfinder: fullCoverage() problems

0

Entering edit mode

jessica.hekman ▴ 40

@jessicahekman-6877

Last seen 10.2 years ago

United States

Thanks for the previous help in using derfinder! At this point my code looks like this:

source("http://bioconductor.org/biocLite.R")
biocLite("derfinder")
library(derfinder)
fullCov1 <- fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs="1")

However, the object returned by fullCoverage shows 0 hits at every nucleotide position (I used summary() to check). I also tried loading a different chromosome instead of 1, and got an error for the 3 of them that I checked:

> fullCov2 <- fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs="2")
2014-10-20 16:02:37 fullCoverage: processing chromosome 2
2014-10-20 16:02:38 loadCoverage: finding chromosome lengths
Error: 1 errors; first error:
  Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges"): solving row 2: range cannot be determined from the supplied arguments (too many NAs)

My BAM files do have alignments in them, so I'm not sure what's going on or how to debug this problem. I'd appreciate insights!

Thanks,

Jessica

derfinder • 4.4k views

ADD COMMENT • link updated 11.1 years ago by Leonardo Collado Torres ★ 1.1k • written 11.1 years ago by jessica.hekman ▴ 40

0

Entering edit mode

Hi Jessica,

I need more information to check your case. But before jumping into details, please try to follow the posting guide: http://www.bioconductor.org/help/support/posting-guide/ Even if I can guess the current version from the date of your question, it's best to show the actual output of the session information. At least for the package you are asking about: in this case `derfinder`.

Now, in your code by specifying `chrs = "1"`, `derfinder` assumes that in your BAM file you chromosomes are labeled using the NCBI naming style. Maybe your BAM files have a different naming style given your error with chromosome '2'.

> library(GenomeInfoDb)
> chr <- '1'
> seqlevelsStyle(chr)
[1] "NCBI"
> packageVersion('GenomeInfoDb')
[1] ‘1.2.0’

It this the case? You can verify this using the equivalent of:

> library(Rsamtools)
> ?BamFile
starting httpd help server ... done
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
+                   mustWork=TRUE)
> bf <- BamFile(fl)
> scanBamHeader(bf)$targets
seq1 seq2 
1575 1584 
> packageVersion('Rsamtools')
[1] ‘1.18.0’

## Or use data in `derfinder`

> library(derfinder)
> datadir <- system.file('extdata', 'genomeData', package='derfinder')
> files <- rawFiles(datadir = datadir, samplepatt = '*accepted_hits.bam$',
+     fileterm = NULL)
> scanBamHeader(BamFile(files[1]))$targets
        1        10        11        12        13        14        15 
249250621 135534747 135006516 133851895 115169878 107349540 102531392 
       16        17        18        19         2        20        21 
 90354753  81195210  78077248  59128983 243199373  63025520  48129895 
       22         3         4         5         6         7         8 
 51304566 198022430 191154276 180915260 171115067 159138663 146364022 
        9        MT         X         Y 
141213431     16569 155270560  59373566

Next, can you be a bit more explicit about using `summary()` to check? For example, did you do the equivalent of:

> fullCov <- fullCoverage(files[1], chrs = '21', verbose = FALSE)
> fullCov[[1]][[1]]
integer-Rle of length 48129895 with 29 runs
  Lengths: 47410302       34      350 ...      358       33   711828
  Values :        0        1        0 ...        0        1        0
> summary(fullCov[[1]][[1]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.0e+00 0.0e+00 0.0e+00 7.5e-06 0.0e+00 2.0e+00 
> max(fullCov[[1]][[1]])
[1] 2

Finally, the output of `traceback()` run immediately after an error can be useful to debug the error.

Cheers,

Leo

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Thanks, Leo. I believe R and all relevant packages are up to date:

R version 3.1.1 (2014-07-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Bioconductor: 3.0
derfinder: 1.0.3

So far as I can tell I'm using the correct chromosome naming scheme -- how's it look to you? :

> library(Rsamtools)
> library(derfinder)
> f1 <- BamFile("tophat/tophat-juncs-10_130/accepted_hits.bam")
> scanBamHeader(BamFile(files[1]))$targets
        1        10        11        12        13        14        15        16
249250621 135534747 135006516 133851895 115169878 107349540 102531392  90354753
       17        18        19         2        20        21        22         3
81195210  78077248  59128983 243199373  63025520  48129895  51304566 198022430
        4         5         6         7         8         9        MT         X
191154276 180915260 171115067 159138663 146364022 141213431     16569 155270560
        Y
59373566

Here is how I called summary():

> fullCov1 <- fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs="1")
> summary(fullCov1$chr1[1,])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0       0       0       0

And here is my call to fullCoverage() with a call to traceback() immediately afterwards -- I was unable to glean any more information from that call, sadly:

> fullCov11 <- fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"), chrs="11")
2014-10-20 19:42:47 fullCoverage: processing chromosome 11
2014-10-20 19:42:47 loadCoverage: finding chromosome lengths
Error: 1 errors; first error:
  Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges"): solving row 2: range cannot be determined from the supplied arguments (too many NAs)

For more information, use bplasterror(). To resume calculation, re-call
  the function and set the argument 'BPRESUME' to TRUE or wrap the
  previous call in bpresume().

First traceback:
  20: fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"),
          chrs = "11")
  19: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
          bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
          mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM)
  18: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
          bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
          mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM)
  17: lapply(X, FUN, ...)
  16: lapply(X, FUN, ...)
  15: FUN(1L[[1L]], ...)
  14: .try
> traceback()
5: stop(paste(msg, collapse = "\n\n"), call. = FALSE)
4: LastError$store(results = results, is.error = is.error, throw.error = TRUE)
3: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
       bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
       mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM)
2: bplapply(seq_len(length(chrs)), loadChr, files = files, chrs = chrs,
       bai = bai, chrlens = chrlens, outputs = outputs, cutoff = cutoff,
       mc.cores.load = mc.cores.load, ..., BPPARAM = BPPARAM)
1: fullCoverage(c("tophat/tophat-juncs-10_130/accepted_hits.bam"),
       chrs = "11")

Thanks again.

Jessica

ADD REPLY • link 11.1 years ago jessica.hekman ▴ 40

0

Entering edit mode

Hi Jessica,

When you ran:

> summary(fullCov1$chr1[1,])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0       0       0       0

You got the summary for the coverage across all samples (just 1 in your case because you only are supplying 1 BAM file) for the first base of chromosome 1. This should be more obvious if you run:

dim(fullCov1$chr1)

The appropriate summary call should be:

summary(fullCov1$chr1[[1]])

Next, you mixed objects when you ran:

> f1 <- BamFile("tophat/tophat-juncs-10_130/accepted_hits.bam")
> scanBamHeader(BamFile(files[1]))$targets

I was interested in the equivalent in your case. That is, run:

scanBamHeader(BamFile(f1))$targets

As for the `traceback()` call, it doesn't help much in this case, but it does in others. Anyhow, lets see your `scanBamHeader()$targets` output first.

Finally, regarding the `sessionInfo()`, it's important to post it so others in the future can follow the thread. Otherwise they won't know what was "the latest Bioconductor version" or "latest package version". Plus, in other bug reports, the bug might have been fixed in another version, or it can help the package developer trace the problem.

Cheers,

Leo

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Sorry, I did mess up my code there, and it looks like indeed we ARE getting some results once I query the result of fullCoverage() correctly:

> summary(fullCov1$chr1[[1]])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    0.000     0.000     0.000     0.788     0.000 21150.000

That's a relief!

As for the chromosome naming conventions:

> scanBamHeader(f1)$targets
        1        10        11        12        13        14        15        16
122678785  69331447  74389097  72498081  63241923  60966679  64190966  59632846
       17        18        19         2        20        21        22        23
 64289059  55844845  53741614  85426708  58134056  50858623  61439934  52294480
       24        25        26        27        28        29         3        30
 47698779  51628933  38964690  45876710  41182112  41845238  91889043  40214260
       31        32        33        34        35        36        37        38
 39895921  38810281  31377067  42124431  26524999  30810995  30902991  23914537
        4         5         6         7         8         9        MT         X
 88276631  88915250  77573801  80974532  74330416  61074082     16727 123869142

I apologize for the sloppy code!

So it looks like derfinder is behaving as expected for chromosome 1 (thanks for catching that), but I am still perplexed why I'm getting that error with other chromosomes.

(Noted about reporting versions every time; will do.)

Jessica

ADD REPLY • link 11.1 years ago jessica.hekman ▴ 40

0

Entering edit mode

Ok, so summary() is working. And you are indeed working with NCBI chromosome names, although you have more chromosomes than in human.

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Hopefully `scanBamHeaders()$targets` will reveal more information. Otherwise I'm going to need a small reproducible example to debug the error.

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

1

Entering edit mode

OK, I have a 28M BAM which just contains reads from chr11, and its index (or you could create the index file yourself if that's easier for you). fullCoverage() produces the same error on this file as on the full one.

http://arborius.net/~jphekman/10_130-chr11.bam

http://arborius.net/~jphekman/10_130-chr11.bam.bai

small11 <- fullCoverage(c("10_130-chr11.bam"), chrs="11")

(I may not be able to post again immediately -- I seem to have reached my limit as a new user and have to wait an hour.)

Jessica

ADD REPLY • link 11.1 years ago jessica.hekman ▴ 40

0

Entering edit mode

Thanks, I'll look into it!

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

While I expand `derfinder` and make it more flexible, if you are indeed working only with chromosomes 1 to 22 and X, this will currently work:

> library('derfinder')
> bam <- '10_130-chr11.bam'
> names(bam) <- 'sample1'
> fullCov <- fullCoverage(bam, chrs = paste0('chr', c(1:22, 'X')), fileStyle='NCBI')

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Also, if the genome that you are using is not on GenomeInfoDb, consider submitting the information required so it's added. See more at http://www.bioconductor.org/packages/release/bioc/vignettes/GenomeInfoDb/inst/doc/Accept-organism-for-GenomeInfoDb.pdf

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

I am using Canis familiaris, so that explains it, I guess -- I should have mentioned that at the outset! I'll look into adding it to Bioconductor, but I think it's already there.

CanFam has 38 chromosomes, but I can at least start working with the first 22 for now. Your code worked great. What is "sample1" -- where should I go to learn more about how/why setting names(bam) to sample1 helped?

Thanks,

Jessica

ADD REPLY • link 11.1 years ago jessica.hekman ▴ 40

0

Entering edit mode

Hi Jessica,

I don't see Canis familiaris on the list of currently supported genomes by `GenomeInfoDb`:

> names(genomeStyles())
 [1] "Arabidopsis_thaliana"     "Caenorhabditis_elegans"   "Cyanidioschyzon_merolae"  "Drosophila_melanogaster"  "Homo_sapiens"            
 [6] "Mus_musculus"             "Oryza_sativa"             "Populus_trichocarpa"      "Saccharomyces_cerevisiae" "Zea_mays"                
> packageVersion('GenomeInfoDb')
[1] ‘1.2.0’

In any case, I'm working on this issue.

As for

names(bam) <- 'sample1'

That was just to make the printed output shorter (for example, run `fullCov[[1]]`). It had nothing to do with getting this to work.

Cheers,
Leo

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

I've now written up a file for Canis familiaris to submit to GenomeInfoDb, but though the PDF documentation is very clear about how to format the file, it isn't clear about where to send it once it's complete -- do you know? Otherwise I'll just post a new query here (I searched this forum and also didn't find an answer).

Jessica

ADD REPLY • link 11.1 years ago jessica.hekman ▴ 40

0

Entering edit mode

Hi Jessica,

I just asked this question in the Bioc-devel mailing list (see https://stat.ethz.ch/pipermail/bioc-devel/2014-October/006520.html). We'll hear shortly as to whom to contact with the information you built.

Cheers,

Leo

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Hi Jessica,

Please contact Sonali Arora for adding Canis familiaris to GenomeInfoDb. See https://stat.ethz.ch/pipermail/bioc-devel/2014-October/006523.html

Cheers,

Leo

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

I emailed it -- thanks! I'm guessing that just having this file will not allow me to access the rest of the dog chromosomes with derfinder, right? I still need to wait for your updates?

Thanks again for all your help; it's really appreciated.

Jessica

ADD REPLY • link 11.1 years ago jessica.hekman ▴ 40

0

Entering edit mode

I think it didn't add more information -- I believe the chromosome naming scheme is correct, right? So you'll want an example?

The BAM file is 1.1G, so I will see if I can get something smaller for you.

Jessica

ADD REPLY • link 11.1 years ago jessica.hekman ▴ 40

0

Entering edit mode

If you just include a few reads, say 1000, that' be great. In the meantime, I have an idea of what could be going wrong.

See:

> library(GenomeInfoDb)
> mapSeqlevels('2', 'NCBI')
     2     
[1,] "2"   
[2,] "LGII"
> mapSeqlevels('11', 'NCBI')
     11    
[1,] "11"  
[2,] "LGXI"
> mapSeqlevels(c('1', '2'), 'NCBI')
  1   2 
"1" "2" 
> genomeStyles()$Populus_trichocarpa[c(2, 11), ]
   circular auto   sex NCBI JGI2.F
2     FALSE TRUE FALSE LGII      2
11    FALSE TRUE FALSE LGXI     11

`derfinder` uses `GenomeInfoDb` and your use case releaved that you get a non-human chromosome when mapping '2' or '11' to the 'NCBI' naming style. When you also map '1' along with '2' or '11', you only get human chromosome names as intended. Furthermore, in your use case you have non-standard chromosomes (if you are indeed working with Homo sapiens data).

In short, I'll see what I can do to make `derfinder` more flexible and species-aware.

ADD REPLY • link 11.1 years ago Leonardo Collado Torres ★ 1.1k

score 4 · Accepted Answer · 2014-10-21

Hello,

Here is the full answer to this question.

During the interchange of comments, we realized two things. First, `summary()` was being used on the first row of one sample (which will frequently be all 0s) instead of all the bases in a chromosome from one sample. Second, `derfinder` relies on `GenomeInfoDb` for using the correct sequence naming style. The interaction with `GenomeInfoDb` was too rigid and `derfinder` broke when analyzing data from an organism not present in `GenomeInfoDb`.

After quite a bit of work, version 1.1.5 of `derfinder` (Bioc-devel) and 1.0.4 (Bioc-release) have fixed this issue as can be seen below.

First we load `derfinder`, specify the BAM file, and load the coverage data for all _Canis familiaris_ chromosomes.

> library('derfinder')
## BAM file: http://arborius.net/~jphekman/10_130-chr11.bam
## BAI file: http://arborius.net/~jphekman/10_130-chr11.bam.bai
> bam <- '10_130-chr11.bam'
> fullCov <- fullCoverage(bam, chrs = c(1:38, 'MT', 'X'),
+     species = 'canis_familiaris')
## Lots of messages deleted
## species = 'canis_familiaris' is at this moment not supported by GenomeInfoDb but should be soon
## given https://support.bioconductor.org/p/62136/#62180
## In any case, derfinder doesn't crash.

Next we can verify that there is only coverage information for chromosome 11 because that is the way the BAM file was constructed.

## Check that there is some coverage info for the first sample
## Given how the BAM file was constructed, there will only be data for chr 11.
> sapply(fullCov, function(x) { sum(x[[1]])})
       1        2        3        4        5 
       0        0        0        0        0 
       6        7        8        9       10 
       0        0        0        0        0 
      11       12       13       14       15 
39961057        0        0        0        0 
      16       17       18       19       20 
       0        0        0        0        0 
      21       22       23       24       25 
       0        0        0        0        0 
      26       27       28       29       30 
       0        0        0        0        0 
      31       32       33       34       35 
       0        0        0        0        0 
      36       37       38       MT        X 
       0        0        0        0        0 
       

Below is the correct `summary()` call and an illustration of the dimensions of the data.
       
## Dimension per chr is number of base pairs in chr times number of samples
> dim(fullCov[[1]])
[1] 122678785         1 

## Correct summary() call for the first sample in chr11
> summary(fullCov[['11']][[1]])
    Min.  1st Qu.   Median     Mean  3rd Qu. 
   0.000    0.000    0.000    0.537    0.000 
    Max. 
7799.000 

Reproducibility information.
      
## Versions used
> packageVersion('derfinder')
[1] ‘1.1.5’
> packageVersion('GenomeInfoDb')
[1] ‘1.2.0’

If you like this answer, please select it as the answer to close this question (only possible if you asked the question) or upvote it (available for anyone).

Best,
Leo

PS Details here.