Question

Possible bug at DiffBind-edgeR interface

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 22 hours ago

The city by the bay

One of the changes from DiffBind 1.12.3 to DiffBind 1.14.0 is at line 1045 of analyze.R, where the line:

    idx = match(as.integer(sites),1:nrow(pv$allvectors))

in the pv.getsites function is replaced with:

    idx = match(as.integer(sites), rownames(pv$allvectors))

Reading through the rest of the code indicates that sites is a vector taken from the annotation field from topTags (which, in turn, is the genes object in the DGEList) and pv is a DBA object. I presume that this command is meant to match up the p-value-sorted sites from topTags to the original order of the rows in pv.

However, the DGEList is constructed in line 232 of analyze.R (in the pv.DEinit function) as:

    res = DGEList(counts,lib.size=libsize,
              group=groups,genes=as.character(1:nrow(counts)))

Consider a case where the row names of pv$allvectors are not consecutive integers starting at 1. This means that matching of sites to the row names in pv.getsites will not recover the original ordering of rows. Rather, they should either be matched to 1:nrow(pv$allvectors), as done in version 1.12.3; or, the genes object should be set to the row names during DGEList construction.

This issue can be illustrated using the tamoxifen example:

data(tamoxifen_counts)
tamoxifen = dba.analyze(tamoxifen)
dba.report(tamoxifen)

This gives a top hit at "chr18:1302062-1303518" (paraphrased from GRanges output). If you shuffle the row names before proceeding:

rownames(tamoxifen$allvectors) <- sample(rownames(tamoxifen$allvectors))
tamoxifen = dba.analyze(tamoxifen) 
dba.report(tamoxifen)

you get a different result for version 1.14.0. In contrast, these two sets of commands yield the same result for version 1.12.3 (top hit of "chr18:23914866-23916341" in both cases).

Here's some session information for the run with version 1.14.0:

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] DiffBind_1.14.0         RSQLite_1.0.0           DBI_0.3.1              
 [4] locfit_1.5-9.1          GenomicAlignments_1.4.1 Rsamtools_1.20.1       
 [7] Biostrings_2.36.0       XVector_0.8.0           limma_3.24.2           
[10] GenomicRanges_1.20.3    GenomeInfoDb_1.4.0      IRanges_2.2.1          
[13] S4Vectors_0.6.0         BiocGenerics_0.14.0    

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.5            lattice_0.20-31        GO.db_3.1.2           
 [4] gtools_3.4.2           digest_0.6.8           plyr_1.8.2            
 [7] futile.options_1.0.0   BatchJobs_1.6          ShortRead_1.26.0      
[10] ggplot2_1.0.1          gplots_2.16.0          zlibbioc_1.14.0       
[13] annotate_1.46.0        gdata_2.13.3           Matrix_1.2-0          
[16] checkmate_1.5.2        systemPipeR_1.2.0      proto_0.3-10          
[19] GOstats_2.34.0         splines_3.2.0          BiocParallel_1.2.0    
[22] stringr_0.6.2          pheatmap_1.0.2         munsell_0.4.2         
[25] sendmailR_1.2-1        base64enc_0.1-2        BBmisc_1.9            
[28] fail_1.2               edgeR_3.10.0           XML_3.98-1.1          
[31] AnnotationForge_1.10.0 MASS_7.3-40            bitops_1.0-6          
[34] grid_3.2.0             RBGL_1.44.0            xtable_1.7-4          
[37] GSEABase_1.30.1        gtable_0.1.2           scales_0.2.4          
[40] graph_1.46.0           KernSmooth_2.23-14     amap_0.8-14           
[43] hwriter_1.3.2          reshape2_1.4.1         genefilter_1.50.0     
[46] latticeExtra_0.6-26    futile.logger_1.4.1    brew_1.0-6            
[49] rjson_0.2.15           lambda.r_1.1.7         RColorBrewer_1.1-2    
[52] tools_3.2.0            Biobase_2.28.0         Category_2.34.0       
[55] survival_2.38-1        AnnotationDbi_1.30.1   colorspace_1.2-6      
[58] caTools_1.17.1

DiffBind edgeR • 1.4k views

ADD COMMENT • link updated 10.2 years ago by Rory Stark ★ 5.2k • written 10.2 years ago by Aaron Lun ★ 28k

score 1 · Accepted Answer · 2015-04-29

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 6 months ago

Cambridge, UK

Hi Aaron-

This looks interesting, I'll take a look. It certainly wouldn't be the first time I introduced a new bug in the process of fixing a known issue!

What would be very useful would be if you have a use case where this shows up? It doesn't surprise me that randomly permuting an internal data field results in incorrect behavior. Did you run into a scenario where using the package via the documented interface triggers this error?

Cheers-

Rory

ADD COMMENT • link 10.2 years ago Rory Stark ★ 5.2k

0

Entering edit mode

I first encountered this behaviour in an analysis with some real paired-end ChIP-seq data, running DiffBind on peaks called by HOMER in each of 8 libraries (4 groups of 2 replicates). I reformatted the HOMER peak files into BED format beforehand. The DiffBind commands were:

current <- dba(sampleSheet=data.frame(SampleID=prefix, Condition=grouping,
    bamReads=bam.files, Peaks=unlist(all.peakfiles), PeakCaller="bed"), minOverlap=2)
current$config$singleEnd <- FALSE
current <- dba.count(current, bRemoveDuplicates=FALSE, bUseSummarizeOverlaps=TRUE)
current <- dba.contrast(current, minMembers=2, categories=DBA_CONDITION)
current <- dba.analyze(current, method=DBA_EDGER)
out <- dba.report(current, contrast=1, th=1)

where bam.files and all.peakfiles are character vectors of the respective files, and prefix is a character vector of file accessions. There are 100132 sites in total; most of these have row names of allvectors that are in order, but scattered throughout are 5 sites that are not in order.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

score 1 · Accepted Answer · 2015-04-30

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 6 months ago

Cambridge, UK

Hi Aaron-

To clarify, the issues is with rownames(out), correct? These should be sorted by the same order as topTags, but they are not? Can you clarify what are the "5 sites that are not in order"? Is there a line of code that would compare what it should be to what it is to show the error?

It would be very helpful if you could send me a copy of your DBA object current, in two forms: a) after the call to dba(), but before counting, and b) the final one after the call to dba.analyze().

Thanks, I really would like to get to the bottom of this!

Cheers-

Rory

rory.stark@cruk.cam.ac.uk

ADD COMMENT • link 10.2 years ago Rory Stark ★ 5.2k

0

Entering edit mode

The main issue is that the genomic coordinates in out do not match up correctly with the statistics in the metadata of out. Running the code on version 1.12.3 gives me a top set (paraphrased for brevity) of:

            Coordinates      logFC         PValue          FDR
chr12:12935208-12944441      1.431     4.553e-143   4.559e-138
chr4:145669576-145682443     1.608     3.889e-119   1.947e-114
chr8:70465786-70479257      -1.605     5.757e-115   1.921e-110
chr1:135125190-135134391    -2.767     5.489e-96    1.374e-91
chr13:95517069-95528301      1.631     1.502e-89    3.008e-85

while doing the same on version 1.14.0 gives:

            Coordinates      logFC         PValue          FDR
chr12:12935208-12944441      1.431     4.553e-143   4.559e-138
chr4:145685538-145686888     1.608     3.889e-119   1.947e-114
chr8:70500394-70512743      -1.605     5.757e-115   1.921e-110
chr1:135125190-135134391    -2.767     5.489e-96    1.374e-91
chr13:95517069-95528301      1.631     1.502e-89    3.008e-85

You can see that the statistics are exactly the same, yet some of the genomic coordinates are different. Some checking suggests that the results from version 1.12.3 are correct, simply based on visually comparing the log-fold changes and abundances on the coverage profiles with those reported above.

For more technical detail; yes, the problem is that the row names of out (which are taken from the row names of allvectors) do not match up to the ordering of the rows in the genes field of topTags. As I mentioned, I think this is caused by the fact that there are several sites that are not in order for the row names of allvectors:

> is.unsorted(as.integer(rownames(current$allvectors)))
[1] TRUE

For example:

> rownames(current$allvectors)[97735:97740]
[1] "97732"  "97733"  "100132" "100130" "97734"  "97735"

The DBA object is a bit big, so I'll see if I can upload it somewhere instead of sending it to you directly.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Okay, I've saved the DBA objects as serialized R objects and put them up on Dropbox:

https://www.dropbox.com/sh/rdsx1yffpqzi4ct/AABhxMQBZ9aHn4Jee1KA_5H4a?dl=0

The "before" file contains the object just after creation but before counting (i.e., before the ...$config <- FALSE step) while the "after" file contains the object after running dba.analyze. My previous mentions of current refer to the object after analysis, i.e., that in the "after" file.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

1

Entering edit mode

Thanks Aaron, I got what I needed and found the bug.

The new line of code isn't the problem, it just exposed an older one what's been lurking for a while. It has to do with the sort order applied to chromosome names. I'll post here when I check in a fix.

Thanks again for helping me get to the bottom of this!

Cheers-

Rory

ADD REPLY • link 10.2 years ago Rory Stark ★ 5.2k