blocking factors in diffbind for edgeR and DESeq2 analysis
Entering edit mode
armandorp • 0
Last seen 11 months ago

Dear all,

I am doing a differential binding analysis of ChiP-Seq data for two conditions, i.e. Control and Patient. These samples were analysed in two different batches while keeping a proportional number of control and patient samples in each batch. I do not have input DNA and the sequencing data were paired-end.

I want to consider the batch effects in the model by blocking them as a factor. Thus, I decided to use DiffBind, edgeR and DESeq2 and their respective procedures for blocking factors. My results are different for the different tools and using the same matrix of read counts provided by DiffBind. I found some details about how the edgeR analysis in diffbind is done in DiffBind but I couldn’t find the detailed description for blocking or multiple-factors… 

So I am not sure what I am missing, I was expecting to have similar results between Edger/DESeq2 and DiffBind, I would appreciate your comments.

Here there are some details of the sample sheet and the followed procedure for the analyses. 

SampleID; Tissue; Factor; Condition; Treatment; Replicate; bamReads; Peaks; PeakCaller
sample_1; F; PAT; PAT; 1st_batch; 1; sample_1.bam; sample_1.bed; narrow
sample_2; F; PATM; PAT; 1st_batch; 1; sample_2.bam; sample_2.bed; narrow
sample_3; M; CONT; CONT; 1st_batch; 1; sample_3.bam; sample_3.bed; narrow
sample_4; F; CONT; CONT; 1st_batch; 1; sample_4.bam; sample_4.bed; narrow
sample_5; M; CONT; CONT; 1st_batch; 1; sample_5.bam; sample_5.bed; narrow
sample_6; M; PAT; PAT; 1st_batch; 1; sample_6.bam; sample_6.bed; narrow
sample_7; F; CONT; CONT; 1st_batch; 1; sample_7.bam; sample_7.bed; narrow
sample_8; M; PATM; PAT; 1st_batch; 1; sample_8.bam; sample_8.bed; narrow
sample_9; M; CONT; CONT; 1st_batch; 1; sample_9.bam; sample_9.bed; narrow
sample_10; F; CONT; CONT; 2nd_batch; 1; sample_10.bam; sample_10.bed; narrow
sample_11; F; CONT; CONT; 2nd_batch; 1; sample_11.bam; sample_11.bed; narrow
sample_12; F; PAT; PAT; 2nd_batch; 1; sample_12.bam; sample_12.bed; narrow
sample_13; F; PATM; PAT; 2nd_batch; 1; sample_13.bam; sample_13.bed; narrow
sample_14; M; PAT; PAT; 2nd_batch; 1; sample_14.bam; sample_14.bed; narrow
sample_15; M; PAT; PAT; 2nd_batch; 1; sample_15.bam; sample_15.bed; narrow

I am running DiffBind using these commands/parameters for blocking the factor related to batch effects:

diffdata <- dba(sampleSheet=samples, bCorPlot=FALSE)
diffdata$config$singleEnd <- FALSE
diffdata <- dba.count(diffdata, bCorPlot=FALSE, bUseSummarizeOverlaps=TRUE)
diffdata <- dba.contrast ( diffdata , categories = DBA_CONDITION , block = DBA_TREATMENT )
diffdata <- dba.analyze ( diffdata , bCorPlot = FALSE , method = DBA_ALL_METHODS_BLOCK )

This is the resulting report from DiffBind where the default threshold for the FDR is 0.1:

1 Contrast:
  Group1 Members1 Group2 Members2 Block1Val InBlock1 Block2Val InBlock2 DB.edgeR DB.edgeR.block DB.DESeq DB.DESeq.block
1    PAT       16   CONT       15 1st_batch       15 2nd_batch       16        1              8        0              6
  DB.DESeq2 DB.DESeq2.block
1         2               4

Then I am trying to repeat the same analysis in EDGER and DESEQ2 but using the resulting raw read counts matrix from DiffBind, and I have different numbers of peaks for edgeR.

# Raw Count from Consensus peaks from DiffBind
myDBA <- dba.count(diffdata, peaks=NULL, score=DBA_SCORE_READS, bCorPlot=FALSE)
ipeaks <- dba.peakset(myDBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, peak.caller="narrow")
ipeaks <- ipeaks[,-c(1,2,3)]

samples$Condition <- factor(samples$Condition)
Condition <- factor(samples$Condition)
samples$Treatment <- factor(samples$Treatment)
Treatment <- factor(samples$Treatment)

design <- model.matrix(~Treatment + Condition)
rownames(design) <- samples$SampleID
rownames(samples) <- samples$SampleID


edgeR Analysis

y <- DGEList(counts = ipeaks, group = Condition)
y <- calcNormFactors(y, method="TMM")
y <- estimateCommonDisp(y)
y <- estimateGLMCommonDisp(y, design)
y$GLM <- glmFit(y,design)
y$LRT <- glmLRT(y$GLM, coef=3)
dbpeaks.edger <- topTags(y$LRT, nrow(ipeaks))$table 
dbpeaks.edger <- subset(dbpeaks.edger, FDR < 0.1)

> nrow(dbpeaks.edger)
[1] 829


DESeq2 Analysis

dds <- DESeqDataSetFromMatrix(ipeaks, colData= samples, design = ~Treatment + Condition)
# Differential Binding analysis
dds <- DESeq(dds)
res <- results(dds)
# Reporting results
dbpeaks.deseq2 <- subset(res, padj< 0.1)

> nrow(dbpeaks.deseq2)
[1] 5



R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] csaw_1.4.1                 rtracklayer_1.30.1         clusterProfiler_2.4.3      ChIPQC_1.6.1             
[5] DiffBind_1.16.1            RSQLite_1.0.0              DBI_0.3.1                  locfit_1.5-9.1           
[9] GenomicAlignments_1.6.3    Rsamtools_1.22.0           plyr_1.8.3                 ggplot2_2.0.0            
[13] matrixStats_0.50.1         preprocessCore_1.32.0      biomaRt_2.26.1             igraph_1.0.1             
[17] reshape2_1.4.1             vsn_3.38.0                 geneplotter_1.48.0         annotate_1.48.0          
[21] XML_3.98-1.3               AnnotationDbi_1.32.3       lattice_0.20-33            edgeR_3.12.0             
[25] limma_3.26.3               DESeq2_1.10.1              RcppArmadillo_0.6.300.2.0  Rcpp_0.12.3              
[29] SummarizedExperiment_1.0.2 Biobase_2.30.0             Biostrings_2.38.3          XVector_0.10.0           
[33] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3         IRanges_2.4.6              S4Vectors_0.8.11         
[37] BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
[1] Category_2.36.0        bitops_1.0-6           httr_1.1.0             RColorBrewer_1.1-2     tools_3.2.2          
[6] R6_2.1.2               affyio_1.40.0          rpart_4.1-10           KernSmooth_2.23-15     Hmisc_3.17-1         
[11] colorspace_1.2-6       nnet_7.3-12            gridExtra_2.0.0        Nozzle.R1_1.1-1        sendmailR_1.2-1      
[16] graph_1.48.0           SparseM_1.7            topGO_2.22.0           caTools_1.17.1         scales_0.3.0         
[21] checkmate_1.7.1        BatchJobs_1.6          genefilter_1.52.1      affy_1.48.0            RBGL_1.46.0          
[26] digest_0.6.9           stringr_1.0.0          foreign_0.8-66         DOSE_2.8.2             AnnotationForge_1.12.1
[31] base64enc_0.1-3        BBmisc_1.9             BiocInstaller_1.20.1   GOstats_2.36.0         hwriter_1.3.2        
[36] BiocParallel_1.4.3     gtools_3.5.0           GOSemSim_1.28.2        acepack_1.3-3.3        RCurl_1.95-4.7       
[41] magrittr_1.5           GO.db_3.2.2            Formula_1.2-1          futile.logger_1.4.1    Matrix_1.2-3         
[46] munsell_0.4.2          stringi_1.0-1          zlibbioc_1.16.0        qvalue_2.2.2           gplots_2.17.0        
[51] fail_1.3               grid_3.2.2             DO.db_2.9              gdata_2.17.0           splines_3.2.2        
[56] GenomicFeatures_1.22.6 KEGGREST_1.10.1        rjson_0.2.15           systemPipeR_1.4.8      futile.options_1.0.0 
[61] ShortRead_1.28.0       latticeExtra_0.6-26    lambda.r_1.1.7         png_0.1-7              gtable_0.1.2         
[66] amap_0.8-14            chipseq_1.20.0         xtable_1.8-0           survival_2.38-3        pheatmap_1.0.8       
[71] cluster_2.0.3          brew_1.0-6             GSEABase_1.32.0   

diffbind • 1.7k views
Entering edit mode
Rory Stark ★ 4.1k
Last seen 24 days ago
CRUK, Cambridge, UK

Oops, I just noticed this from a while back...

I notice two ways that your comparison between the DiffBind analysis and direct analysis using the "raw" reads differ.

First, by default DiffBind subtracts the number of reads in the Control track (scaled by default). You can get these counts by using score=DBA_SCORE_READS_MINUS in your call to dba.count(), or avoid the subtraction by setting bSubControl=FALSE when calling dba.analyze(). If the subtraction of control reads has an effect, it is often to eliminate artifactual peaks from the analysis.

Secondly, by default DiffBind uses the "full" library size when calculating the normalization factors. This is the total number of reads in the .bam file, rather than the number of reads within peaks (basically the sum of each column in the count matrix). You can make DiffBind do what edgeR and DESeq2 do by default by setting bFullLibrarySize=FALSE when calling dba.analyze().

There may be some other things going on, but these two are the most obvious.




Login before adding your answer.

Traffic: 489 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6