Question

blocking factors in diffbind for edgeR and DESeq2 analysis

0

Entering edit mode

armandorp • 0

@armandorp-9666

Last seen 3.4 years ago

Dear all,

I am doing a differential binding analysis of ChiP-Seq data for two conditions, i.e. Control and Patient. These samples were analysed in two different batches while keeping a proportional number of control and patient samples in each batch. I do not have input DNA and the sequencing data were paired-end.

I want to consider the batch effects in the model by blocking them as a factor. Thus, I decided to use DiffBind, edgeR and DESeq2 and their respective procedures for blocking factors. My results are different for the different tools and using the same matrix of read counts provided by DiffBind. I found some details about how the edgeR analysis in diffbind is done in DiffBind but I couldn’t find the detailed description for blocking or multiple-factors…

So I am not sure what I am missing, I was expecting to have similar results between Edger/DESeq2 and DiffBind, I would appreciate your comments.

Here there are some details of the sample sheet and the followed procedure for the analyses.

SampleID; Tissue; Factor; Condition; Treatment; Replicate; bamReads; Peaks; PeakCaller
sample_1; F; PAT; PAT; 1st_batch; 1; sample_1.bam; sample_1.bed; narrow
sample_2; F; PATM; PAT; 1st_batch; 1; sample_2.bam; sample_2.bed; narrow
sample_3; M; CONT; CONT; 1st_batch; 1; sample_3.bam; sample_3.bed; narrow
sample_4; F; CONT; CONT; 1st_batch; 1; sample_4.bam; sample_4.bed; narrow
sample_5; M; CONT; CONT; 1st_batch; 1; sample_5.bam; sample_5.bed; narrow
sample_6; M; PAT; PAT; 1st_batch; 1; sample_6.bam; sample_6.bed; narrow
sample_7; F; CONT; CONT; 1st_batch; 1; sample_7.bam; sample_7.bed; narrow
sample_8; M; PATM; PAT; 1st_batch; 1; sample_8.bam; sample_8.bed; narrow
sample_9; M; CONT; CONT; 1st_batch; 1; sample_9.bam; sample_9.bed; narrow
sample_10; F; CONT; CONT; 2nd_batch; 1; sample_10.bam; sample_10.bed; narrow
sample_11; F; CONT; CONT; 2nd_batch; 1; sample_11.bam; sample_11.bed; narrow
sample_12; F; PAT; PAT; 2nd_batch; 1; sample_12.bam; sample_12.bed; narrow
sample_13; F; PATM; PAT; 2nd_batch; 1; sample_13.bam; sample_13.bed; narrow
sample_14; M; PAT; PAT; 2nd_batch; 1; sample_14.bam; sample_14.bed; narrow
sample_15; M; PAT; PAT; 2nd_batch; 1; sample_15.bam; sample_15.bed; narrow

I am running DiffBind using these commands/parameters for blocking the factor related to batch effects:

diffdata <- dba(sampleSheet=samples, bCorPlot=FALSE)
diffdata$config$singleEnd <- FALSE
diffdata <- dba.count(diffdata, bCorPlot=FALSE, bUseSummarizeOverlaps=TRUE)
diffdata <- dba.contrast ( diffdata , categories = DBA_CONDITION , block = DBA_TREATMENT )
diffdata <- dba.analyze ( diffdata , bCorPlot = FALSE , method = DBA_ALL_METHODS_BLOCK )

This is the resulting report from DiffBind where the default threshold for the FDR is 0.1:

1 Contrast:
  Group1 Members1 Group2 Members2 Block1Val InBlock1 Block2Val InBlock2 DB.edgeR DB.edgeR.block DB.DESeq DB.DESeq.block
1    PAT       16   CONT       15 1st_batch       15 2nd_batch       16        1              8        0              6
  DB.DESeq2 DB.DESeq2.block
1         2               4

Then I am trying to repeat the same analysis in EDGER and DESEQ2 but using the resulting raw read counts matrix from DiffBind, and I have different numbers of peaks for edgeR.

# Raw Count from Consensus peaks from DiffBind
myDBA <- dba.count(diffdata, peaks=NULL, score=DBA_SCORE_READS, bCorPlot=FALSE)
ipeaks <- dba.peakset(myDBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, peak.caller="narrow")
ipeaks <- ipeaks[,-c(1,2,3)]

samples$Condition <- factor(samples$Condition)
Condition <- factor(samples$Condition)
samples$Treatment <- factor(samples$Treatment)
Treatment <- factor(samples$Treatment)

design <- model.matrix(~Treatment + Condition)
rownames(design) <- samples$SampleID
rownames(samples) <- samples$SampleID

edgeR Analysis

y <- DGEList(counts = ipeaks, group = Condition)
y <- calcNormFactors(y, method="TMM")
y <- estimateCommonDisp(y)
y <- estimateGLMCommonDisp(y, design)
y$GLM <- glmFit(y,design)
y$LRT <- glmLRT(y$GLM, coef=3)
dbpeaks.edger <- topTags(y$LRT, nrow(ipeaks))$table 
dbpeaks.edger <- subset(dbpeaks.edger, FDR < 0.1)

> nrow(dbpeaks.edger)
[1] 829

DESeq2 Analysis

dds <- DESeqDataSetFromMatrix(ipeaks, colData= samples, design = ~Treatment + Condition)
# Differential Binding analysis
dds <- DESeq(dds)
res <- results(dds)
# Reporting results
dbpeaks.deseq2 <- subset(res, padj< 0.1)

> nrow(dbpeaks.deseq2)
[1] 5

sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] csaw_1.4.1                 rtracklayer_1.30.1         clusterProfiler_2.4.3      ChIPQC_1.6.1
[5] DiffBind_1.16.1            RSQLite_1.0.0              DBI_0.3.1                  locfit_1.5-9.1
[9] GenomicAlignments_1.6.3    Rsamtools_1.22.0           plyr_1.8.3                 ggplot2_2.0.0
[13] matrixStats_0.50.1         preprocessCore_1.32.0      biomaRt_2.26.1             igraph_1.0.1
[17] reshape2_1.4.1             vsn_3.38.0                 geneplotter_1.48.0         annotate_1.48.0
[21] XML_3.98-1.3               AnnotationDbi_1.32.3       lattice_0.20-33            edgeR_3.12.0
[25] limma_3.26.3               DESeq2_1.10.1              RcppArmadillo_0.6.300.2.0 Rcpp_0.12.3
[29] SummarizedExperiment_1.0.2 Biobase_2.30.0             Biostrings_2.38.3          XVector_0.10.0
[33] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3         IRanges_2.4.6              S4Vectors_0.8.11
[37] BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] Category_2.36.0        bitops_1.0-6           httr_1.1.0             RColorBrewer_1.1-2     tools_3.2.2
[6] R6_2.1.2               affyio_1.40.0          rpart_4.1-10           KernSmooth_2.23-15     Hmisc_3.17-1
[11] colorspace_1.2-6       nnet_7.3-12            gridExtra_2.0.0        Nozzle.R1_1.1-1        sendmailR_1.2-1
[16] graph_1.48.0           SparseM_1.7            topGO_2.22.0           caTools_1.17.1         scales_0.3.0
[21] checkmate_1.7.1        BatchJobs_1.6          genefilter_1.52.1      affy_1.48.0            RBGL_1.46.0
[26] digest_0.6.9           stringr_1.0.0          foreign_0.8-66         DOSE_2.8.2             AnnotationForge_1.12.1
[31] base64enc_0.1-3        BBmisc_1.9             BiocInstaller_1.20.1   GOstats_2.36.0         hwriter_1.3.2
[36] BiocParallel_1.4.3     gtools_3.5.0           GOSemSim_1.28.2        acepack_1.3-3.3        RCurl_1.95-4.7
[41] magrittr_1.5           GO.db_3.2.2            Formula_1.2-1          futile.logger_1.4.1    Matrix_1.2-3
[46] munsell_0.4.2          stringi_1.0-1          zlibbioc_1.16.0        qvalue_2.2.2           gplots_2.17.0
[51] fail_1.3               grid_3.2.2             DO.db_2.9              gdata_2.17.0           splines_3.2.2
[56] GenomicFeatures_1.22.6 KEGGREST_1.10.1        rjson_0.2.15           systemPipeR_1.4.8      futile.options_1.0.0
[61] ShortRead_1.28.0       latticeExtra_0.6-26    lambda.r_1.1.7         png_0.1-7              gtable_0.1.2
[66] amap_0.8-14            chipseq_1.20.0         xtable_1.8-0           survival_2.38-3        pheatmap_1.0.8
[71] cluster_2.0.3          brew_1.0-6             GSEABase_1.32.0

diffbind • 2.0k views

ADD COMMENT • link updated 8.1 years ago by Rory Stark ★ 5.2k • written 8.2 years ago by armandorp • 0

score 1 · Answer 1 · 2016-03-09

Oops, I just noticed this from a while back...

I notice two ways that your comparison between the DiffBind analysis and direct analysis using the "raw" reads differ.

First, by default DiffBind subtracts the number of reads in the Control track (scaled by default). You can get these counts by using score=DBA_SCORE_READS_MINUS in your call to dba.count(), or avoid the subtraction by setting bSubControl=FALSE when calling dba.analyze(). If the subtraction of control reads has an effect, it is often to eliminate artifactual peaks from the analysis.

Secondly, by default DiffBind uses the "full" library size when calculating the normalization factors. This is the total number of reads in the .bam file, rather than the number of reads within peaks (basically the sum of each column in the count matrix). You can make DiffBind do what edgeR and DESeq2 do by default by setting bFullLibrarySize=FALSE when calling dba.analyze().

There may be some other things going on, but these two are the most obvious.

Cheers-

Rory