Dear all,
I am doing a differential binding analysis of ChiP-Seq data for two conditions, i.e. Control and Patient. These samples were analysed in two different batches while keeping a proportional number of control and patient samples in each batch. I do not have input DNA and the sequencing data were paired-end.
I want to consider the batch effects in the model by blocking them as a factor. Thus, I decided to use DiffBind, edgeR and DESeq2 and their respective procedures for blocking factors. My results are different for the different tools and using the same matrix of read counts provided by DiffBind. I found some details about how the edgeR analysis in diffbind is done in DiffBind but I couldn’t find the detailed description for blocking or multiple-factors…
So I am not sure what I am missing, I was expecting to have similar results between Edger/DESeq2 and DiffBind, I would appreciate your comments.
Here there are some details of the sample sheet and the followed procedure for the analyses.
SampleID; Tissue; Factor; Condition; Treatment; Replicate; bamReads; Peaks; PeakCaller
sample_1; F; PAT; PAT; 1st_batch; 1; sample_1.bam; sample_1.bed; narrow
sample_2; F; PATM; PAT; 1st_batch; 1; sample_2.bam; sample_2.bed; narrow
sample_3; M; CONT; CONT; 1st_batch; 1; sample_3.bam; sample_3.bed; narrow
sample_4; F; CONT; CONT; 1st_batch; 1; sample_4.bam; sample_4.bed; narrow
sample_5; M; CONT; CONT; 1st_batch; 1; sample_5.bam; sample_5.bed; narrow
sample_6; M; PAT; PAT; 1st_batch; 1; sample_6.bam; sample_6.bed; narrow
sample_7; F; CONT; CONT; 1st_batch; 1; sample_7.bam; sample_7.bed; narrow
sample_8; M; PATM; PAT; 1st_batch; 1; sample_8.bam; sample_8.bed; narrow
sample_9; M; CONT; CONT; 1st_batch; 1; sample_9.bam; sample_9.bed; narrow
sample_10; F; CONT; CONT; 2nd_batch; 1; sample_10.bam; sample_10.bed; narrow
sample_11; F; CONT; CONT; 2nd_batch; 1; sample_11.bam; sample_11.bed; narrow
sample_12; F; PAT; PAT; 2nd_batch; 1; sample_12.bam; sample_12.bed; narrow
sample_13; F; PATM; PAT; 2nd_batch; 1; sample_13.bam; sample_13.bed; narrow
sample_14; M; PAT; PAT; 2nd_batch; 1; sample_14.bam; sample_14.bed; narrow
sample_15; M; PAT; PAT; 2nd_batch; 1; sample_15.bam; sample_15.bed; narrow
I am running DiffBind using these commands/parameters for blocking the factor related to batch effects:
diffdata <- dba(sampleSheet=samples, bCorPlot=FALSE) diffdata$config$singleEnd <- FALSE diffdata <- dba.count(diffdata, bCorPlot=FALSE, bUseSummarizeOverlaps=TRUE) diffdata <- dba.contrast ( diffdata , categories = DBA_CONDITION , block = DBA_TREATMENT ) diffdata <- dba.analyze ( diffdata , bCorPlot = FALSE , method = DBA_ALL_METHODS_BLOCK )
This is the resulting report from DiffBind where the default threshold for the FDR is 0.1:
1 Contrast: Group1 Members1 Group2 Members2 Block1Val InBlock1 Block2Val InBlock2 DB.edgeR DB.edgeR.block DB.DESeq DB.DESeq.block 1 PAT 16 CONT 15 1st_batch 15 2nd_batch 16 1 8 0 6 DB.DESeq2 DB.DESeq2.block 1 2 4
Then I am trying to repeat the same analysis in EDGER and DESEQ2 but using the resulting raw read counts matrix from DiffBind, and I have different numbers of peaks for edgeR.
# Raw Count from Consensus peaks from DiffBind myDBA <- dba.count(diffdata, peaks=NULL, score=DBA_SCORE_READS, bCorPlot=FALSE) ipeaks <- dba.peakset(myDBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME, peak.caller="narrow") ipeaks <- ipeaks[,-c(1,2,3)] samples$Condition <- factor(samples$Condition) Condition <- factor(samples$Condition) samples$Treatment <- factor(samples$Treatment) Treatment <- factor(samples$Treatment) design <- model.matrix(~Treatment + Condition) rownames(design) <- samples$SampleID rownames(samples) <- samples$SampleID
edgeR Analysis
y <- DGEList(counts = ipeaks, group = Condition)
y <- calcNormFactors(y, method="TMM")
y <- estimateCommonDisp(y)
y <- estimateGLMCommonDisp(y, design)
y$GLM <- glmFit(y,design)
y$LRT <- glmLRT(y$GLM, coef=3)
dbpeaks.edger <- topTags(y$LRT, nrow(ipeaks))$table
dbpeaks.edger <- subset(dbpeaks.edger, FDR < 0.1)
> nrow(dbpeaks.edger)
[1] 829
DESeq2 Analysis
dds <- DESeqDataSetFromMatrix(ipeaks, colData= samples, design = ~Treatment + Condition)
# Differential Binding analysis
dds <- DESeq(dds)
res <- results(dds)
# Reporting results
dbpeaks.deseq2 <- subset(res, padj< 0.1)
> nrow(dbpeaks.deseq2)
[1] 5
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] csaw_1.4.1 rtracklayer_1.30.1 clusterProfiler_2.4.3 ChIPQC_1.6.1
[5] DiffBind_1.16.1 RSQLite_1.0.0 DBI_0.3.1 locfit_1.5-9.1
[9] GenomicAlignments_1.6.3 Rsamtools_1.22.0 plyr_1.8.3 ggplot2_2.0.0
[13] matrixStats_0.50.1 preprocessCore_1.32.0 biomaRt_2.26.1 igraph_1.0.1
[17] reshape2_1.4.1 vsn_3.38.0 geneplotter_1.48.0 annotate_1.48.0
[21] XML_3.98-1.3 AnnotationDbi_1.32.3 lattice_0.20-33 edgeR_3.12.0
[25] limma_3.26.3 DESeq2_1.10.1 RcppArmadillo_0.6.300.2.0 Rcpp_0.12.3
[29] SummarizedExperiment_1.0.2 Biobase_2.30.0 Biostrings_2.38.3 XVector_0.10.0
[33] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 IRanges_2.4.6 S4Vectors_0.8.11
[37] BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] Category_2.36.0 bitops_1.0-6 httr_1.1.0 RColorBrewer_1.1-2 tools_3.2.2
[6] R6_2.1.2 affyio_1.40.0 rpart_4.1-10 KernSmooth_2.23-15 Hmisc_3.17-1
[11] colorspace_1.2-6 nnet_7.3-12 gridExtra_2.0.0 Nozzle.R1_1.1-1 sendmailR_1.2-1
[16] graph_1.48.0 SparseM_1.7 topGO_2.22.0 caTools_1.17.1 scales_0.3.0
[21] checkmate_1.7.1 BatchJobs_1.6 genefilter_1.52.1 affy_1.48.0 RBGL_1.46.0
[26] digest_0.6.9 stringr_1.0.0 foreign_0.8-66 DOSE_2.8.2 AnnotationForge_1.12.1
[31] base64enc_0.1-3 BBmisc_1.9 BiocInstaller_1.20.1 GOstats_2.36.0 hwriter_1.3.2
[36] BiocParallel_1.4.3 gtools_3.5.0 GOSemSim_1.28.2 acepack_1.3-3.3 RCurl_1.95-4.7
[41] magrittr_1.5 GO.db_3.2.2 Formula_1.2-1 futile.logger_1.4.1 Matrix_1.2-3
[46] munsell_0.4.2 stringi_1.0-1 zlibbioc_1.16.0 qvalue_2.2.2 gplots_2.17.0
[51] fail_1.3 grid_3.2.2 DO.db_2.9 gdata_2.17.0 splines_3.2.2
[56] GenomicFeatures_1.22.6 KEGGREST_1.10.1 rjson_0.2.15 systemPipeR_1.4.8 futile.options_1.0.0
[61] ShortRead_1.28.0 latticeExtra_0.6-26 lambda.r_1.1.7 png_0.1-7 gtable_0.1.2
[66] amap_0.8-14 chipseq_1.20.0 xtable_1.8-0 survival_2.38-3 pheatmap_1.0.8
[71] cluster_2.0.3 brew_1.0-6 GSEABase_1.32.0