DiffBind: Different results with different chromosome names
1
0
Entering edit mode
D.Hemerich • 0
@dhemerich-11434
Last seen 5.0 years ago

Dear all,

I'm using DiffBind to compare histone modifications in two groups of samples.

I'm comparing two cell-types, from the same 8 individuals. So I have 16 samples, 8 of each cell-type.

The aim is to find peaks that are significantly diferentially bound between the two groups of samples, and thus identify epigenetic differences between the two cell-types, using the 8 matched pairs of cell-type from the same individuals.

Whenever I analyze all 8 samples together, I notice that there are six samples that cluster in the wrong group on the heatmap and the PCA. I would expect to find two nice clusters of each cell-type, but there are these six samples that go to the wrong group. Four of these samples, by the away, are matched pairs of the same two individuals. Finally, I get only one differentially bound peak between the two groups of samples when using all 16 of them.

Then, I took those 6 samples out of the analysis and run again.
So now I have 10 samples, 5 of each cell-type.
They cluster nicely on the heatmap and on the PCA.
And there are 1594 differentially bound peaks identified with edgeR, using the default threshold of FDR <= 0.05.

However, I have one doubt:

These 1594 peaks are identified when the .bam files do not have the 'chr' string in the first column for chromosome name, but the .bed files with the peaks do have the 'chr' string and the number.
When I run the analysis with no 'chr' string in either .bed or .bam files, DiffBind finds only two differentially bound peaks, instead of 1594... so I guess it doesn't work.

Even the FRiP value is generated for each peakset both when .bam and .bed chromosome columns are mismatched (.bed with 'chr'+number and .bam with no 'chr', just number) and also when they match, but only the differentially bound peaks are not identified when the files match. I'm puzzled by this.

Any insight, suggestion or opinion will be of great help.

Code:

c5 <- dba(sampleSheet=read.csv("c5.csv"))
c5

# 10 Samples, 50849 sites in matrix (83459 total):
#   ID Tissue  Factor  Condition Replicate Caller Intervals
# 1  A1     LV H3K27ac a         1    bed     37472
# 2  A2     LV H3K27ac a         2    bed     32165
# 3  A3     LV H3K27ac a         3    bed     43512
# 4  A4     LV H3K27ac a         4    bed     46404
# 5  A5     LV H3K27ac a         5    bed     29335
# 6  B1     RV H3K27ac b         1    bed     22295
# 7  B2     RV H3K27ac b         2    bed     18116
# 8  B3     RV H3K27ac b         3    bed     47573
# 9  B4     RV H3K27ac b         4    bed     49963
# 10 B5     RV H3K27ac b         5    bed     38031

c5 <- dba.count(c5)
c5

# 10 Samples, 50849 sites in matrix:
#   ID Tissue  Factor  Condition Replicate Caller Intervals FRiP
# 1  A1     LV H3K27ac a         1 counts     50849 0.23
# 2  A2     LV H3K27ac a         2 counts     50849 0.20
# 3  A3     LV H3K27ac a         3 counts     50849 0.24
# 4  A4     LV H3K27ac a         4 counts     50849 0.30
# 5  A5     LV H3K27ac a         5 counts     50849 0.18
# 6  B1     RV H3K27ac b         1 counts     50849 0.15
# 7  B2     RV H3K27ac b         2 counts     50849 0.13
# 8  B3     RV H3K27ac b         3 counts     50849 0.24
# 9  B4     RV H3K27ac b         4 counts     50849 0.29
# 10 B5     RV H3K27ac b         5 counts     50849 0.20

c5 <- dba.contrast(c5, categories=DBA_CONDITION, minMembers=2)
c5 <- dba.analyze(c5, method=DBA_ALL_METHODS)
c5

# 1 Contrast:
#   Group1 Members1 Group2 Members2 DB.edgeR DB.DESeq DB.DESeq2
# 1     a        5     b        5     1594     1437      1366


Thank you in advance,
Daiane.

 

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DiffBind_1.16.3            RSQLite_1.0.0
 [3] DBI_0.5                    locfit_1.5-9.1
 [5] GenomicAlignments_1.6.3    Rsamtools_1.25.0
 [7] Biostrings_2.38.4          XVector_0.10.0
 [9] limma_3.29.14              SummarizedExperiment_1.0.2
[11] Biobase_2.33.0             GenomicRanges_1.22.4
[13] GenomeInfoDb_1.6.3         IRanges_2.4.8
[15] S4Vectors_0.8.11           BiocGenerics_0.19.1

loaded via a namespace (and not attached):
 [1] edgeR_3.12.1              splines_3.2.3
 [3] gtools_3.5.0              Formula_1.2-1
 [5] latticeExtra_0.6-28       amap_0.8-14
 [7] RBGL_1.49.1               Category_2.39.0
 [9] backports_1.0.3           lattice_0.20-33
[11] chron_2.3-47              digest_0.6.10
[13] RColorBrewer_1.1-2        checkmate_1.8.1
[15] colorspace_1.2-6          Matrix_1.2-6
[17] plyr_1.8.4                GSEABase_1.35.0
[19] DESeq2_1.10.1             XML_3.98-1.4
[21] pheatmap_1.0.8            ShortRead_1.28.0
[23] biomaRt_2.29.2            genefilter_1.52.1
[25] zlibbioc_1.19.0           xtable_1.8-2
[27] GO.db_3.2.2               scales_0.4.0
[29] brew_1.0-6                gdata_2.17.0
[31] BiocParallel_1.7.4        annotate_1.51.0
[33] ggplot2_2.1.0             GenomicFeatures_1.22.13
[35] nnet_7.3-12               survival_2.39-5
[37] magrittr_1.5              systemPipeR_1.4.8
[39] fail_1.3                  gplots_3.0.1
[41] RcppArmadillo_0.7.200.2.0 hwriter_1.3.2
[43] foreign_0.8-66            GOstats_2.36.0
[45] graph_1.51.0              data.table_1.9.6
[47] tools_3.2.3               BBmisc_1.10
[49] stringr_1.0.0             sendmailR_1.2-1
[51] munsell_0.4.3             cluster_2.0.4
[53] AnnotationDbi_1.32.3      DESeq_1.22.1
[55] caTools_1.17.1            grid_3.2.3
[57] RCurl_1.95-4.8            rjson_0.2.15
[59] AnnotationForge_1.12.2    bitops_1.0-6
[61] base64enc_0.1-3           gtable_0.2.0
[63] gridExtra_2.2.1           rtracklayer_1.30.4
[65] Hmisc_3.17-4              KernSmooth_2.23-15
[67] stringi_1.1.1             BatchJobs_1.6
[69] Rcpp_0.12.6               geneplotter_1.48.0
[71] rpart_4.1-10              acepack_1.3-3.3

diffbind chip-seq • 628 views
ADD COMMENT
0
Entering edit mode
Rory Stark ★ 4.1k
@rory-stark-5741
Last seen 19 hours ago
CRUK, Cambridge, UK

Hi Daiane-

In general, DiffBind does want the chromosomes to be exactly matched in the bam files (reads) and the bed files (peaks).

If you send me a copy of (or link to) your DBA objects, preferably before and after calling dba.count(), for both the matched and unmatched case, I can have a look at exactly what it is doing. We did change the way DiffBind handles chromosome names in the more recent release (Bioconductor 3.3/DiffBind 2.0), and I will be looking at how it behaves in that release.

Regards-

Rory

rory.stark@cruk.cam.ac.uk

 

ADD COMMENT

Login before adding your answer.

Traffic: 310 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6