Question

question about DiffBind

0

Entering edit mode

February • 0

@february-8645

Last seen 8.2 years ago

hi Rory,

I'm a new with DiffBind for differential binding analysis of ChIP-Seq peak data.I want to write command line for my work.But always print error.I really don't understant with dba.contrast and dba.mask.

I have 4 samples with chip-experiment,they are A_K2、B_T1、B_T3 and B_H3.And I have got 4 file peaks for every sample with Input sample(no-chip).

Now,I want get the differential binding of 3 combinations,A_K2vsB_T1、A_K2vsB_T3 and A_K2vsB_H3.My DBA sampleSheet is as follow:

> read.csv("sample.csv")
  SampleID     Tissue Factor  Condition Treatment Replicate
1     A_K2 A_K2vsB_T1     ER  Resistant      Full         1
2     B_T1 A_K2vsB_T1     ER Responsive      Full         1
3     B_T3 A_K2vsB_T3     ER Responsive      Full         1
4     B_H3 A_K2vsB_H3     ER Responsive      Full         1
               bamReads ControlID               bamControl
1 Sample_150079A_K2.bam   B_Input Sample_150079B_Input.bam
2 Sample_150079B_T1.bam   B_Input Sample_150079B_Input.bam
3 Sample_150079B_T3.bam   B_Input Sample_150079B_Input.bam
4 Sample_150079B_H3.bam   B_Input Sample_150079B_Input.bam
                         Peaks PeakCaller PeakFormat
1 run_macs14_narrow1_peaks.bed        bed        bed
2 run_macs14_narrow2_peaks.bed        bed        bed
3 run_macs14_narrow3_peaks.bed        bed        bed
4 run_macs14_narrow4_peaks.bed        bed        bed

>diff <- dba(sampleSheet="sample.csv")

>diff_count <- dba.count(diff,minOverlap=1)

I really need your help about dba.contrast and dba.analyze.

Thank you so much.

February Fang

DiffBind differential binding analysis • 5.8k views

ADD COMMENT • link updated 8.6 years ago by Rory Stark ★ 5.1k • written 8.7 years ago by February • 0

score 1 · Answer 1 · 2015-08-26

Hello-

There is a fundamental problem with your experimental design, namely the lack of replication. You can not use the statistical techniques underlying DiffBind (embodied in the edgeR and DESeq2 packages) with only a single sample on each side of a contrast. Without replicates, and some way to assess variance, confidence statistics can not be reliably calculated. You can look at overlaps (ie Venn diagrams using dba.plotVenn()) of the peak calls before calling dba.count(), and you can look at clustering (heatmaps and PCAs using dba.plotHeatmap() and dba.plotPCA()) of the samples after counting.

You could use an analysis to rank peaks by, eg, Fold change. You can get a report using DESeq2 as follows:

> diff_analysis <- dba.contrast(diff_count,
                                group1=1, name1="A_K2",
                                group2=2, name2="B_T1")
> diff_analysis <- dba.contrast(diff_analysis,
                                group1=1, name1="A_K2",
                                group2=3, name2="B_T3")
> diff_analysis <- dba.contrast(diff_analysis,
                                group1=1, name1="A_K2",
                                group2=4, name2="B_H3")
> diff_analysis <- dba.analyze(diff_analysis,method=DBA_DESEQ2)
> rep1 <- dba.report(diff_analysis, contrast=1, th=1, bCount=TRUE)
> rep2 <- dba.report(diff_analysis, contrast=2, th=1, bCount=TRUE)
> rep3 <- dba.report(diff_analysis, contrast=3, th=1, bCount=TRUE)

Hope this helps!

-Rory

score 0 · Answer 2 · 2015-08-30

hi Rory

Thank you for your advice and details. It really helped me a lot.

And here I havd another question about dba.count().In DiffBind reference manual,you had saied "DiffBind can use the supplied sequence read files to count how many reads overlap each interval for each unique sample. The result of this is a binding affinity matrix containing a (normalized) read count for each sample at every potential binding site." Can I understand it as readcount? You know,the output of MACS has a '*peaks.xls' file with a column named 'tags'. Why you use the alignment bam instead of 'tags'?

score 0 · Answer 3 · 2015-09-01

Hi again-

For the binding matrix, there has to be a single set of "consensus" peaks (rows in the matrix, where the samples are columns). In most cases this set of peaks will not correspond to any single set of peaks generated by eg MACS. In the default case where a consensus peak set is generated using all the peaks that overlap in at least two samples, this will include a) merged peaks where the overlapping peaks don't have the exact same coordinates and b) peaks that were called in some samples but not others. In these cases, there will not be a corresponding MACS tag count.

For these reasons amongst others (including not wanting to rely on a single peak caller, and a desire to count reads from control tracks), dba.count() re-counts the overlapping reads for every consensus peak for every sample, whether or not that peak was identified in that sample.

Cheers-

Rory

score 0 · Answer 4 · 2015-09-01

hi My command line as follows:

>diff <- dba(sampleSheet="sample.csv")

> head(diff$vectors)
CHR START END A_K2 B_H3
1 1 24611508 24613285 0.66803548 0.51411935
2 1 172350354 172351110 0.02988387 0.02449355
3 1 183299111 183299590 0.11737097 0.23100323

> head(diff$allvectors)
CHR START END A_K2 B_H3
1 1 5134468 5134943 -1.00000000 0.02549355
2 1 9347112 9347911 -1.00000000 0.03838710
3 1 9747914 9748482 0.02663548 -1.00000000

>diff_count <- dba.count(diff)

> head(diff_count$allvectors)
CHR START END A_K2 B_H3
1 1 24611508 24613285 12.387251 0.9768695
2 1 172350354 172351110 27.871314 21.4911280
3 1 183299111 183299590 1.032271 39.0747781

> diff_count$peaks
[[1]]
Chr Start End Score RPKM Reads
1 chr1 24611508 24613285 12.387251 1.681636e+01 436
2 chr1 172350354 172351110 27.871314 2.719772e+00 30
3 chr1 183299111 183299590 1.032271 8.871339e+00 62
cRPKM cReads
1 1.632206e+01 424
2 2.339358e-01 3
3 9.045844e+00 64

I found that diff$allvectors contained all of the peaks when diff$vectors contained only overlap peaks. But diff_count$allvectors also only contained overlap peaks after dba.count().Why "diffbind" only concentrate on the overlap peak ? In addition,I want to know how to caculate "cRPKM" and "cReads" ,I think that these two value come from the input sample controled.

score 0 · Answer 5 · 2015-09-10

You are correct that diff$allvectors contains all the peaks (merging overlapping ones) while diff$vectors contains the matrix used for subsequent analysis. Using the default value of minOverlap=2, the binding matrix will only contain peaks overlap in at least two samples. By setting minOverlap=1, these two should contain the same sets of peaks (ie nrow(diff$allvectors) == nrow(diff$vectors)), which may be what you want. Overlapping peaks will still be merged, so they may not correspond exactly to the peaks in the original peaksets. You can actually set the set of peaks used in the binding matrix to anything you want using the peaks parameter to dba.count().

cReads is the number of control reads that overlap a peak for a sample. Note that this is given a minimum value of 1 read. cRPKM is just the control read count normalized using the RPKM formula (cReads divided by the peak width divided by 1000 divided by the library size divided by 1M).

-Rory