question about DiffBind
5
0
Entering edit mode
February • 0
@february-8645
Last seen 8.2 years ago

hi  Rory,

I'm a new with DiffBind for di fferential binding analysis of ChIP-Seq peak data.I want to write command line for my work.But always print error.I really don't understant with dba.contrast and dba.mask.  

I have 4 samples with chip-experiment,they are A_K2、B_T1、B_T3 and B_H3.And I have got 4 file peaks for every sample with Input sample(no-chip).  

Now,I want get the diff erential binding of 3 combinations,A_K2vsB_T1、A_K2vsB_T3 and A_K2vsB_H3.My  DBA sampleSheet is as follow:

> read.csv("sample.csv") 
  SampleID     Tissue Factor  Condition Treatment Replicate
1     A_K2 A_K2vsB_T1     ER  Resistant      Full         1
2     B_T1 A_K2vsB_T1     ER Responsive      Full         1
3     B_T3 A_K2vsB_T3     ER Responsive      Full         1
4     B_H3 A_K2vsB_H3     ER Responsive      Full         1
               bamReads ControlID               bamControl
1 Sample_150079A_K2.bam   B_Input Sample_150079B_Input.bam
2 Sample_150079B_T1.bam   B_Input Sample_150079B_Input.bam
3 Sample_150079B_T3.bam   B_Input Sample_150079B_Input.bam
4 Sample_150079B_H3.bam   B_Input Sample_150079B_Input.bam
                         Peaks PeakCaller PeakFormat
1 run_macs14_narrow1_peaks.bed        bed        bed
2 run_macs14_narrow2_peaks.bed        bed        bed
3 run_macs14_narrow3_peaks.bed        bed        bed
4 run_macs14_narrow4_peaks.bed        bed        bed

>diff <- dba(sampleSheet="sample.csv")

>diff_count <- dba.count(diff,minOverlap=1)

I really need your help about dba.contrast and dba.analyze.

 

Thank you so much.

February Fang

DiffBind differential binding analysis • 5.8k views
ADD COMMENT
1
Entering edit mode
Rory Stark ★ 5.1k
@rory-stark-5741
Last seen 7 days ago
Cambridge, UK

Hello-

There is a fundamental problem with your experimental design, namely the lack of replication. You can not use the statistical techniques underlying DiffBind (embodied in the edgeR and DESeq2 packages) with only a single sample on each side of a contrast. Without replicates, and some way to assess variance, confidence statistics can not be reliably calculated. You can look at overlaps (ie Venn diagrams using dba.plotVenn()) of the peak calls before calling dba.count(), and you can look at clustering (heatmaps and PCAs using dba.plotHeatmap() and dba.plotPCA()) of the samples after counting.

You could use an analysis to rank peaks by, eg, Fold change. You can get a report using DESeq2 as follows:

> diff_analysis <- dba.contrast(diff_count,
                                group1=1, name1="A_K2",
                                group2=2, name2="B_T1")
> diff_analysis <- dba.contrast(diff_analysis,
                                group1=1, name1="A_K2",
                                group2=3, name2="B_T3")
> diff_analysis <- dba.contrast(diff_analysis,
                                group1=1, name1="A_K2",
                                group2=4, name2="B_H3")
> diff_analysis <- dba.analyze(diff_analysis,method=DBA_DESEQ2)
> rep1 <- dba.report(diff_analysis, contrast=1, th=1, bCount=TRUE)
> rep2 <- dba.report(diff_analysis, contrast=2, th=1, bCount=TRUE)
> rep3 <- dba.report(diff_analysis, contrast=3, th=1, bCount=TRUE)

Hope this helps!

-Rory

ADD COMMENT
0
Entering edit mode

Hi Rory, I would like to use Diffbind to compare two Chip-seq data sets without replicates (replicates are in preparation but not finished yet). I tried your code but the dba.analyze function gives me following arrow message:

Error in dimnames(x) <- dn :
  length of 'dimnames' [1] not equal to array extent

The diff_analysis matrix seem to be okay:

3 Samples, 4770 sites in matrix:
             ID Tissue Factor Condition Treatment Replicate Caller Intervals
1 MeDip_rep1_PP     PP  MeDip        WT       non         1    bed      4770
2 MeDip_rep1_DP     DP  MeDip        WT       non         1    bed      4770
3 MeDip_rep2_DP     DP  MeDip        WT       non         1    bed      4770

1 Contrast:
         Group1 Members1        Group2 Members2
1 MeDip_rep1_PP        1 MeDip_rep1_DP        1

Have no clue how to resolve the issue :(

 

Thanks a lot,

Flo

ADD REPLY
0
Entering edit mode
February • 0
@february-8645
Last seen 8.2 years ago

hi Rory

Thank you for your advice and details. It really helped me a lot.

And here I havd another question about dba.count().In DiffBind reference manual,you had saied "DiffBind can use the supplied sequence read files to count how many reads overlap each interval for each unique sample. The result of this is a binding affinity matrix containing a (normalized) read count for each sample at every potential binding site." Can I understand it as readcount? You know,the output of MACS has a '*peaks.xls' file with a column named 'tags'. Why you use the alignment bam instead of 'tags'?

ADD COMMENT
0
Entering edit mode
Rory Stark ★ 5.1k
@rory-stark-5741
Last seen 7 days ago
Cambridge, UK

Hi again-

For the binding matrix, there has to be a single set of "consensus" peaks (rows in the matrix, where the samples are columns). In most cases this set of peaks will not correspond to any single set of peaks generated by eg MACS.  In the default case where a consensus peak set is generated using all the peaks that overlap in at least two samples, this will include a) merged peaks where the overlapping peaks don't have the exact same coordinates and b) peaks that were called in some samples but not others. In these cases, there will not be a corresponding MACS tag count. 

For these reasons amongst others (including not wanting to rely on a single peak caller, and a desire to count reads from control tracks), dba.count() re-counts the overlapping reads for every consensus peak for every sample, whether or not that peak was identified in that sample.

Cheers-

Rory

ADD COMMENT
0
Entering edit mode
February • 0
@february-8645
Last seen 8.2 years ago

hi My command line as follows:

>diff <- dba(sampleSheet="sample.csv")

> head(diff$vectors)
    CHR     START       END       A_K2       B_H3
1     1  24611508  24613285 0.66803548 0.51411935
2     1 172350354 172351110 0.02988387 0.02449355
3     1 183299111 183299590 0.11737097 0.23100323

> head(diff$allvectors)
  CHR    START      END        A_K2        B_H3
1   1  5134468  5134943 -1.00000000  0.02549355
2   1  9347112  9347911 -1.00000000  0.03838710
3   1  9747914  9748482  0.02663548 -1.00000000

>diff_count <- dba.count(diff)

> head(diff_count$allvectors)
  CHR     START       END      A_K2       B_H3
1   1  24611508  24613285 12.387251  0.9768695
2   1 172350354 172351110 27.871314 21.4911280
3   1 183299111 183299590  1.032271 39.0747781

> diff_count$peaks
[[1]]
              Chr     Start       End      Score         RPKM Reads
1            chr1  24611508  24613285  12.387251 1.681636e+01   436
2            chr1 172350354 172351110  27.871314 2.719772e+00    30
3            chr1 183299111 183299590   1.032271 8.871339e+00    62
           cRPKM cReads
1   1.632206e+01    424
2   2.339358e-01      3
3   9.045844e+00     64

 

I found that diff$allvectors contained all of the peaks when diff$vectors contained only overlap peaks. But diff_count$allvectors also only contained overlap peaks after dba.count().Why "diffbind" only concentrate on the overlap peak ? In addition,I want to know how to caculate "cRPKM" and "cReads" ,I think that these two value come from the input sample controled.

ADD COMMENT
0
Entering edit mode
Rory Stark ★ 5.1k
@rory-stark-5741
Last seen 7 days ago
Cambridge, UK

You are correct that diff$allvectors contains all the peaks (merging overlapping ones) while diff$vectors contains the matrix used for subsequent analysis. Using the default value of minOverlap=2,  the binding matrix will only contain peaks overlap in at least two samples. By setting minOverlap=1, these two should contain the same sets of peaks (ie nrow(diff$allvectors) == nrow(diff$vectors)), which may be what you want. Overlapping peaks will still be merged, so they may not correspond exactly to the peaks in the original peaksets.  You can actually set the set of peaks used in the binding matrix to anything you want using the peaks parameter to dba.count().

cReads is the number of control reads that overlap a peak for a sample. Note that this is given a minimum value of 1 read. cRPKM is just the control read count normalized using the RPKM formula (cReads divided by the peak width divided by 1000 divided by the library size divided by 1M).

-Rory

ADD COMMENT

Login before adding your answer.

Traffic: 727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6