DiffBind Batch Effect
3
1
Entering edit mode
@niushengyong-12910
Last seen 7.5 years ago

I'm now running ATAC-seq differential analysis by DiffBind with peaks from MACS2. I have 10 samples from batch 3 and 20 samples from batch 5. How could I eliminate batch effect and normalize them in DiffBind? Sorry, I'm quite new in this area. Thanks!

Simon

diffbind atac • 5.8k views
ADD COMMENT
2
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 28 days ago
Cambridge, UK

If you see a batch effect (eg in the PCA plots), the best way to handle them is to model them. We don't want to adjust the count matrix directly as that violates the assumption of the differential analyses in DESeq2 and edgeR

To model the batch, use the block parameter in dba.contrast(). For example, if the batch numbers were stored in the Replicate field, you would say block=DBA_REPLICATE. Then when you run dba.analyze(), it will do the analysis two ways: as a single-factor analysis, and with the "blocking" factor representing the batch. To use the results from the blocked analysis, you need to specify method=DBA_DESEQ2_BLOCK (or method=DBA_EDGER_BLOCK) when calling dba.report() or any of the plotting functions.

-Rory

ADD COMMENT
0
Entering edit mode

I have 8 samples from 2 different batches without replicates. How could I use the block argument?

Thanks!

ADD REPLY
3
Entering edit mode

Hopefully you have some replicates in the sense that multiple samples represent alternative experiments of the same condition.

The first step is to mark the batch using one of the metadata fields. You could use for example Condition, Treatment, or Replicate. Create a column in the sample sheet with one of these names, and for the sample in the first batch set the value to "1" and for the samples in the second batch set them to "2" (or "Batch1"and "Batch2" if you are not using the Replicate field).

For example, if the comparison you want to make is between two conditions, you would set the condition labels for each sample in the Condition column, and the corresponding batch number in the Replicate column. Then, after you have loaded the sample sheet using dba() and computed the read counts for the consensus set using dba.count(), you call dba.contrast() and dba.analyze():

> myDBA <- dba.contrast(myDBA, categories=DBA_CONDITION, block=DBA_REPLICATE)
> myDBA <- dba.analyze(myDBA)
> myDBA

In the contrast listing for myDBA, you should see values for two analyses: DB.DESeq2 and DB.DESEq2.block. The first is re[presents the results for a single-factor analysis, while the second reflects the additional modelling of the batch effect.

To use the results of the multi-factor analysis, you need to specify method=DBA_DESEQ2_BLOCK, eg:

> dba.plotMA(myDBA, method=DBA_DESEQ2_BLOCK)
> DBresults <- dba.report(myDBA, method=DBA_DESEQ2_BLOCK)
> plot(myDBA, contrast=1, method=DBA_DESEQ2_BLOCK)
> dba.plotPCA(myDBA, contrast=1, method=DBA_DESEQ2_BLOCK)

Some other tips:

  • You can set the name of the metadata field you are using to represent he batch in subsequent plots by setting: myDBA$config$replicate <- "Batch"
  • You can use the multi-factor analysis by default (don't need to specify the method= parameter every time) by setting: myDBA$config$AnalysisMethod <- DBA_DESEQ2_BLOCK

-Rory

ADD REPLY
0
Entering edit mode

Hi Rory,

When I follow your suggestions, it shows the error: "Error in pv.DBAreport(pv=DBA, contrast = contrast, method = method: DESeq2 analysis has not been run for this contrast."

My codes:

ta <- dba(sampleSheet = "diff.csv")
ta <- dba.count(ta, summits=75)
ta = dba.contrast(ta, categories=DBA_CONDITION, minMembers = 2, block=DBA_REPLICATE)
ta <- dba.contrast(ta)
ta = dba.analyze(ta, method=DBA_DESEQ2_BLOCK)
ta.DB = dba.report(ta, contrast=1, method=DBA_DESEQ2_BLOCK, bCounts=TRUE)

It also shows the following output when I type "ta" after the second dba.contrast()

Group1 Members1 Group2 Member2

 Control               3          Test           5

 

Simon

ADD REPLY
0
Entering edit mode

I have 2 batches. One is "batch 3" and another one is "batch 5", so I put numerical value 3 or 5 in the Replicate column. 

ADD REPLY
0
Entering edit mode

Hi Simon-

My example code had an error (I've fixed it above). The second call to dba.contrast() should be a call to dba.analyze(). So in your code you should remove the second call to dba.contrast() and remove the "method=DBA_DESEQ2_BLOCK" parameter from the call to dba.analyze():

ta <- dba(sampleSheet = "diff.csv")
ta <- dba.count(ta, summits=75)
ta <- dba.contrast(ta, categories=DBA_CONDITION, minMembers = 2, 
                   block=DBA_REPLICATE)
ta <- dba.analyze(ta)
ta.DB <- dba.report(ta, contrast=1, method=DBA_DESEQ2_BLOCK, bCounts=TRUE)
ADD REPLY
0
Entering edit mode
@niushengyong-12910
Last seen 7.5 years ago

Hi Rory,

I get the message after dba.report:

Error in `colnames<-`(`*tmp*`, value = c("Chr", "Start", "End", "Conc",  : 
  'names' attribute [9] must be the same length as the vector [7]

I use the version of 2.2.12 and I knew it may be the version issue. Is there any workaround for it if I still need to use version 2.2.12? Thanks!

ADD COMMENT
0
Entering edit mode

If you email the DBA object (ta.DB) to me (or send me a link where I can download it), I can see what is going on. Email is on Bioconductor landing page for DiffBind.

-Rory

ADD REPLY
0
Entering edit mode

Just emailed to you. Thanks!

ADD REPLY
0
Entering edit mode
@niushengyong-12910
Last seen 7.5 years ago

It works when I tried in another version; however, there is another error now:

It shows the following error when I try to plot heatmap or PCA plot. "Error in pv.getPlotData(DBA, attributes = attributes, contrast = contrast,  :  Only one site to plot -- need at least 2!".

 

Thanks!

ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks Rory! After checking the PCA plot and correlation plot, I find there is still quite significant batch effect. Do you have any advice? Thanks!

ADD REPLY
0
Entering edit mode

The PCA and correlation plots may be misleading in this case. Modelling the batch effect in the GLM does not change the read counts in any way. So while the model used to identify differentially bound sites takes the batch effect into account, when you plot these sites using the count matrix, they will still exhibit the batch effect.

Short of altering the read counts for the plots (eg by singular value decomposition, as is used in SVA), we can't account for the batch effect in the plots. As DiffBind uses differential analysis methods that rely on raw reads (DESeq2 and edgeR), it avoids such alterations to the count matrix.

-R

ADD REPLY
0
Entering edit mode

Hi Rory,

 

Thanks for your help! In this case, how could I evaluate if there is still significant batch effect in my dataset after use the "BLOCK" arguments you mentioned? Thanks!

ADD REPLY
0
Entering edit mode

Having modelled the batch effect, the resulting statistics should be valid. In particular, I would expect that False Positive statistics to be valid. Yo may have more False Negatives formt he increased variance, but he sites that have low FDR should be good.

-R

ADD REPLY
0
Entering edit mode

Thanks for your help!

ADD REPLY

Login before adding your answer.

Traffic: 698 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6