Question: DiffBind Batch Effect
0
2.1 years ago by
niu.shengyong0 wrote:

I'm now running ATAC-seq differential analysis by DiffBind with peaks from MACS2. I have 10 samples from batch 3 and 20 samples from batch 5. How could I eliminate batch effect and normalize them in DiffBind? Sorry, I'm quite new in this area. Thanks!

Simon

diffbind atac • 1.3k views
modified 2.0 years ago • written 2.1 years ago by niu.shengyong0
Answer: DiffBind Batch Effect
0
2.1 years ago by
Rory Stark2.8k
CRUK, Cambridge, UK
Rory Stark2.8k wrote:

If you see a batch effect (eg in the PCA plots), the best way to handle them is to model them. We don't want to adjust the count matrix directly as that violates the assumption of the differential analyses in DESeq2 and edgeR

To model the batch, use the block parameter in dba.contrast(). For example, if the batch numbers were stored in the Replicate field, you would say block=DBA_REPLICATE. Then when you run dba.analyze(), it will do the analysis two ways: as a single-factor analysis, and with the "blocking" factor representing the batch. To use the results from the blocked analysis, you need to specify method=DBA_DESEQ2_BLOCK (or method=DBA_EDGER_BLOCK) when calling dba.report() or any of the plotting functions.

-Rory

I have 8 samples from 2 different batches without replicates. How could I use the block argument?

Thanks!

1

Hopefully you have some replicates in the sense that multiple samples represent alternative experiments of the same condition.

The first step is to mark the batch using one of the metadata fields. You could use for example Condition, Treatment, or Replicate. Create a column in the sample sheet with one of these names, and for the sample in the first batch set the value to "1" and for the samples in the second batch set them to "2" (or "Batch1"and "Batch2" if you are not using the Replicate field).

For example, if the comparison you want to make is between two conditions, you would set the condition labels for each sample in the Condition column, and the corresponding batch number in the Replicate column. Then, after you have loaded the sample sheet using dba() and computed the read counts for the consensus set using dba.count(), you call dba.contrast() and dba.analyze():

> myDBA <- dba.contrast(myDBA, categories=DBA_CONDITION, block=DBA_REPLICATE)
> myDBA <- dba.analyze(myDBA)
> myDBA

In the contrast listing for myDBA, you should see values for two analyses: DB.DESeq2 and DB.DESEq2.block. The first is re[presents the results for a single-factor analysis, while the second reflects the additional modelling of the batch effect.

To use the results of the multi-factor analysis, you need to specify method=DBA_DESEQ2_BLOCK, eg:

> dba.plotMA(myDBA, method=DBA_DESEQ2_BLOCK)
> DBresults <- dba.report(myDBA, method=DBA_DESEQ2_BLOCK)
> plot(myDBA, contrast=1, method=DBA_DESEQ2_BLOCK)
> dba.plotPCA(myDBA, contrast=1, method=DBA_DESEQ2_BLOCK)

Some other tips:

• You can set the name of the metadata field you are using to represent he batch in subsequent plots by setting: myDBA$config$replicate <- "Batch"
• You can use the multi-factor analysis by default (don't need to specify the method= parameter every time) by setting: myDBA$config$AnalysisMethod <- DBA_DESEQ2_BLOCK

-Rory

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Rory Stark2.8k

Hi Rory,

When I follow your suggestions, it shows the error: "Error in pv.DBAreport(pv=DBA, contrast = contrast, method = method: DESeq2 analysis has not been run for this contrast."

My codes:

ta <- dba(sampleSheet = "diff.csv")
ta <- dba.count(ta, summits=75)
ta = dba.contrast(ta, categories=DBA_CONDITION, minMembers = 2, block=DBA_REPLICATE)
ta <- dba.contrast(ta)
ta = dba.analyze(ta, method=DBA_DESEQ2_BLOCK)
ta.DB = dba.report(ta, contrast=1, method=DBA_DESEQ2_BLOCK, bCounts=TRUE)

It also shows the following output when I type "ta" after the second dba.contrast()

Group1 Members1 Group2 Member2

Control               3          Test           5

Simon

I have 2 batches. One is "batch 3" and another one is "batch 5", so I put numerical value 3 or 5 in the Replicate column.

Hi Simon-

My example code had an error (I've fixed it above). The second call to dba.contrast() should be a call to dba.analyze(). So in your code you should remove the second call to dba.contrast() and remove the "method=DBA_DESEQ2_BLOCK" parameter from the call to dba.analyze():

ta <- dba(sampleSheet = "diff.csv")
ta <- dba.count(ta, summits=75)
ta <- dba.contrast(ta, categories=DBA_CONDITION, minMembers = 2,
block=DBA_REPLICATE)
ta <- dba.analyze(ta)
ta.DB <- dba.report(ta, contrast=1, method=DBA_DESEQ2_BLOCK, bCounts=TRUE)
ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Rory Stark2.8k
Answer: DiffBind Batch Effect
0
2.1 years ago by
niu.shengyong0 wrote:

Hi Rory,

I get the message after dba.report:

Error in colnames<-(*tmp*, value = c("Chr", "Start", "End", "Conc",  :
'names' attribute [9] must be the same length as the vector [7]

I use the version of 2.2.12 and I knew it may be the version issue. Is there any workaround for it if I still need to use version 2.2.12? Thanks!

If you email the DBA object (ta.DB) to me (or send me a link where I can download it), I can see what is going on. Email is on Bioconductor landing page for DiffBind.

-Rory

Just emailed to you. Thanks!

Answer: DiffBind Batch Effect
0
2.0 years ago by
niu.shengyong0 wrote:

It works when I tried in another version; however, there is another error now:

It shows the following error when I try to plot heatmap or PCA plot. "Error in pv.getPlotData(DBA, attributes = attributes, contrast = contrast,  :  Only one site to plot -- need at least 2!".

Thanks!

See answer for: DiffBind error Error in pv.getPlotData

Thanks Rory! After checking the PCA plot and correlation plot, I find there is still quite significant batch effect. Do you have any advice? Thanks!

The PCA and correlation plots may be misleading in this case. Modelling the batch effect in the GLM does not change the read counts in any way. So while the model used to identify differentially bound sites takes the batch effect into account, when you plot these sites using the count matrix, they will still exhibit the batch effect.

Short of altering the read counts for the plots (eg by singular value decomposition, as is used in SVA), we can't account for the batch effect in the plots. As DiffBind uses differential analysis methods that rely on raw reads (DESeq2 and edgeR), it avoids such alterations to the count matrix.

-R

Hi Rory,

Thanks for your help! In this case, how could I evaluate if there is still significant batch effect in my dataset after use the "BLOCK" arguments you mentioned? Thanks!

Having modelled the batch effect, the resulting statistics should be valid. In particular, I would expect that False Positive statistics to be valid. Yo may have more False Negatives formt he increased variance, but he sites that have low FDR should be good.

-R