Question

DiffBind Batch Effect

1

Entering edit mode

niu.shengyong ▴ 10

@niushengyong-12910

Last seen 8.6 years ago

I'm now running ATAC-seq differential analysis by DiffBind with peaks from MACS2. I have 10 samples from batch 3 and 20 samples from batch 5. How could I eliminate batch effect and normalize them in DiffBind? Sorry, I'm quite new in this area. Thanks!

Simon

diffbind atac • 7.2k views

ADD COMMENT • link 8.6 years ago niu.shengyong ▴ 10

score 2 · Answer 1 · 2017-06-01

2

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 12 months ago

Cambridge, UK

If you see a batch effect (eg in the PCA plots), the best way to handle them is to model them. We don't want to adjust the count matrix directly as that violates the assumption of the differential analyses in DESeq2 and edgeR.

To model the batch, use the block parameter in dba.contrast(). For example, if the batch numbers were stored in the Replicate field, you would say block=DBA_REPLICATE. Then when you run dba.analyze(), it will do the analysis two ways: as a single-factor analysis, and with the "blocking" factor representing the batch. To use the results from the blocked analysis, you need to specify method=DBA_DESEQ2_BLOCK (or method=DBA_EDGER_BLOCK) when calling dba.report() or any of the plotting functions.

-Rory

ADD COMMENT • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

I have 8 samples from 2 different batches without replicates. How could I use the block argument?

Thanks!

ADD REPLY • link 8.6 years ago niu.shengyong ▴ 10

3

Entering edit mode

Hopefully you have some replicates in the sense that multiple samples represent alternative experiments of the same condition.

The first step is to mark the batch using one of the metadata fields. You could use for example Condition, Treatment, or Replicate. Create a column in the sample sheet with one of these names, and for the sample in the first batch set the value to "1" and for the samples in the second batch set them to "2" (or "Batch1"and "Batch2" if you are not using the Replicate field).

For example, if the comparison you want to make is between two conditions, you would set the condition labels for each sample in the Condition column, and the corresponding batch number in the Replicate column. Then, after you have loaded the sample sheet using dba() and computed the read counts for the consensus set using dba.count(), you call dba.contrast() and dba.analyze():

> myDBA <- dba.contrast(myDBA, categories=DBA_CONDITION, block=DBA_REPLICATE)
> myDBA <- dba.analyze(myDBA)
> myDBA

In the contrast listing for myDBA, you should see values for two analyses: DB.DESeq2 and DB.DESEq2.block. The first is re[presents the results for a single-factor analysis, while the second reflects the additional modelling of the batch effect.

To use the results of the multi-factor analysis, you need to specify method=DBA_DESEQ2_BLOCK, eg:

> dba.plotMA(myDBA, method=DBA_DESEQ2_BLOCK)
> DBresults <- dba.report(myDBA, method=DBA_DESEQ2_BLOCK)
> plot(myDBA, contrast=1, method=DBA_DESEQ2_BLOCK)
> dba.plotPCA(myDBA, contrast=1, method=DBA_DESEQ2_BLOCK)

Some other tips:

You can set the name of the metadata field you are using to represent he batch in subsequent plots by setting: myDBA$config$replicate <- "Batch"
You can use the multi-factor analysis by default (don't need to specify the method= parameter every time) by setting: myDBA$config$AnalysisMethod <- DBA_DESEQ2_BLOCK

-Rory

ADD REPLY • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory,

When I follow your suggestions, it shows the error: "Error in pv.DBAreport(pv=DBA, contrast = contrast, method = method: DESeq2 analysis has not been run for this contrast."

My codes:

ta <- dba(sampleSheet = "diff.csv")
ta <- dba.count(ta, summits=75)
ta = dba.contrast(ta, categories=DBA_CONDITION, minMembers = 2, block=DBA_REPLICATE)
ta <- dba.contrast(ta)
ta = dba.analyze(ta, method=DBA_DESEQ2_BLOCK)
ta.DB = dba.report(ta, contrast=1, method=DBA_DESEQ2_BLOCK, bCounts=TRUE)

It also shows the following output when I type "ta" after the second dba.contrast()

Group1 Members1 Group2 Member2

Control 3 Test 5

Simon

ADD REPLY • link 8.6 years ago niu.shengyong ▴ 10

0

Entering edit mode

I have 2 batches. One is "batch 3" and another one is "batch 5", so I put numerical value 3 or 5 in the Replicate column.

ADD REPLY • link 8.6 years ago niu.shengyong ▴ 10

0

Entering edit mode

Hi Simon-

My example code had an error (I've fixed it above). The second call to dba.contrast() should be a call to dba.analyze(). So in your code you should remove the second call to dba.contrast() and remove the "method=DBA_DESEQ2_BLOCK" parameter from the call to dba.analyze():

ta <- dba(sampleSheet = "diff.csv")
ta <- dba.count(ta, summits=75)
ta <- dba.contrast(ta, categories=DBA_CONDITION, minMembers = 2, 
                   block=DBA_REPLICATE)
ta <- dba.analyze(ta)
ta.DB <- dba.report(ta, contrast=1, method=DBA_DESEQ2_BLOCK, bCounts=TRUE)

ADD REPLY • link 8.6 years ago Rory Stark ★ 5.2k

score 0 · Answer 2 · 2017-06-05

0

Entering edit mode

niu.shengyong ▴ 10

@niushengyong-12910

Last seen 8.6 years ago

Hi Rory,

I get the message after dba.report:

Error in `colnames<-`(`*tmp*`, value = c("Chr", "Start", "End", "Conc",  : 
  'names' attribute [9] must be the same length as the vector [7]

I use the version of 2.2.12 and I knew it may be the version issue. Is there any workaround for it if I still need to use version 2.2.12? Thanks!

ADD COMMENT • link 8.6 years ago niu.shengyong ▴ 10

0

Entering edit mode

If you email the DBA object (ta.DB) to me (or send me a link where I can download it), I can see what is going on. Email is on Bioconductor landing page for DiffBind.

-Rory

ADD REPLY • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Just emailed to you. Thanks!

ADD REPLY • link 8.6 years ago niu.shengyong ▴ 10

score 0 · Answer 3 · 2017-06-08

0

Entering edit mode

niu.shengyong ▴ 10

@niushengyong-12910

Last seen 8.6 years ago

It works when I tried in another version; however, there is another error now:

It shows the following error when I try to plot heatmap or PCA plot. "Error in pv.getPlotData(DBA, attributes = attributes, contrast = contrast, : Only one site to plot -- need at least 2!".

Thanks!

ADD COMMENT • link 8.6 years ago niu.shengyong ▴ 10

0

Entering edit mode

See answer for: DiffBind error Error in pv.getPlotData

ADD REPLY • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thanks Rory! After checking the PCA plot and correlation plot, I find there is still quite significant batch effect. Do you have any advice? Thanks!

ADD REPLY • link 8.6 years ago niu.shengyong ▴ 10

0

Entering edit mode

The PCA and correlation plots may be misleading in this case. Modelling the batch effect in the GLM does not change the read counts in any way. So while the model used to identify differentially bound sites takes the batch effect into account, when you plot these sites using the count matrix, they will still exhibit the batch effect.

Short of altering the read counts for the plots (eg by singular value decomposition, as is used in SVA), we can't account for the batch effect in the plots. As DiffBind uses differential analysis methods that rely on raw reads (DESeq2 and edgeR), it avoids such alterations to the count matrix.

-R

ADD REPLY • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory,

Thanks for your help! In this case, how could I evaluate if there is still significant batch effect in my dataset after use the "BLOCK" arguments you mentioned? Thanks!

ADD REPLY • link 8.6 years ago niu.shengyong ▴ 10

0

Entering edit mode

Having modelled the batch effect, the resulting statistics should be valid. In particular, I would expect that False Positive statistics to be valid. Yo may have more False Negatives formt he increased variance, but he sites that have low FDR should be good.

-R

ADD REPLY • link 8.6 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thanks for your help!

ADD REPLY • link 8.5 years ago niu.shengyong ▴ 10