Question: Use of batch corrected (removeBatchEffect) read counts for further downstream analysis.
21 months ago by
ashwini.kumar10 wrote:

Hi,
I have RNA-seq data from 100 samples: 40 samples were sequenced with one library preparation kit (Nextera) and 60 with another (ScriptSeq), which introduced a known batch effect. We now want to analyze these samples together. To correct this known batch effect, I am using the following two approaches. Could you please suggest which one is correct, and whether these batch-corrected expression values can be used for further analyses that require already batch-corrected values, e.g., network analysis?
1. CPM
library(edgeR)

dge <- DGEList(counts=count)

dge <- calcNormFactors(dge, method = "TMM")

logCPM <- cpm(dge,log=TRUE,prior.count=5)

logCPM <- removeBatchEffect(logCPM,batch=batch, batch2 = batch2)

2. VOOM

y <- DGEList(counts=count)

voom.data <- voom(y, plot = FALSE, design = design.voom)

logCPM_voom <- removeBatchEffect(voom.data,batch=batch, batch2 = batch2)

We also have drug response data for these 100 samples, and I would like to correlate gene expression with drug response. Moreover, I want to use WGCNA and other network analysis approaches that require already batch-corrected expression values. Can I use these batch-corrected values in downstream analyses, e.g., correlation between drug response and gene expression, or are these values only useful for visualization purposes such as heatmaps and PCA clustering?

Thank you!

Best regards,

Ashwini

Answer: Use of batch corrected (removeBatchEffect) read counts for further downstream analysis.
21 months ago by
Aaron Lun23k (Cambridge, United Kingdom) wrote:

To answer your immediate question: we generally recommend using the first approach. A large prior.count avoids getting very large negative values from zero counts, which could be misleading during visualization.

However, batch-corrected expression values should only be used when you have no other choice, i.e., in procedures that do not accept design matrices. This includes visualization with heatmaps, dimensionality reduction and various forms of clustering; it may include network analyses, though you'll have to look at those methods specifically. If you are looking for changes in expression with respect to drug response, you can do that as part of a standard DE analysis with blocking on batch - there is no need to use batch-corrected values.
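A DE analysis with blocking on batch, as suggested above, might look like the following sketch. It assumes the `count` and `batch` objects from the question, plus a hypothetical per-sample drug-response covariate `drug` (assumed numeric here); batch enters the design matrix directly, so no pre-corrected values are needed.

```r
library(edgeR)
library(limma)

dge <- DGEList(counts = count)
dge <- calcNormFactors(dge, method = "TMM")

# Batch is modelled as a covariate; the last coefficient
# tests for association between expression and drug response.
design <- model.matrix(~ batch + drug)

v   <- voom(dge, design, plot = FALSE)
fit <- lmFit(v, design)
fit <- eBayes(fit)

# Top genes whose expression changes with drug response.
topTable(fit, coef = "drug")
```

Because the batch term is estimated as part of the linear model, the drug-response coefficient is adjusted for batch without ever modifying the expression values themselves.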

More generally, though, I would question the wisdom of forcing two data sets into a single analysis. Differences in the technology will likely result in differences in the mean-variance relationship, normalization, etc., which will not be properly handled if all samples are lumped together. I would suggest doing some type of meta-analysis instead, e.g., A: Merge different datasets regarding a common number of DE genes in R.
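One simple form of such a meta-analysis is to run the DE analysis within each batch separately and then look for genes detected in both. The sketch below is illustrative only; it assumes `count`, `batch`, and a hypothetical per-sample grouping factor `group`, and uses edgeR's quasi-likelihood pipeline within each batch.

```r
library(edgeR)

de_per_batch <- lapply(levels(factor(batch)), function(b) {
    keep <- batch == b
    g    <- factor(group[keep])

    dge <- DGEList(counts = count[, keep])
    dge <- calcNormFactors(dge, method = "TMM")

    design <- model.matrix(~ g)
    dge <- estimateDisp(dge, design)
    fit <- glmQLFit(dge, design)
    res <- glmQLFTest(fit, coef = 2)

    # DE genes at FDR 5% within this batch.
    rownames(topTags(res, n = Inf, p.value = 0.05)$table)
})

# Genes detected consistently in both technologies.
common_de <- Reduce(intersect, de_per_batch)
```

This keeps the mean-variance modelling and normalization separate for each technology, avoiding the problems of lumping the two data sets together.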

Thank you, Aaron

Hi Aaron,

Do you think this approach is right for making a clustering heatmap?

library(edgeR)
dge <- DGEList(counts=count)
dge <- calcNormFactors(dge, method = “TMM”)
logCPM <- cpm(dge,log=TRUE,prior.count=5)
logCPM_bc <- removeBatchEffect(logCPM,batch=batch)

# Scale data to Z-scores
logCPM_z <- t(scale(t(logCPM_bc)))


And then using this logCPM_z for clustering heatmap. Is this right?
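For completeness, one way to draw the heatmap from those Z-scores might be with the pheatmap package (an assumption; any heatmap function would do). Since the scaling was already done above, pheatmap's own scaling should be switched off.

```r
library(pheatmap)

# logCPM_z was already scaled to Z-scores per gene, so scale = "none".
pheatmap(logCPM_z,
         scale = "none",
         clustering_distance_rows = "correlation",
         show_rownames = FALSE)
```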

Don't resurrect old threads, the original question is 2 years old.