Question: Use of batch corrected (removeBatchEffect) read counts for further downstream analysis.
19 months ago by
ashwini.kumar10 wrote:

I have RNA-seq data from 100 samples: 40 were sequenced with one library preparation kit (Nextera) and 60 with another (ScriptSeq). This introduced a known batch effect, and we now want to analyze all of the samples together. To correct for this batch effect I am trying the following two approaches. Which one is correct? And can the batch-corrected expression values be used for further analyses that require already batch-corrected values, e.g. network analysis?
1. CPM

  dge <- DGEList(counts=count)
  dge <- calcNormFactors(dge, method = "TMM")
  logCPM <- cpm(dge, log=TRUE, prior.count=5)
  logCPM <- removeBatchEffect(logCPM, batch=batch, batch2=batch2)

2. voom

  y <- DGEList(counts=count)
  # v is the voom output; v$E holds the log-CPM matrix
  v <- voom(y, plot=FALSE, design=design.voom)
  logCPM_voom <- removeBatchEffect(v$E, batch=batch, batch2=batch2)

We also have drug response data for these 100 samples, and I would like to correlate gene expression with drug response. In addition, I want to use WGCNA and other network analysis approaches that need already batch-corrected expression values. Can I use these batch-corrected values in downstream analyses, e.g. correlation between drug response and gene expression, or are they only useful for visualization, e.g. heatmaps, PCA and clustering?

Thank you!

Best regards,


Answer: Use of batch corrected (removeBatchEffect) read counts for further downstream analysis.
19 months ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

To answer your immediate question: we generally recommend using the first approach. A large prior.count avoids getting very large negative values from zero counts, which could be misleading during visualization.
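
To illustrate the point about prior.count, here is a small base-R sketch. The log_cpm() function is a hypothetical, simplified stand-in for a log-CPM with a prior count (it assumes the prior is scaled by relative library size), not edgeR's actual code, and the counts are made up.

```r
# Simplified log-CPM with a prior count, for illustration only.
# Assumption: the prior is scaled by each library's size relative to the mean.
log_cpm <- function(counts, prior.count) {
  lib.size <- colSums(counts)
  prior.scaled <- prior.count * lib.size / mean(lib.size)
  # add the scaled prior to every count, and twice the prior to the library size
  adj.counts <- sweep(counts, 2, prior.scaled, "+")
  adj.lib <- lib.size + 2 * prior.scaled
  log2(sweep(adj.counts, 2, adj.lib, "/") * 1e6)
}

# Made-up counts: gene1 is 0 in sample s1 and 3 in sample s2
counts <- matrix(c(0, 10, 500, 1e6,
                   3, 12, 450, 1e6),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

small <- log_cpm(counts, prior.count = 0.25)
large <- log_cpm(counts, prior.count = 5)

# Difference between s2 and s1 for the low-count gene under each prior
gap_small <- small["gene1", "s2"] - small["gene1", "s1"]
gap_large <- large["gene1", "s2"] - large["gene1", "s1"]
print(c(small = gap_small, large = gap_large))
```

With the small prior, the zero count becomes a negative log-CPM and the 0-versus-3 difference looks like a large fold change; the larger prior shrinks that difference, which is what you want in a heatmap or PCA.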

However, batch-corrected expression values should only be used when you have no other choice, i.e., in procedures that do not accept design matrices. This includes visualization with heatmaps, dimensionality reduction and various forms of clustering; it may include network analyses, though you'll have to look at those methods specifically. If you are looking for changes in expression with respect to drug response, you can do that as part of a standard DE analysis with blocking on batch - there is no need to use batch-corrected values.
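
As a rough sketch of what "blocking on batch" means, here is a base-R example for a single simulated gene. All names and numbers are made up for illustration. In limma the equivalent would be to put the batch factors into the design matrix passed to voom() and lmFit(), rather than calling lm() gene by gene.

```r
set.seed(1)
n <- 100
# Same layout as in the question: 40 Nextera and 60 ScriptSeq samples
batch <- factor(rep(c("Nextera", "ScriptSeq"), times = c(40, 60)))
response <- rnorm(n)  # hypothetical drug response score

# Simulated log-expression for one gene: true slope 0.5, batch shift of 2
expr <- 5 + 0.5 * response + 2 * (batch == "ScriptSeq") + rnorm(n, sd = 0.5)

# Blocking on batch: batch is a term in the model, no corrected values needed
fit_blocked <- lm(expr ~ batch + response)
# Ignoring batch leaves the batch shift in the residual variance
fit_naive <- lm(expr ~ response)

slope_blocked <- coef(fit_blocked)["response"]
se_blocked <- summary(fit_blocked)$coefficients["response", "Std. Error"]
se_naive <- summary(fit_naive)$coefficients["response", "Std. Error"]
print(c(slope = unname(slope_blocked), se_blocked = se_blocked, se_naive = se_naive))
```

The blocked model estimates the batch shift and the response effect together, so the residual degrees of freedom account for the batch term. That accounting is lost if you first remove the batch effect and then treat the corrected values as if they were raw data.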

More generally, though, I would question the wisdom of forcing two data sets into a single analysis. Differences in the technology will likely result in differences in the mean-variance relationship, normalization, etc., which will not be properly handled if all samples are lumped together. I would suggest doing some type of meta-analysis instead, e.g., A: Merge different datasets regarding a common number of DE genes in R.
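
One possible shape for such a meta-analysis, sketched in base R on simulated data: analyze each data set separately, then combine the per-gene p-values. Fisher's method is used here only as an example of a combining rule. It is my own choice for the sketch, not necessarily what the linked answer does, and it ignores the direction of the effect. The helper functions and all numbers are hypothetical.

```r
set.seed(2)

# Simulate one data set: the first n_true genes depend on the response
simulate_dataset <- function(n, n_genes = 50, n_true = 5) {
  response <- rnorm(n)
  slopes <- c(rep(1, n_true), rep(0, n_genes - n_true))
  expr <- outer(slopes, response) + matrix(rnorm(n_genes * n), n_genes, n)
  list(expr = expr, response = response)
}

# Per-gene p-value for the response slope within a single data set
per_gene_p <- function(d) {
  apply(d$expr, 1, function(g) summary(lm(g ~ d$response))$coefficients[2, 4])
}

p_a <- per_gene_p(simulate_dataset(40))  # stands in for the Nextera samples
p_b <- per_gene_p(simulate_dataset(60))  # stands in for the ScriptSeq samples

# Fisher's method: -2 * sum(log p) is chi-squared with 2k df under the null
# (k = 2 data sets here, so df = 4)
stat <- -2 * (log(p_a) + log(p_b))
p_combined <- pchisq(stat, df = 4, lower.tail = FALSE)
fdr <- p.adjust(p_combined, method = "BH")
print(which(fdr < 0.05))
```

Each data set keeps its own normalization and variance modelling this way, and only the summary statistics are merged.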


Thank you, Aaron 

written 19 months ago by ashwini.kumar10