Use of batch-corrected (removeBatchEffect) read counts for further downstream analysis
@ashwinikumar-13795

Hi,
I have RNA-seq data from 100 samples: 40 samples were sequenced using one library preparation (Nextera) and 60 samples using another (ScriptSeq), which introduced a known batch effect. We now want to analyze these samples together. To correct this known batch effect, I am using the following two approaches. Please suggest which one is correct, and whether we can use these batch-corrected expression values for further analyses that require already batch-corrected values, e.g. network analysis.
1. CPM
library(edgeR)
library(limma)   # removeBatchEffect() comes from limma, not edgeR

dge <- DGEList(counts = count)
dge <- calcNormFactors(dge, method = "TMM")
logCPM <- cpm(dge, log = TRUE, prior.count = 5)
logCPM <- removeBatchEffect(logCPM, batch = batch, batch2 = batch2)

2. VOOM

library(edgeR)
library(limma)   # voom() and removeBatchEffect() are in limma

y <- DGEList(counts = count)
y <- calcNormFactors(y, method = "TMM")   # normalize before voom
voom.data <- voom(y, design = design.voom, plot = FALSE)
logCPM_voom <- removeBatchEffect(voom.data$E, batch = batch, batch2 = batch2)

We also have drug response data from these 100 samples, and I would like to correlate gene expression with drug response. Moreover, I want to use WGCNA and other network analysis approaches that need already batch-corrected expression values. Can I use these batch-corrected values in downstream analyses, e.g. correlation between drug response and gene expression, or are these values only useful for visualization purposes, e.g. heatmaps and PCA clustering?

Thank you!

Best regards,

Ashwini 

Tags: removeBatchEffect, limma
Aaron Lun
@alun

To answer your immediate question: we generally recommend using the first approach. A large prior.count avoids getting very large negative values from zero counts, which could be misleading during visualization.
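To see the effect of the prior count in concrete numbers, here is a rough illustration (in Python, with a made-up 10M-read library; edgeR's actual log-CPM calculation scales the prior by the effective library size, so this simplified formula only approximates cpm(..., log=TRUE)):

```python
import math

def log_cpm(count, lib_size, prior=0.5):
    # Simplified log-CPM with a prior count; edgeR additionally scales
    # the prior by library size, so this is only an approximation.
    return math.log2((count + prior) / (lib_size + 2 * prior) * 1e6)

lib = 10_000_000                       # a made-up 10M-read library
small = log_cpm(0, lib, prior=0.5)     # zero count, default-ish prior
large = log_cpm(0, lib, prior=5)       # zero count, prior.count = 5
# With prior=0.5 a zero count maps to roughly -4.3; with prior=5 it is
# damped to roughly -1, avoiding extreme negative values.
```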

However, batch-corrected expression values should only be used when you have no other choice, i.e., in procedures that do not accept design matrices. This includes visualization with heatmaps, dimensionality reduction and various forms of clustering; it may include network analyses, though you'll have to look at those methods specifically. If you are looking for changes in expression with respect to drug response, you can do that as part of a standard DE analysis with blocking on batch - there is no need to use batch-corrected values.
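For intuition about what removeBatchEffect actually does: it fits a linear model with batch as a covariate and subtracts the fitted batch term from the log-expression matrix. A minimal numpy sketch of that idea on simulated data (not limma's implementation, which also handles design matrices, two batch factors, and covariates):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 100, 10
batch = np.array([0] * 5 + [1] * 5)        # two batches of 5 samples each
expr = rng.normal(5.0, 1.0, (n_genes, n_samples))
expr[:, batch == 1] += 2.0                 # simulate an additive batch shift

pre_shift = expr[:, batch == 1].mean() - expr[:, batch == 0].mean()

# Centered batch indicator: orthogonal to the intercept, so the
# per-gene average expression is preserved after correction.
X = (batch - batch.mean()).reshape(-1, 1)
beta, *_ = np.linalg.lstsq(X, expr.T, rcond=None)  # per-gene batch coefficient
corrected = expr - (X @ beta).T            # subtract the fitted batch term

post_shift = corrected[:, batch == 1].mean() - corrected[:, batch == 0].mean()
# pre_shift is about 2 (the simulated batch effect); post_shift is ~0.
```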

More generally, though, I would question the wisdom of forcing two data sets into a single analysis. Differences in the technology will likely result in differences in the mean-variance relationship, normalization, etc., which will not be properly handled if all samples are lumped together. I would suggest doing some type of meta-analysis instead, e.g., A: Merge different datasets regarding a common number of DE genes in R.


Thank you, Aaron 


Hi Aaron,

Do you think this approach is right for making a clustering heatmap?

library(edgeR)
library(limma)   # removeBatchEffect() is in limma

dge <- DGEList(counts = count)
dge <- calcNormFactors(dge, method = "TMM")
logCPM <- cpm(dge, log = TRUE, prior.count = 5)
logCPM_bc <- removeBatchEffect(logCPM, batch = batch)

# Scale each gene (row) to z-scores across samples
logCPM_z <- t(scale(t(logCPM_bc)))

And then using this logCPM_z for the clustering heatmap. Is this right?
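For reference, t(scale(t(logCPM_bc))) standardizes each gene (row) to mean 0 and standard deviation 1 across samples. A numpy equivalent on simulated data (note that R's scale() uses the sample standard deviation, i.e. ddof = 1):

```python
import numpy as np

rng = np.random.default_rng(1)
logCPM_bc = rng.normal(5.0, 2.0, (50, 8))  # 50 genes x 8 samples (simulated)

# Row-wise z-scores, matching R's t(scale(t(x)))
mu = logCPM_bc.mean(axis=1, keepdims=True)
sd = logCPM_bc.std(axis=1, ddof=1, keepdims=True)  # sample SD, as in R
logCPM_z = (logCPM_bc - mu) / sd
# Each row of logCPM_z now has mean 0 and sample SD 1.
```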


Don't resurrect old threads; the original question is 2 years old.

And besides, you already have a perfectly good answer here.
