I have a RNA-seq dataset with two treatments A and B (+/+, +/-, -/+, -/- design) and a batch effect. Both environments are equally represented in all batches. For differential expression analysis, I have added this batch effect as an element in the design matrix in EdgeR and it works perfect.
I also want to make a clustered heatmap from the genes that were found to be differentially expressed. When I do this, the batch effect is visible and samples form the same batch group together (but there is a stronger grouping from treatment A). I wanted to remove this batch effect and re-do the clustering and heat map generation. For this purpose, I tried the limma package function removeBatchEffect() and a few other similar tools.
Here is what I have done:
(1) Defined three experimental factors: A, B and batch.
(2) Made a design matrix that only includes A and B, but not batch.
(3) Put raw, non-normalized counts (non-integer expected counts output from RSEM) in a DGEList.
(4) Used TMM-normalization.
(5) Converted to log2 counts per million.
(6) Ran removeBatchEffect() on the result of (5) and specified the design matrix without batch and the batch factor.
Here is the code I ran:
# Loading required R libraries. library(gplots) library(limma) library(edgeR) # Define experimental factors and design matrix. batch <- factor(substring(colnames(data), 12, 12)) factorA <- factor(substring(colnames(data), 1, 6)) factorB <- factor(substring(colnames(data), 7, 10)) design <- model.matrix(~factorA*factorB) # Put counts into DGEList. y <- DGEList(counts=data) # TMM-normalization. y <- calcNormFactors(y) # Convert to CPM and log2 transformation. logCPM <- cpm(y, log=TRUE, prior.count=0.1) # Remove batch effect. logCPM_no_batch <- removeBatchEffect(logCPM, batch=batch, design=design) # Make heat map heatmap(logCPM_no_batch)
The results look absolutely fantastic. Samples from the same batch no longer cluster at all, yet the basic appearance of the heatmap is the same in term of treatment effects.
Previous attempts at this some months ago produced horrible results (no heatmap patterning at all, not even small regions), presumably because I made mistakes, wrote wrong things etc. But when things goes from horrible to absolutely fantastic from an afternoon's work, I instantly become suspicious that it is too good to be true and that I have made some other error that makes it all invalid somehow. Sure, it is a bit paranoid perhaps, but better safe than sorry in these relatively complicated scientific matters.
So my two questions are:
(1) Is the usage above, both the direct application of removeBatchEffect() and the prior treatment of the data appropriate for the stated goal of making a "batch-free" clustered heatmap? Any red flags that pop up? I have browsed through several previous posts and looked at the manual for this function before asking.
(2) Is it crucial to include a length normalization before making a clustered heatmap for these kind of RNA-seq-based differential expression figures? Obviously, the gene length is exactly the same for all samples for a given gene, so it will not bias within-gene between-sample comparisons. What would I lose if i skipped it? Would it affect any technical aspect (such as gene hierarchical clustering) or what kind of conclusions that can be drawn from the heatmap? If I wanted to include it, is it enough to switch from cpm() to rpkm() and add a vector with gene lengths?