Search
Question: Is the following a correct usage of Limma's removeBatchEffect() for clustered heat map generation?
2
2.9 years ago by
Ekarl230
Sweden
Ekarl230 wrote:

I have a RNA-seq dataset with two treatments A and B (+/+, +/-, -/+, -/- design) and a batch effect. Both environments are equally represented in all batches. For differential expression analysis, I have added this batch effect as an element in the design matrix in EdgeR and it works perfect.

I also want to make a clustered heatmap from the genes that were found to be differentially expressed. When I do this, the batch effect is visible and samples form the same batch group together (but there is a stronger grouping from treatment A). I wanted to remove this batch effect and re-do the clustering and heat map generation. For this purpose, I tried the limma package function removeBatchEffect() and a few other similar tools.

Here is what I have done:

(1) Defined three experimental factors: A, B and batch.
(2) Made a design matrix that only includes A and B, but not batch.
(3) Put raw, non-normalized counts (non-integer expected counts output from RSEM) in a DGEList.
(4) Used TMM-normalization.
(5) Converted to log2 counts per million.
(6) Ran removeBatchEffect() on the result of (5) and specified the design matrix without batch and the batch factor.

Here is the code I ran:

# Loading required R libraries.
library(gplots)
library(limma)
library(edgeR)

# Define experimental factors and design matrix.
batch <- factor(substring(colnames(data), 12, 12))
factorA <- factor(substring(colnames(data), 1, 6))
factorB <- factor(substring(colnames(data), 7, 10))
design <- model.matrix(~factorA*factorB)

# Put counts into DGEList.
y <- DGEList(counts=data)

# TMM-normalization.
y <- calcNormFactors(y)

# Convert to CPM and log2 transformation.
logCPM <- cpm(y, log=TRUE, prior.count=0.1)

# Remove batch effect.
logCPM_no_batch <- removeBatchEffect(logCPM, batch=batch, design=design)

# Make heat map
heatmap(logCPM_no_batch)

The results look absolutely fantastic. Samples from the same batch no longer cluster at all, yet the basic appearance of the heatmap is the same in term of treatment effects.

Previous attempts at this some months ago produced horrible results (no heatmap patterning at all, not even small regions), presumably because I made mistakes, wrote wrong things etc. But when things goes from horrible to absolutely fantastic from an afternoon's work, I instantly become suspicious that it is too good to be true and that I have made some other error that makes it all invalid somehow. Sure, it is a bit paranoid perhaps, but better safe than sorry in these relatively complicated scientific matters.

So my two questions are:

(1) Is the usage above, both the direct application of removeBatchEffect() and the prior treatment of the data appropriate for the stated goal of making a "batch-free" clustered heatmap? Any red flags that pop up? I have browsed through several previous posts and looked at the manual for this function before asking.

(2) Is it crucial to include a length normalization before making a clustered heatmap for these kind of RNA-seq-based differential expression figures? Obviously, the gene length is exactly the same for all samples for a given gene, so it will not bias within-gene between-sample comparisons. What would I lose if i skipped it? Would it affect any technical aspect (such as gene hierarchical clustering) or what kind of conclusions that can be drawn from the heatmap? If I wanted to include it, is it enough to switch from cpm() to rpkm() and add a vector with gene lengths?

modified 2.9 years ago by Bernd Klaus520 • written 2.9 years ago by Ekarl230
3
2.9 years ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:
1. Looks fine to me. I would suggest that you mean-center your expression values before calling heatmap (i.e., by subtracting rowMeans(logCPM_no_batch) from logCPM_no_batch). This ensures that your colours aren't all-red or all-blue for highly or lowly expressed genes, respectively. Rather, the heatmap will focus on differences in relative expression within each gene. We also tend to use larger prior.count for visualisation, just to avoid being the range of colours being dominated by extreme log-fold differences due to zeroes; but that's a nitpicky point, so you'd probably be okay.
2. If you do the mean-centering like I suggested above, then length normalization won't matter because it'll just cancel out within each gene. By and large, most downstream analyses will compare within genes, so you needn't worry about length normalization in most situations. But if you did, then rpkm would be the way to go (provided you use a vector of gene lengths matching the annotation used for counting).

In the part of the custom code I plan to use for making a heat map (I added heatmap() for simplicity), I use the following code:

data <- t(scale(t(logCPM_no_batch), scale=F))

Does this accomplish mean-centering in the way you propose?

For the purposes of this heat map, it will just be an illustration in a paper. No downstream analysis will be done using the heatmap as such, but instead be done on the edgeR differential expression analysis. Am I understanding you correctly that it will then not really matter if I skip length normalization for the purposes of generating a heatmap?

I will try rpkm() just so I can see what the results look like.

Yes (though I find it easier just to subtract it directly) and yes (it'll cancel out when you centre the log-expression).

There is no need to mean-center because heatmap() does this automatically by default. See the 'scale' argument.

The only reason for scaling yourself is if you want to mean correct but not scale the sd. In that case the fastest and easiest code is:

logCPM.zeromean <- logCPM_no_batch - rowMeans(logCPM_no_batch)

Of course this will be irrelevant unless you turn off the scaling that heatmap() does. And it is more usual to allow heatmap() to scale both mean and sd.

Please use a larger value for prior.count. We use values in the range 3--5. It can make a difference.

As Aaron says, length normalization makes no difference. cpm() and rpkm() will give identical heatmaps.

Thank you for your valuable reply, particularly with regards to prior count and rpkm().

If I

data <- t(scale(t(logCPM_no_batch), scale=T))

instead of scale=F and set scale="none" in heatmap() would this produce a heat map that had both mean and sd scaled, thus being more in line with what is usually done? This is how I interpret the scale() documentation, but just want to make sure I am not going off the rails here.

The reason I ask is that if i comment out the data <- t(scale(t(logCPM_no_batch), scale=T)) and use scale defaults in heatmap, heatmap() complains that "x' must be a numeric matrix".

It isn't possible for the code given in your post to give the error message you mention from heatmap() because removeBatchEffect() always produces a numeric matrix. You already told us in your original post that the code worked.

It would seem that you may have introduced a problem somewhere with other code that you haven't mentioned here. I suggest you solve that problem rather than using an ad hoc fix with scale().

I'm conflicted on whether or not to post as a new question, but since this is a directly related follow up I'll ask here.

Recently I wanted to present the results of how accounting for batch in a differential expression analysis from RNA-seq will impact the downstream results aside from just providing two topTable results (one each from a design with and without a batch covariate) for comparison, so I also wanted to show before / after heatmaps of raw / corrected data.

I initially provided these results using the suggested workflow here, which is to provide the results from cpm(dgelist, log=TRUE, prior.count=5) to removeBatchEffect to get your "batch corrected data".

Skimming through the removeBatchEffect documentation, however, I realized that it explicitly calls out that the function will utilize weights, if available, in estimation of the batch effects.

This made me think about voom, but passing in a voomed EList dictates that the prior.count=0.5

At the risk of beating a dead horse here, my question is what's the official party line here? What would the source of the weights be that removeBatchEffect is referencing if not from voom? I realize that the documentation in removeBatchEffect that points to weights may have been there since before voom, so perhaps it's another function that shoots weights along with the input object x, but not sure where that might come from.

1

There's also other weights like the array weights from, well, the arrayWeights function. Obviously, removeBatchEffects is not restricted to RNA-seq data, so the use of weights is more general than just those from voom. My view is that I only ever use removeBatchEffects for visualization, so I don't bother worrying about weights to improve accuracy - the human eye can't distinguish between "very blue" and "very very blue" anyway. However, I do worry when I get a solid red/blue block because of extreme log-fold changes from low counts. Hence, the use of a large prior count in cpm.

I had initially thought that it could be referencing weights fom arrayWeights, but arrayWeights doesn't "load" those results into the EList (presumably what you pass into removeBatchEffect as x), but rather returns a vector of weights, so wasn't sure if that was it.

But I now realize that you can just use the dots ... and pass in a weights argument which then passes them down to lmFit which then does the job, ie.

aw <- arrayWeights(x, design, ...)
b0 <- removeBatchEffects(x, batch, weights=aw, ...)

Sorry I didn't put 2 & 2 together like that before.

To address your point about extreme log fold changes blowing out the signal in your heatmap, I've found the circlize::colorRamp2 (which is what ComplexHeatmap uses by default to generate its colors) awesome here. Values outside of the specified breaks are simply set to the break value/color for plotting purposes, which is nice.

1
2.9 years ago by
Bernd Klaus520
Germany
Bernd Klaus520 wrote:

Hi EKarl2,

just add something potentially useful to you and maybe others: removeBatchEffect() basically substracts estimated the batch  coefficients/means from the expression data.

I implemented the same strategy a while ago using simulated data and DESeq2, detailing the the steps involved. For what it's worth, here is the code, I use a PCA instead of a heatmap to visualize the differences, but it should give you an idea on how it works:

set.seed(777)
library(DESeq2)
library(ggplot2)
library(matrixStats)

## !!! caveat: here the size factors are all equal to one to simply the calculations !!!

### create example data set with differential expression between the two groups
dds <- makeExampleDESeqDataSet(m=8, betaSD = 2)

## add sex as a batch effect, there are two males and two females in each group
colData(dds)$sex <- factor(rep(c("m", "f"), 4)) ## modify counts to add a batch effect, we add normally distributed random noise ## with mean 2 to randomly selected genes of male samples and then round the result cts <- counts(dds) ranGenes <- floor(runif(300)*1000) for(i in ranGenes){ cts[i, colData(dds)$sex == "m"] <- as.integer(cts[i, colData(dds)$sex == "m"] + round(rnorm(1,4, sd = 1))) } counts(dds) <- cts ## produce PCA plot to see the batch effects rld <- rlogTransformation(dds, blind=TRUE) ntop = 500 Pvars <- rowVars(assay(rld)) select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))] PCA <- prcomp(t(assay(rld)[select, ]), scale = F) percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], PC3 = PCA$x[,3], PC4 = PCA$x[,4], sex = colData(rld)$sex,
condition = colData(rld)$condition) (qplot(PC1, PC2, data = dataGG, color = condition, shape = sex, main = "PC1 vs PC2, top variable genes", size = I(6)) + labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)), y = paste0("PC2, VarExp:", round(percentVar[2],4))) + scale_colour_brewer(type="qual", palette=2) ) ## the first PC shows a clear group separation, while the second ## PC shows a clear clustering by sex # We now run the DESeq computations and ## extract the fitted sex coefficient to remove the batch effect design(dds) <- ~ sex + condition dds <- DESeq(dds) ## display the names of the fitted coefficients resultsNames(dds) ## Note that DESeq2 will fit a coefficient for each group, which ## creates non-full rank design matrices. However, due to the shrinkage estimation (beta-shrinkage), ## this is not a problem ## extract means for females on the original scale, they scatter around 1 ## make sure to turn off outlier detection and independent filtering ## to avoid NAs mean.f <- results(dds, contrast = c(0,1,0,0,0), independentFiltering = FALSE, cooksCutoff = FALSE)$log2FoldChange

hist(2^mean.f, col = "darkgreen", main = "Female effects")

## extract means for males on the log2 scale, we  see a strong right hand side tail
## that corresponds ot the genes affected by the sex effect.
mean.m <- results(dds, contrast = c(0,0,1,0,0),
independentFiltering = FALSE,
cooksCutoff = FALSE)$log2FoldChange hist(2^mean.m, col = "lavender", main = "Male effects") ## remaining NAs correspond to genes with zero counts in all samples ## We thus set the NAs to zero mean.m[is.na(mean.m)] <- 0 mean.f[is.na(mean.f)] <- 0 ## now remove the batch effects cts.corrected <- counts(dds) for(i in 1:1000){ cts.corrected[i, colData(dds)$sex == "f"] <- as.integer(round(cts[i, colData(dds)$sex == "f"]/2^mean.f[i])) cts.corrected[i, colData(dds)$sex == "m"] <- as.integer(round(cts[i, colData(dds)$sex == "m"]/2^mean.m[i])) } counts(dds) <- cts.corrected ## now produce PCA plot again ## produce PCA plot to see the batch effects rld <- rlogTransformation(dds, blind=TRUE) ntop = 500 Pvars <- rowVars(assay(rld)) select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop, length(Pvars)))] PCA <- prcomp(t(assay(rld)[select, ]), scale = F) percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1) dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], PC3 = PCA$x[,3], PC4 = PCA$x[,4], sex = colData(rld)$sex,
condition = colData(rld)\$condition)

(qplot(PC1, PC2, data = dataGG, color =  condition, shape = sex,
main = "PC1 vs PC2, top variable genes, corrected data", size = I(6))
+ labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)),
y = paste0("PC2, VarExp:", round(percentVar[2],4)))
+ scale_colour_brewer(type="qual", palette=2)
)

## The correction was successful, the second PC no longer separates
## samples  according to sex`