Dear Dr. Smyth,
We are analyzing some RNA-seq samples collected in different batches, where the batch is a known variable. To account for that we reasoned we could use a linear model to include the batch effect and then remove it.
Since the voom+limma approach is shown to work well for differential gene expression, we thought of estimating the weights for each observation through voom and then use them in the limma function removeBatchEffect(). In the end we get log2(cpm) corrected for the batch (I guess?) and we get some biologically meaningful clustering.
As a next step, we wanted to cluster genes based on the batch corrected expression values.
So my questions are:
1) is this approach for batch effect correction valid in the first place?
2) Would it make sense to use log2(cpm) corrected values to cluster genes, without taking the gene length into account? Should we worry that longer genes could cluster together just because of their length? One solution would be transforming log2(cpm) to RPKM, but by doing this we lose some benefits of the batch correction step.
Many thanks in advance for your time and help.
Best regards,
Alessandra Breschi
Hi Gordon,
I came across this thread while searching for answers for my own voom/removeBatchEffect question. First, could you elaborate as to why you prefer to use cpm(dge, log=TRUE, prior.count=3) instead of voom() for clustering and/or input into removeBatchEffect(), even if you are using voom() for the statistical analysis?
Second, would you also recommend using cpm() or voom()$E as input into sva to estimate surrogate variables? [I prefer using sva() on TMM-normalized values over svaseq() because svaseq doesn't appear to be able to incorporate the normalization factors]. Given the situation of
sva.obj <- sva(v$E, mod1, mod0,method="irw",n.sv=10)
in this post (A: sva + voom + limma) your response was "It's a pity that sva() can't use weights, but I can't suggest anything better."Finally, my own question is, if I am using voom - sva - voom for analysis but also want to remove the surrogate variables via removeBatchEffect() for visualization purposes, should I be passing the EList from the second voom call or the first? The v$E values are the same, only the weights are different. Or should I still use cpm() as input? A hypothetical example:
I've tested this on an experiment where there were few expression differences due to group, and done PCA clustering on the 3 sets of corrected values (I can't share the data, but see images below). no.sv2 showed much more separation between the groups than no.sv1 or no.sv3, but the other two make more sense based on the statistical results. Which one is more correct to use for visualizations?
Thanks,
Jenny
voom always calculates the same abundance values. It's nearly equivalent to
cpm(y, log=TRUE, prior.count=0.5)
. So the only difference between the two voom calls should be the weights, and in fact the first voom call is effectively extraneous, since the weights are never used by sva. Generally, I would runcpm(y, log=TRUE, prior.count=0.5)
and then run sva on the resulting logCPM values, and then incorporate the SVs into the design matrix and run voom and the rest of the differential expression pipeline. This corresponds to the "design2" and "v2" variables in your code. However, I'm surprised that there's such a large difference between the PCA plots for no.sv1 and no.sv2. The only difference between the two should be the weights calculated by voom, and generally the lowess curve used to determine the weights should not be substantially affected by changing the design. So you should inspect the plots produced by re-running both voom calls with plot=TRUE to see if they produce similar mean-variance trends.As for clustering and other such applications, the methods used typically don't allow for the use of weights, so voom is not usable in this case. Instead, we use a larger prior count to down-weight the contribution of low-abundance genes by squeezing their relative abundances toward the mean for each gene. This allows algorithms to focus more on the higher-abundance genes, whose measurements have better precision, without explicitly weighting the values.
Hi Ryan,
Thanks for your reply. The lowess curves between the two voom weights are very similar, just the second one is shifted down slightly because the variances are slightly lower, as expected by adding the surrogate variables. I checked the PCAs again, and it turns out that PC2 and PC3 explain almost the same amount of variance and that the weights were just different enough to switch the ordering between the two - no.sv1 PC1 vs. PC3 looks very similar to no.sv2 PC1 vs. PC2. And heatmaps of the significant genes look almost identical, so I think it turns out not to matter that much!
Jenny
Thanks for the prompt response!