voom + removeBatchEffect + gene clustering

Dear Dr. Smyth,

We are analyzing some RNA-seq samples collected in different batches, where the batch is a known variable. To account for that we reasoned we could use a linear model to include the batch effect and then remove it. 

Since the voom+limma approach has been shown to work well for differential gene expression, we thought of estimating the weights for each observation with voom and then using them in the limma function removeBatchEffect(). In the end we get log2(cpm) values corrected for the batch (I guess?) and we get some biologically meaningful clustering.

As a next step, we wanted to cluster genes based on the batch corrected expression values.

So my questions are:

1) is this approach for batch effect correction valid in the first place?

2) Would it make sense to use log2(cpm) corrected values to cluster genes, without taking the gene length into account? Should we worry that longer genes could cluster together just because of their length? One solution would be transforming log2(cpm) to RPKM, but by doing this we lose some benefits of the batch correction step.

Many thanks in advance for your time and help.

Best regards,

Alessandra Breschi

Tags: voom, limma, clustering, removeBatchEffect(), batch effect

Answer from Gordon Smyth (WEHI, Melbourne, Australia):

1) is this approach for batch effect correction valid in the first place?

Yes, this is ok for clustering, although personally I prefer to use cpm(dge, log=TRUE, prior.count=3) or rpkm(dge, log=TRUE, prior.count=3) instead of voom() for this purpose.
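
For concreteness, a minimal sketch of that suggestion; the object names counts, batch and group are placeholders for your own data, and the prior count is just one reasonable choice:

library(edgeR)
library(limma)

dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log = TRUE, prior.count = 3)

# keep the biological groups in the design so their signal is not
# removed together with the batch effect
design <- model.matrix(~ group)
logCPM.corrected <- removeBatchEffect(logCPM, batch = batch, design = design)

# cluster samples on the batch-corrected values
plotMDS(logCPM.corrected)
hc <- hclust(dist(t(logCPM.corrected)))
plot(hc)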

2) Would it make sense to use log2(cpm) corrected values to cluster genes, without taking the gene length into account? Should we worry that longer genes could cluster together just because of their length?

It depends on what clustering method you are intending to use. If you are planning to cluster by Euclidean distance without standardizing the expression scores, then you should convert to logRPKM. With other common clustering methods, it will make no difference whether you use logCPM or logRPKM.
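
A sketch to illustrate that point, assuming expr is a genes-by-samples matrix of log-expression values (for example the batch-corrected logCPM above): switching from logCPM to logRPKM only subtracts a per-gene constant, so only unstandardized Euclidean distances between genes are affected.

d.euclid <- dist(expr)                  # sensitive to the gene-length offset
d.cor    <- as.dist(1 - cor(t(expr)))   # correlation distance: unaffected
d.std    <- dist(t(scale(t(expr))))     # Euclidean on row-standardized genes: unaffected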

One solution would be transforming log2(cpm) to RPKM, but by doing this we lose some benefits of the batch correction step.

You can convert to logRPKM simply by subtracting log2(GeneLength/1000) from each row of the voom logCPM output. This does not lose any of the benefits of the batch correction. You can do this before or after the batch correction.
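
For example (a sketch; gene.length is assumed to be a vector of gene lengths in bases, in the same order as the rows of the logCPM matrix):

logRPKM <- logCPM - log2(gene.length / 1000)

# equivalently, edgeR can compute it directly from the DGEList:
# logRPKM <- rpkm(dge, gene.length = gene.length, log = TRUE, prior.count = 3)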

Comment from Jenny:

Hi Gordon,

I came across this thread while searching for answers to my own voom/removeBatchEffect question. First, could you elaborate on why you prefer to use cpm(dge, log=TRUE, prior.count=3) instead of voom() for clustering and/or as input to removeBatchEffect(), even if you are using voom() for the statistical analysis?

Second, would you also recommend using cpm() or voom()$E as input into sva to estimate surrogate variables? (I prefer using sva() on TMM-normalized values over svaseq() because svaseq() doesn't appear to be able to incorporate the normalization factors.) Regarding the call sva.obj <- sva(v$E, mod1, mod0, method="irw", n.sv=10) in this post (A: sva + voom + limma), your response was "It's a pity that sva() can't use weights, but I can't suggest anything better."

Finally, my own question is, if I am using voom - sva - voom for analysis but also want to remove the surrogate variables via removeBatchEffect() for visualization purposes, should I be passing the EList from the second voom call or the first? The v$E values are the same, only the weights are different. Or should I still use cpm() as input? A hypothetical example:

library(edgeR)
library(limma)
library(sva)

y <- DGEList(counts)
y <- calcNormFactors(y)
design <- model.matrix(~Groups)  # 4 groups
v <- voom(y, design)
svobj <- sva(v$E, design, design[, 1, drop = FALSE])  # intercept-only null model
design2 <- cbind(design, svobj$sv)
v2 <- voom(y, design2)
fit <- eBayes(lmFit(v2, design2))  # for the stats

# three candidate inputs for visualization
no.sv1 <- removeBatchEffect(v, design = design, covariates = svobj$sv)
no.sv2 <- removeBatchEffect(v2, design = design, covariates = svobj$sv)
no.sv3 <- removeBatchEffect(cpm(y, log = TRUE, prior.count = 3),
                            design = design, covariates = svobj$sv)

I've tested this on an experiment where there were few expression differences due to group, and run PCA on the three sets of corrected values (I can't share the data). no.sv2 showed much more separation between the groups than no.sv1 or no.sv3, but the other two make more sense based on the statistical results. Which one is more correct to use for visualizations?

Thanks,
Jenny

Reply from Ryan:

voom always calculates the same abundance values. It's nearly equivalent to cpm(y, log=TRUE, prior.count=0.5). So the only difference between the two voom calls should be the weights, and in fact the first voom call is effectively extraneous, since the weights are never used by sva. Generally, I would run cpm(y, log=TRUE, prior.count=0.5) and then run sva on the resulting logCPM values, and then incorporate the SVs into the design matrix and run voom and the rest of the differential expression pipeline. This corresponds to the "design2" and "v2" variables in your code. However, I'm surprised that there's such a large difference between the PCA plots for no.sv1 and no.sv2. The only difference between the two should be the weights calculated by voom, and generally the lowess curve used to determine the weights should not be substantially affected by changing the design. So you should inspect the plots produced by re-running both voom calls with plot=TRUE to see if they produce similar mean-variance trends.
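
For concreteness, a sketch of that ordering and of the suggested diagnostic, reusing the hypothetical y, Groups and design objects from the code above:

library(sva)

logCPM <- cpm(y, log = TRUE, prior.count = 0.5)
svobj <- sva(logCPM, design, design[, 1, drop = FALSE])
design2 <- cbind(design, svobj$sv)

# diagnostic: compare the mean-variance trends with and without the SVs
v1 <- voom(y, design, plot = TRUE)
v2 <- voom(y, design2, plot = TRUE)

fit <- eBayes(lmFit(v2, design2))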

As for clustering and other such applications, the methods used typically don't allow for the use of weights, so voom is not usable in this case. Instead, we use a larger prior count to down-weight the contribution of low-abundance genes by squeezing their relative abundances toward the mean for each gene. This allows algorithms to focus more on the higher-abundance genes, whose measurements have better precision, without explicitly weighting the values.
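
A small illustration of the prior-count effect, using made-up counts for one low-abundance gene in two samples with equal library sizes:

library(edgeR)

counts.demo <- matrix(c(1, 10), nrow = 1,
                      dimnames = list("lowGene", c("s1", "s2")))

# the between-sample difference for this gene is noticeably smaller with
# the larger prior count, so it carries less weight in the distance
# calculations used for clustering
cpm(counts.demo, lib.size = c(1e6, 1e6), log = TRUE, prior.count = 0.5)
cpm(counts.demo, lib.size = c(1e6, 1e6), log = TRUE, prior.count = 3)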

Hi Ryan,

Thanks for your reply. The lowess curves from the two voom calls are very similar; the second one is just shifted down slightly because the variances are slightly lower, as expected after adding the surrogate variables. I checked the PCAs again, and it turns out that PC2 and PC3 explain almost the same amount of variance, and that the weights were just different enough to switch their ordering between the two: no.sv1 PC1 vs. PC3 looks very similar to no.sv2 PC1 vs. PC2. Heatmaps of the significant genes also look almost identical, so I think it turns out not to matter that much!

Jenny

Thanks for the prompt response!
