Limma: How to deal with multiple batch effects and what is going on under the hood
jmannhei

Hi all,

I have a set of RNA-seq samples that cluster into different groups under a certain gene signature. The samples come from two different batches, so batch needs to be factored into the model. Say I want to take the quarter of my samples that respond best to some treatment and the quarter that responds worst, and see whether there are any differentially expressed genes between them.
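For concreteness, a minimal sketch of how such a grouping might be set up; score, allCounts and samples are hypothetical placeholders for a per-sample response measure, the full count matrix and the sample annotation:

# Hypothetical sketch: keep the top and bottom quarters of a response score
# and label them G (good) and P (poor); P is made the reference level so the
# comparison of interest appears as the "GroupG" coefficient later on.
q <- quantile(score, probs = c(0.25, 0.75))
keep <- score <= q[1] | score >= q[2]
Group <- factor(ifelse(score[keep] >= q[2], "G", "P"), levels = c("P", "G"))
X0 <- allCounts[, keep]
Batch <- samples$Batch[keep]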

Option 1: split the dataset into two groups, one for the first cluster and one for the other, and run the DEG analysis as outlined in the limma User's Guide https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf (page 44)

# Batch is the batch variable, Group is the response group (G = good, P = poor),
# and X0 is the RNA-seq count matrix. This assumes P is the reference level of
# Group, so the G-vs-P comparison appears as the "GroupG" coefficient.
design <- model.matrix(~Group+Batch)
y <- voom(X0, design, plot=TRUE)
fit1 <- lmFit(y, design)
fit <- eBayes(fit1)
tpt.table <- topTable(fit, coef="GroupG", sort.by='p', number=Inf, adjust="BH")

There is also another way I have seen it done, using a contrasts matrix:

design <- model.matrix(~0+Group+Batch)
cmat <- makeContrasts(diff=GroupG-GroupP, levels=design)
y <- voom(X0, design, plot=TRUE)
fit1 <- lmFit(y, design)
fit2 <- contrasts.fit(fit1, cmat)
fit <- eBayes(fit2)
tpt.table <- topTable(fit, sort.by='p', number=Inf, adjust="BH")

The first thing I notice is that the results from these two approaches are different. Generally, the first one tends to be more conservative. Is there a better or more correct way, statistically speaking?

In my specific experiment, this method of breaking the dataset into two sets by cluster and running the DEG analysis separately yielded no statistically significant genes.

To increase the power, I figured I might use the whole dataset and treat cluster as another batch variable, like so:

design <- model.matrix(~Group+Cluster+Batch)
y <- voom(X0, design, plot=TRUE)
fit1 <- lmFit(y, design)
fit <- eBayes(fit1)
tpt.table <- topTable(fit, coef='GroupG', sort.by='p', number=Inf, adjust="BH")

Doing it this way yielded quite a number of significant genes. Even when the number of samples is similar to the per-cluster analyses (say, cutting each cluster in half), it still performs well, which makes me somewhat suspicious.

Additionally, you can approach it with makeContrasts:

design <- model.matrix(~0+Group+Cluster+Batch)
cmat <- makeContrasts(diff=GroupG-GroupP, levels=design)
y <- voom(X0, design, plot=TRUE)
fit1 <- lmFit(y, design)
fit2 <- contrasts.fit(fit1, cmat)
fit <- eBayes(fit2)
tpt.table <- topTable(fit, sort.by='p', number=Inf, adjust="BH")

Doing it this way led to a 50% increase in DEGs meeting a 0.05 BH-corrected significance cutoff. So why are the results so drastically different? Should I take another approach, such as modelling cluster not as a batch effect but as an additional effect parameter, and then forming a contrasts matrix that only compares the Groups? Any help would be much appreciated.

limma voom

Using ~Group without the 0+ is correct. I will explain why when I have more time.

@gordon-smyth
WEHI, Melbourne, Australia

You should use

design <-  model.matrix(~Group+Batch)

rather than

design <- model.matrix(~0+Group+Batch)
cmat <- makeContrasts(diff=GroupG-GroupP,levels=design)

Similarly for the second scenario with Cluster in the model.

The reason is that contrasts.fit gives approximate results when there are voom weights and more than one explanatory factor in the model. See the Note in help("contrasts.fit") for an explanation. I have never been able to fix this issue because so many interlocking things in the limma pipeline rely on things working as they do. In most cases, the results from contrasts.fit are not very different from the exact pipeline. It is not more or less conservative in general; it can go either way.
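A minimal sketch of the exact pipeline for this comparison, assuming Group has levels P and G with P as the reference so that the G-vs-P comparison is the GroupG coefficient:

# Exact pipeline: the comparison of interest is a model coefficient, so no
# contrasts.fit step is needed and the voom weights are handled exactly.
design <- model.matrix(~Group + Batch)
y <- voom(X0, design, plot = TRUE)
fit <- lmFit(y, design)
fit <- eBayes(fit)
topTable(fit, coef = "GroupG", sort.by = "p", number = Inf, adjust.method = "BH")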

Naturally analysing both clusters together with

design<-model.matrix(~Group+Cluster+Batch)

will give more statistical power, so that's not a surprise. The combined analysis allows the Batch effects to be estimated better, so the cumulative improvement is better even than just doubling the number of samples. You don't want it to perform well?

There are aspects of your study that I don't understand. Why are you stratifying by Cluster? What is happening to the other half of the samples that are not in the top or bottom quarter in terms of response?


Dr. Smyth,

Thank you for still taking the time to answer questions after all you have contributed. The design of the study is this: the clusters correspond to two Kaplan-Meier curves with significantly different median survival times, obtained by k-means clustering of the samples. However, since these are continuous curves, each curve still contains samples that do well and samples that do poorly. I want to know whether there are any DEGs between the extremes of those curves, so I have cut out half the samples. I first considered analysing each curve separately and got no genes after correcting for multiple testing, then I figured I could treat cluster like a batch effect and use

design<-model.matrix(~Group+Cluster+Batch)

Here is the kicker, and why I am unsure whether my experimental design is correct: one cluster is roughly twice the size of the other. Say, for example, I do the DEG analysis on 60 samples. If I use only the extremes of one cluster, I get no significant DEGs; however, if I use both clusters, again with only 60 samples, I get a fair number of DEGs. So I am wondering whether I should not be treating cluster as a batch effect but rather as an additional covariate in the linear model? Thank you for your help.


It makes no difference that one cluster is larger than the other. That causes no problem for any of the analyses.

I am not seeing what the problem is. From what you report, the results seem much as one would expect.

There are a million ways to analyse your data, and the best approach depends on scientific priorities rather than being a statistical or software issue, so I can't advise you on it.

There is nothing wrong in principle with the joint model design <- model.matrix(~Group+Cluster+Batch). It is perfectly OK to include Cluster as a batch effect. The combined analysis does assume that the group effect is similar for both clusters and that samples have similar variability within each cluster, assumptions not made by the separate analyses. You could consider whether these assumptions hold, and the model could be enhanced to handle them if necessary. But you have to make decisions like that by plotting the data rather than by counting DE genes.
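As an illustration of that kind of check, a minimal sketch of an interaction model for asking whether the group effect differs between the two clusters, assuming Group and Cluster are two-level factors and the object names are placeholders:

# Sketch: check whether the Group effect differs between the two clusters by
# adding a Group:Cluster interaction term to the model.
design2 <- model.matrix(~Group * Cluster + Batch)
y2 <- voom(X0, design2, plot = TRUE)
fit2 <- eBayes(lmFit(y2, design2))
int_coef <- grep(":", colnames(design2), value = TRUE)  # name of the interaction column
# Few or no significant genes here is consistent with the simpler additive
# model ~Group + Cluster + Batch; plotMDS(y2) is also useful for eyeballing
# the cluster and batch structure.
topTable(fit2, coef = int_coef, sort.by = "p", number = 20)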

Sure, you can add extra covariates to the model, but that has nothing to do with the difference between the separate and combined analyses.
