Question

Limma and batch effect

2

Entering edit mode

lirongrossmann ▴ 80

@lirongrossmann-13938

Last seen 4.3 years ago

Hi All,

I have a dataset which contains two batches. I wanted to remove that batch effect and further analyze the data (to perform differential analysis and clustering). I used two approaches:

1. giving the batch and the biological variable of interest to the lmFit:

fit = lmFit(y, design=model.matrix(~ batch+fac))
fit = eBayes(fit)
pvals = fit$p.value[,2]
adjplim<-p.adjust(pvals,"BH",length(pvals))

and got more than 1000 genes differentially expressed between the groups defined by fac.

2.

y2 <- removeBatchEffect(y,batch, design =model.matrix(~fac))

fit = lmFit(y2, design=model.matrix(~ fac))

fit = eBayes(fit)

pvals = fit$p.value[,2]
adjplim<-p.adjust(pvals,"BH",length(pvals))

and got 0 genes differentially expressed between the groups defined by fac.

What am I doing wrong here?

The first approach doesn't give me a way to get a matrix which is clean from batch effect (like y2 in the second approach), so I used the second approach, but I don't know which method to rely on in terms of the differential gene expression.

Thanks!

Liron

limma differential gene expression clustering removebatcheffect • 12k views

ADD COMMENT • link updated 7.4 years ago by Gordon Smyth 52k • written 7.4 years ago by lirongrossmann ▴ 80

0

Entering edit mode

Thank you for your answers!

I want to use the genes I get from the differential analysis (with the batch removed) to see if I get clear clusters of the biologic groups I have (defined by fac) after the batch was removed.

Do I have to give the removeBatchEffect() both the batch and the design (model.matrix(~fac)) as arguments?

Does that mean that each time I want to test a hypothesis between different biological groups I need to run removeBatchEffect() with different arguments? (only the batch will be the same).

Thanks a lot for your help!

Liron

ADD REPLY • link 7.4 years ago lirongrossmann ▴ 80

score 4 · Answer 1 · 2017-09-11

I don't think your first approach is doing what you want. Extracting fit$p.value[,2] will get you the p-values corresponding to the batch effect, rather than the effect of fac. You should be using topTable with coef= to specify the coefficient to be tested (probably the last one).

Now, assuming that you're testing the same thing, the first approach is better, i.e., putting the Batch factor in the design matrix. This is because it allows lmFit to accounts for the uncertainty of estimation for the batch effect. The second approach estimates the batch effect, treats it as known, and then subtracts it from the expression values, after which lmFit has no way of knowing the uncertainty of batch estimation.

P.S. Separate the limma tag, otherwise the maintainers won't see your post.

Gordon Smyth · Answer 2 · 2017-09-11

4

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 42 minutes ago

WEHI, Melbourne, Australia

To understand removeBatchEffects() you could look at previous answers remove microarray batch effects using Limma and removeBatchEffect options: design and covariates and Removing continuous covariate effects in limma analysis

However, removeBatchEffects() is never needed for a DE analysis and I suspect it is just a distraction here. I'd much rather than you left removeBatchEffects() aside for the time being and just concentrated on getting the DE analysis correct.

The reason why you have different results from approaches 1 and 2 has little to do with removeBatchEffects(). The real reason (as Aaron has pointed out) is that in approach 1 you tested for the batch effect whereas in approach 2 you tested for the biological groups. The results show very clearly that your data has a strong batch effect but little or no DE between the biological groups.

Have you done some basic exploration of your data using plotMDS() etc? I suspect you will see a strong separation between the two batches. Do you also see any separation between the biological groups?

How many biological groups do you have? Just two, or more than two?

The right way to do the DE analysis is the same as you'll see in the limma User's Guide. Just fit model:

design <- model.matrix(~batch+fac)
fit <- lmFit(y, design)
fit = eBayes(fit)
summary(decideTests(fit))

This will show you how many DE genes there are for each coefficient.

To get a list of DE genes between the second and first biological groups:

topTable(fit, coef=3)

If you have three biological groups you might use

topTable(fit, coef=4)

to get a list of DE genes between the third and first biological groups or

topTable(fit, coef=3:4)

to get a list of genes DE between any of the biological groups (using F-tests).

I am guessing though that you don't actually have any DE genes. If that is so then you will need to go back and revisit the basic data quality rather than proceeding with further downstream analyses.

ADD COMMENT • link 7.4 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you for your explanations!!

I took into account Aaron's advice and have changed my code to the following: ( I assume Aaron meant topTable?)

fit = lmFit(y, design=model.matrix(~batch+fac))
fit = eBayes(fit)
topTable(fit,coef=3 )

In my hypothesis, I am testing two biologic groups (refractory to treatment vs responsive to treatment) . I have two datasets that I want to combine- one is coming from strand-specific RNA-seq and the other is unstranded RNA-seq. I decided to treat it as batch effect (strand-specific vs unstranded). In the first dataset (strand-specific) I have samples which were both responsive and not to treatment (i.e. from the two biologic groups. In the second dataset (the unstranded) I have only samples that were not responsive to the treatment.

So, in the above code y is the merged expression expression (strand-specific and unstranded), batch is a vector of 2 levels (strand or unstrand) and fac is the vector with 2 levels (responsive or non responsive).

Since the strand dataset has both biologic groups the difference between them is not totally masked by the batch effect when I merge it with the unstrand dataset (which contains only the non responsive).

By the way, the output of summary(decideTests(fit)) is:

(Intercept) batchUnstranded     fac
-1      18             3905       0
 0   20842            26304   39769
 1   18911             9562       2

Is there something I am missing here?

Thank you both for your time and help!!!!

Liron

ADD REPLY • link updated 7.4 years ago by Gordon Smyth 52k • written 7.4 years ago by lirongrossmann ▴ 80

1

Entering edit mode

Thank you for the extra information.

The fact that the unstranded dataset contains only one biological group (not responsive) means that it cannot contribute any information to finding DE genes between responsive and not responsive. So there is no value in combining the two datasets and no need to do a batch correction.

Just analyse the stranded dataset. It contains all the relevant information.

ADD REPLY • link 7.4 years ago Gordon Smyth 52k

0

Entering edit mode

Yep, I ran it again with the topTable(fit, coef=3). I initially took topTable(fit,adjust.method= "BH") which was a mistake. My updated comment is the one I ran the last taking into account the code you provided in your comment.

ADD REPLY • link 7.4 years ago lirongrossmann ▴ 80

0

Entering edit mode

Thanks! My thought to add the unstranded data was to increase the power of my test, so I will have more samples, but I guess I will lose more if I try to correct for the batch....

ADD REPLY • link 7.4 years ago lirongrossmann ▴ 80

0

Entering edit mode

Hi!

Thanks again for you help.

I took your advice of working with the strand-specific dataset, and have two more questions please:

1. In the strand-specific dataset the number of non-responsive to treatment samples is 9 and the responsive sample is 197. I tried to cluster the dataset to see if the two groups cluster based on the DE genes I found and there is some clustering. I suspect that the 197 samples are very heterogenous (I know they from the clinical perspective). Would matching the 9 non responsive samples to 9 responsive samples (in terms of age, gender, specific protocol of treatment, etc...) be the best way to account for that? I know I am going to loose a lot of power reducing my dataset to 18 samples, so I was not sure what would be the best way.....

Update: I tried to give limma a matrix model that contains all the variables that maybe confounding in addition to the variable of the group of interest so that limma can adjust for the other variables. Using the entire dataset (9+197), I got 124 DE genes between the responsive and non responsive groups (with the other variables adjusted), but when I cluster the dataset I still don't see clear separation between the two groups (the 9 and the 197). What can be an explanation to that?

2. If I want to take the DE genes I find with the strand-specific dataset and use them to guide an analysis on the unstranded dataset (where I have many more non responsive to treatment samples including pre and post treatment samples), do I need to make any corrections since the RNA-seq was performed using different technologies?

Thanks for all your help!!!

(if you want me to add more tags to those questions I would be happy to)

Liron

ADD REPLY • link 7.4 years ago lirongrossmann ▴ 80