Question: Limma and batch effect
1
22 months ago by
lirongrossmann40 wrote:

Hi All,

I have a dataset which contains two batches. I wanted to remove that batch effect and further analyze the data (to perform differential analysis and clustering). I used two approaches:

1. giving the batch and the biological variable of interest to the lmFit:

fit = lmFit(y, design=model.matrix(~ batch+fac))
fit = eBayes(fit)
pvals = fit$p.value[,2] adjplim<-p.adjust(pvals,"BH",length(pvals)) and got more than 1000 genes differentially expressed between the groups defined by fac. 2. y2 <- removeBatchEffect(y,batch, design =model.matrix(~fac)) fit = lmFit(y2, design=model.matrix(~ fac)) fit = eBayes(fit) pvals = fit$p.value[,2]

and got 0 genes differentially expressed between the groups defined by fac.

What am I doing wrong here?

The first approach doesn't give me a way to get a matrix which is clean from batch effect (like y2 in the second approach), so I used the second approach, but I don't know which method to rely on in terms of the differential gene expression.

Thanks!

Liron

modified 22 months ago by Gordon Smyth37k • written 22 months ago by lirongrossmann40

I want to use the genes I get from the differential analysis (with the batch removed) to see if I get clear clusters of the biologic groups I have (defined by fac) after the batch was removed.

Do I have to give the removeBatchEffect() both the batch and the design (model.matrix(~fac)) as arguments?

Does that mean that each time I want to test a hypothesis between different biological groups I need to run removeBatchEffect() with different arguments? (only the batch will be the same).

Thanks a lot for your help!

Liron

4
22 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

I don't think your first approach is doing what you want. Extracting fit\$p.value[,2] will get you the p-values corresponding to the batch effect, rather than the effect of fac. You should be using topTable with coef= to specify the coefficient to be tested (probably the last one).

Now, assuming that you're testing the same thing, the first approach is better, i.e., putting the Batch factor in the design matrix. This is because it allows lmFit to accounts for the uncertainty of estimation for the batch effect. The second approach estimates the batch effect, treats it as known, and then subtracts it from the expression values, after which lmFit has no way of knowing the uncertainty of batch estimation.

P.S. Separate the limma tag, otherwise the maintainers won't see your post.

4
22 months ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

To understand removeBatchEffects() you could look at previous answers remove microarray batch effects using Limma and removeBatchEffect options: design and covariates and Removing continuous covariate effects in limma analysis

However, removeBatchEffects() is never needed for a DE analysis and I suspect it is just a distraction here. I'd much rather than you left removeBatchEffects() aside for the time being and just concentrated on getting the DE analysis correct.

The reason why you have different results from approaches 1 and 2 has little to do with removeBatchEffects(). The real reason (as Aaron has pointed out) is that in approach 1 you tested for the batch effect whereas in approach 2 you tested for the biological groups. The results show very clearly that your data has a strong batch effect but little or no DE between the biological groups.

Have you done some basic exploration of your data using plotMDS() etc? I suspect you will see a strong separation between the two batches. Do you also see any separation between the biological groups?

How many biological groups do you have? Just two, or more than two?

The right way to do the DE analysis is the same as you'll see in the limma User's Guide. Just fit model:

design <- model.matrix(~batch+fac)
fit <- lmFit(y, design)
fit = eBayes(fit)
summary(decideTests(fit))


This will show you how many DE genes there are for each coefficient.

To get a list of DE genes between the second and first biological groups:

topTable(fit, coef=3)


If you have three biological groups you might use

topTable(fit, coef=4)


to get a list of DE genes between the third and first biological groups or

topTable(fit, coef=3:4)


to get a list of genes DE between any of the biological groups (using F-tests).

I am guessing though that you don't actually have any DE genes. If that is so then you will need to go back and revisit the basic data quality rather than proceeding with further downstream analyses.

I took into account Aaron's advice and have changed my code to the following: ( I assume Aaron meant topTable?)

fit = lmFit(y, design=model.matrix(~batch+fac))
fit = eBayes(fit)
topTable(fit,coef=3 )

In my hypothesis, I am testing two biologic groups (refractory to treatment vs responsive to treatment) . I have two datasets that I want to combine- one is coming from strand-specific RNA-seq and the other is unstranded RNA-seq. I decided to treat it as batch effect (strand-specific vs unstranded).  In the first dataset (strand-specific) I have samples which were both responsive and not to treatment (i.e. from the two biologic groups. In the second dataset (the unstranded) I have only samples that were not responsive to the treatment.

So, in the above code y is the merged expression expression (strand-specific and unstranded), batch is a vector of 2 levels (strand or unstrand) and fac is the vector with 2 levels (responsive or non responsive).

Since the strand dataset has both biologic groups the difference between them is not totally masked by the batch effect when I merge it with the unstrand dataset (which contains only the non responsive).

By the way, the output of   summary(decideTests(fit)) is:

(Intercept) batchUnstranded     fac
-1      18             3905       0
0   20842            26304   39769
1   18911             9562       2

Is there something I am missing here?

Thank you both for your time and help!!!!

Liron

1

Thank you for the extra information.

The fact that the unstranded dataset contains only one biological group (not responsive) means that it cannot contribute any information to finding DE genes between responsive and not responsive. So there is no value in combining the two datasets and no need to do a batch correction.

Just analyse the stranded dataset. It contains all the relevant information.

Yep, I ran it again with the topTable(fit, coef=3). I initially took topTable(fit,adjust.method= "BH") which was a mistake. My updated comment is the one I ran the last taking into account the code you provided in your comment.

Thanks! My thought to add the unstranded data was to increase the power of my test, so I will have more samples, but I guess I will lose more if I try to correct for the batch....

Hi!

Thanks again for you help.

1. In the strand-specific dataset the number of non-responsive  to treatment samples is 9 and the responsive sample is 197. I tried to cluster the dataset to see if the two groups cluster based on the DE genes I found and there is some clustering. I suspect that the 197 samples are very heterogenous (I know they from the clinical perspective). Would matching the 9 non responsive samples to 9 responsive samples (in terms of age, gender, specific protocol of treatment, etc...) be the best way to account for that? I know I am going to loose a lot of power reducing my dataset to 18 samples, so I was not sure what would be the best way.....

Update: I tried to give limma a matrix model that contains all the variables that maybe confounding in addition to the variable of the group of interest so that limma can adjust for the other variables. Using the entire dataset (9+197), I got 124 DE genes between the responsive and non responsive groups (with the other variables adjusted), but when I cluster the dataset I still don't see clear separation between the two groups (the 9 and the 197). What can be an explanation to that?

2. If I want to take the DE genes I find with the strand-specific dataset and use them to guide an analysis on the unstranded dataset (where I have many more  non responsive to treatment samples including pre and post treatment samples), do I need to make any corrections since the RNA-seq was performed using different technologies?

(if you want me to add more tags to those questions I would be happy to)

Liron