Search
Question: Removing continuous covariate effects in limma analysis
4
gravatar for Pauly Lin
2.6 years ago by
Pauly Lin60
University of New South Wales, Australia
Pauly Lin60 wrote:

Dear all, 

I'm performing a differential expression analysis on 12 microarray samples (6 disease vs 6 controls) using limma. My objective is to find genes that are differentially expressed over the main factor "group" (disease vs control). There is also a continuous covariate which can impact on the gene expressions; I'm not interested in the covariate but I need to remove its effects so that it doesn't interfere with my differential expression analysis. I have tried three approaches:

Approach 1: Use removeBatchEffect() to remove the effects of the covariate and then perform differential expression analysis on only the main factor "group" on the output of removeBatchEffect():

x <- removeBatchEffect(data, covariate, design=model.matrix(~group))  
fit <- eBayes(lmFit(x, model.matrix(~group)))
y <- topTable(fit, coef=2, number=Inf)
DEgenes <- y[y$adj.P.Val<0.05,]

This gives me around 1000 differentially expressed genes.

Approach 2: Do a linear model on both covariate and group, and find which genes have significant group coefficients:

fit <- eBayes(lmFit(data, model.matrix(~covariate+group)))
y <- topTable(fit, coef=3, number=Inf)
DEgenes <- y[y$adj.P.Val<0.05,]

This gives me around 100 differentially expressed genes.

Approach 3: Do a linear model on both covariate and group first, and consider the significance of the coefficients of covariate - around 3000 genes have significant covariate coefficients; then eliminate those covariate significant genes and perform differential expression analysis on the remaining genes on only the main factor "group" (disease vs control). This approach gives me around 300 differentially expressed genes. 

My questions are: 1. From reading previous posts, I know that Approach 1 is not appropriate, but I still would like to understand why. If removeBatchEffect() removes the effects of the covariate, shouldn't its output be suitable for a differetial expression analysis on only the main factor "group", and produce a similar result as Approach 2? Why do the two approaches produce different results? What's Approach 1 really doing?

2. Approach 2 performs linear regression on covariate (as well as group) on all the genes, but if covariate has significant coefficients on only some genes, isn't it better if it's not included in the analysis of the remaining genes, as in Approach 3?

Cheers

Paul

 

ADD COMMENTlink modified 9 weeks ago by Gordon Smyth32k • written 2.6 years ago by Pauly Lin60
9
gravatar for Gordon Smyth
2.6 years ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

Approach 1 is effectively fitting the same model as in Approach 2, but in an underhand way. In Approach 1, the lmFit() and eBayes() functions don't know that the effects for the covariate have already been removed, so they think that you have one more residual degree of freedom than you actually have, so they underestimate the error variance. Hence the t-statistics are too big and the p-values are too small. For your experiment, the error variances will be underestimated by a factor of about 9/10.

Approach 2 is the standard method that is generally appropriate.

Approach 3 could be appropriate for some special situations where you are certain that the covariate only affects a subset of genes, and you are not interested in the group effect for any of those genes, and the covariate effect is sufficiently large than you can reliably detect it in a differential expression analysis. This is an unusual situation however.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Gordon Smyth32k

Thanks for your advice, Gordon. I had a discussion with my supervisor, and also showed her the comments on this page. We decided that in our case there is a very clear separation of genes affected by the covariate and those not affected by the covariate, and only a small set of genes are affected by the covariate. Therefore, we decide to perform limma differential expression analysis only on the non-affacted genes and only on the main factor "group"; with regard to those genes affected by the covariate, I'm not sure what to do about them, so I think we will just discard them.

ADD REPLYlink written 2.6 years ago by Pauly Lin60
0
gravatar for Steve Lianoglou
2.6 years ago by
Genentech
Steve Lianoglou12k wrote:

For your first question, I'd start by plotting the logFC (or, perhaps better yet, the t statistics) from approach 1 and 2 against each other to see how "real" these differences are (in your call to topTable, set number=Inf and sort.by="none" to make your life easier)

It might be that that slight changes in your significance thresholds can reasonably explain the majority of the differences -- in which case, you probably don't care "too much" (although 100 vs 1000, perhaps, "sounds" like a lot). This will also help you find which results are vastly different from each other, and therefore warrant more scrutiny.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Steve Lianoglou12k
0
gravatar for Pauly Lin
2.6 years ago by
Pauly Lin60
University of New South Wales, Australia
Pauly Lin60 wrote:

Thanks, everyone. I have investigated a bit - the first and second approaches give very different p-values, but the genes are more or less in the same order, therefore I think the differences are indeed caused by the loss of one residual degree of freedom in the second approach, ie the first approach is wrong because the t-statistics are inflated. 

Gordon, thanks for your comment on Approach 3. I think I will go ahead with Approach 3, because in my case the covariate indeed only affects a subset of genes, and I don't want to waste one residual degree of freedom to estimate the coefficients of the covariate for the remaining non-affected genes. For the subset of genes affected by the covariate, which of the following approaches is the most appropriate?

1. Use the adjusted p-values for "group" obtained from the first linear model (the one on all genes and on both covariate and group) to decide whether any of these genes are differentially expressed over "group";
2. Do another linear model on only this subset of affected genes on both covariate and group and check which genes have significant coefficients for group;
3. I really should discard this subset of genes. 

Cheers

Paul

ADD COMMENTlink modified 2.6 years ago by Ryan C. Thompson6.1k • written 2.6 years ago by Pauly Lin60
1

Generally I would stick with Approach 2, as Gordon recommends. Approach 3 would really only be appropriate if there was a very clear separation between genes affected by the covariate and genes not affected by it, which is very unlikely to be the case. Also, Approach 3 might be prone to compromising false positive control, since you are performing multiple tests for each gene. For instance, if the covariate happened to be strongly correlated with group, then filtering based on the covariate would be nearly equivalent to filtering based on differential expression between groups before testing for differential expression between groups, which is clearly an invalid procedure. If the group and covariate are truly uncorrelated, then Approach 3 might be valid, but I would consult your local statistician about that.

 

ADD REPLYlink written 2.6 years ago by Ryan C. Thompson6.1k

Thanks for your advice, Ryan. I will discuss this with others in my lab. 

Paul

ADD REPLYlink written 2.6 years ago by Pauly Lin60
0
gravatar for Pauly Lin
2.6 years ago by
Pauly Lin60
University of New South Wales, Australia
Pauly Lin60 wrote:

Thanks, all.

Cheers

Paul

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Pauly Lin60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 338 users visited in the last hour