Question

Removing continuous covariate effects in limma analysis

9

Entering edit mode

Pauly Lin ▴ 160

@pauly-lin-7537

Last seen 9.5 years ago

University of New South Wales, Australia

Dear all,

I'm performing a differential expression analysis on 12 microarray samples (6 disease vs 6 controls) using limma. My objective is to find genes that are differentially expressed over the main factor "group" (disease vs control). There is also a continuous covariate which can impact on the gene expressions; I'm not interested in the covariate but I need to remove its effects so that it doesn't interfere with my differential expression analysis. I have tried three approaches:

Approach 1: Use removeBatchEffect() to remove the effects of the covariate and then perform differential expression analysis on only the main factor "group" on the output of removeBatchEffect():

x <- removeBatchEffect(data, covariate, design=model.matrix(~group))  
fit <- eBayes(lmFit(x, model.matrix(~group)))
y <- topTable(fit, coef=2, number=Inf)
DEgenes <- y[y$adj.P.Val<0.05,]

This gives me around 1000 differentially expressed genes.

Approach 2: Do a linear model on both covariate and group, and find which genes have significant group coefficients:

fit <- eBayes(lmFit(data, model.matrix(~covariate+group)))
y <- topTable(fit, coef=3, number=Inf)
DEgenes <- y[y$adj.P.Val<0.05,]

This gives me around 100 differentially expressed genes.

Approach 3: Do a linear model on both covariate and group first, and consider the significance of the coefficients of covariate - around 3000 genes have significant covariate coefficients; then eliminate those covariate significant genes and perform differential expression analysis on the remaining genes on only the main factor "group" (disease vs control). This approach gives me around 300 differentially expressed genes.

My questions are: 1. From reading previous posts, I know that Approach 1 is not appropriate, but I still would like to understand why. If removeBatchEffect() removes the effects of the covariate, shouldn't its output be suitable for a differetial expression analysis on only the main factor "group", and produce a similar result as Approach 2? Why do the two approaches produce different results? What's Approach 1 really doing?

2. Approach 2 performs linear regression on covariate (as well as group) on all the genes, but if covariate has significant coefficients on only some genes, isn't it better if it's not included in the analysis of the remaining genes, as in Approach 3?

Cheers

Paul

limma microarray • 13k views

ADD COMMENT • link updated 7.6 years ago by Gordon Smyth 52k • written 10.0 years ago by Pauly Lin ▴ 160

0

Entering edit mode

Thanks, all.

Cheers

Paul

ADD REPLY • link 10.0 years ago Pauly Lin ▴ 160

0

Entering edit mode

For the approach2, data is obtained by :

data <- voom(count, design, plot=T) ?

ADD REPLY • link 7.2 years ago picasa1983 • 0

1

Entering edit mode

Pauly Lin ▴ 160

@pauly-lin-7537

Last seen 9.5 years ago

University of New South Wales, Australia

Thanks, everyone. I have investigated a bit - the first and second approaches give very different p-values, but the genes are more or less in the same order, therefore I think the differences are indeed caused by the loss of one residual degree of freedom in the second approach, ie the first approach is wrong because the t-statistics are inflated.

Gordon, thanks for your comment on Approach 3. I think I will go ahead with Approach 3, because in my case the covariate indeed only affects a subset of genes, and I don't want to waste one residual degree of freedom to estimate the coefficients of the covariate for the remaining non-affected genes. For the subset of genes affected by the covariate, which of the following approaches is the most appropriate?

1. Use the adjusted p-values for "group" obtained from the first linear model (the one on all genes and on both covariate and group) to decide whether any of these genes are differentially expressed over "group";
2. Do another linear model on only this subset of affected genes on both covariate and group and check which genes have significant coefficients for group;
3. I really should discard this subset of genes.

Cheers

Paul

ADD COMMENT • link updated 10.0 years ago by Ryan C. Thompson ★ 7.9k • written 10.0 years ago by Pauly Lin ▴ 160

2

Entering edit mode

Generally I would stick with Approach 2, as Gordon recommends. Approach 3 would really only be appropriate if there was a very clear separation between genes affected by the covariate and genes not affected by it, which is very unlikely to be the case. Also, Approach 3 might be prone to compromising false positive control, since you are performing multiple tests for each gene. For instance, if the covariate happened to be strongly correlated with group, then filtering based on the covariate would be nearly equivalent to filtering based on differential expression between groups before testing for differential expression between groups, which is clearly an invalid procedure. If the group and covariate are truly uncorrelated, then Approach 3 might be valid, but I would consult your local statistician about that.

ADD REPLY • link 10.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thanks for your advice, Ryan. I will discuss this with others in my lab.

Paul

ADD REPLY • link 10.0 years ago Pauly Lin ▴ 160

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 6 days ago

United States

For your first question, I'd start by plotting the logFC (or, perhaps better yet, the t statistics) from approach 1 and 2 against each other to see how "real" these differences are (in your call to topTable, set number=Inf and sort.by="none" to make your life easier)

It might be that that slight changes in your significance thresholds can reasonably explain the majority of the differences -- in which case, you probably don't care "too much" (although 100 vs 1000, perhaps, "sounds" like a lot). This will also help you find which results are vastly different from each other, and therefore warrant more scrutiny.

ADD COMMENT • link 10.0 years ago Steve Lianoglou ★ 13k

score 16 · Accepted Answer · 2015-04-01

Approach 1 is effectively fitting the same model as in Approach 2, but in an underhand way. In Approach 1, the lmFit() and eBayes() functions don't know that the effects for the covariate have already been removed, so they think that you have one more residual degree of freedom than you actually have, so they underestimate the error variance. Hence the t-statistics are too big and the p-values are too small. For your experiment, the error variances will be underestimated by a factor of about 9/10.

Approach 2 is the standard method that is generally appropriate.

Approach 3 could be appropriate for some special situations where you are certain that the covariate only affects a subset of genes, and you are not interested in the group effect for any of those genes, and the covariate effect is sufficiently large than you can reliably detect it in a differential expression analysis. This is an unusual situation however.