I'm performing a differential expression analysis on 12 microarray samples (6 disease vs 6 controls) using limma. My objective is to find genes that are differentially expressed over the main factor "group" (disease vs control). There is also a continuous covariate which can impact on the gene expressions; I'm not interested in the covariate but I need to remove its effects so that it doesn't interfere with my differential expression analysis. I have tried three approaches:
Approach 1: Use removeBatchEffect() to remove the effects of the covariate and then perform differential expression analysis on only the main factor "group" on the output of removeBatchEffect():
x <- removeBatchEffect(data, covariate, design=model.matrix(~group)) fit <- eBayes(lmFit(x, model.matrix(~group))) y <- topTable(fit, coef=2, number=Inf) DEgenes <- y[y$adj.P.Val<0.05,]
This gives me around 1000 differentially expressed genes.
Approach 2: Do a linear model on both covariate and group, and find which genes have significant group coefficients:
fit <- eBayes(lmFit(data, model.matrix(~covariate+group))) y <- topTable(fit, coef=3, number=Inf) DEgenes <- y[y$adj.P.Val<0.05,]
This gives me around 100 differentially expressed genes.
Approach 3: Do a linear model on both covariate and group first, and consider the significance of the coefficients of covariate - around 3000 genes have significant covariate coefficients; then eliminate those covariate significant genes and perform differential expression analysis on the remaining genes on only the main factor "group" (disease vs control). This approach gives me around 300 differentially expressed genes.
My questions are: 1. From reading previous posts, I know that Approach 1 is not appropriate, but I still would like to understand why. If removeBatchEffect() removes the effects of the covariate, shouldn't its output be suitable for a differetial expression analysis on only the main factor "group", and produce a similar result as Approach 2? Why do the two approaches produce different results? What's Approach 1 really doing?
2. Approach 2 performs linear regression on covariate (as well as group) on all the genes, but if covariate has significant coefficients on only some genes, isn't it better if it's not included in the analysis of the remaining genes, as in Approach 3?