Question

How to quantify or decide whether a covariate should be included in the linear model for differential expression analysis?

heyao ▴ 30

@heyao-14543

Hi everybody,

I am not sure whether this is a naive or unnecessary question, but it has bothered me for a long time, ever since I recently learned how to perform differential expression analysis with tools like limma/edgeR/DESeq.

We all know that mainstream DE analysis is built on the (generalized) linear model framework, and that this framework is flexible enough to correct for effects caused by other covariates such as age and gender. In classical linear model analysis, we can evaluate how well a model represents the data by looking at R-squared and similar statistics, and do model comparison to choose the "best" model for our data.

However, I haven't seen much discussion of this for DE analysis. No matter which model we choose, we end up with a logFC and a p-value for each gene. Tutorials always tell us that we can include age or gender in the linear model, but they say little about how including or excluding more covariates influences the results, or how to quantify that influence so that we can decide which is the "best" model for the data.

Any suggestions or comments would be appreciated. Thanks in advance.

limma edgeR deseq2 • 1.2k views

updated 5.3 years ago by James W. MacDonald 65k • written 5.3 years ago by heyao ▴ 30

score 3 · Answer 1 · 2019-01-03

In general, the model used for high-throughput analyses is over-specified, because you don't know for which genes you will need the extra covariates and for which genes you will not. In conventional linear modeling, where you have just a few outcome variables that you are interested in, you can spend the time to craft the 'best' model for your data (although George Box had a pretty good quote about models, even in such circumstances).

For high-throughput analyses you can't really decide on the 'best' model for your data, because usually you don't have enough degrees of freedom to fit anything but the simplest model. Even if you did have lots of replication, one model might be really good for a particular set of genes but not good at all for another. And with tens of thousands of simultaneous model fits, how would you ever finish the analysis if you were to examine the model fit for each gene?
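
To make the "good for some genes, not for others" point concrete, here is a minimal sketch on simulated data. It is plain per-gene least squares in NumPy, not what limma, edgeR, or DESeq2 actually do internally, and every name and number in it (`nested_f_statistics`, 12 samples, 1000 genes, an age effect on the first 100 genes only) is a made-up assumption for illustration. It fits a reduced design (group only) and a full design (group plus age) to every gene and computes the usual nested-model F statistic for the age term.

```python
import numpy as np


def nested_f_statistics(Y, X_full, X_reduced):
    """Per-gene F statistic comparing a full design to a nested reduced design.

    Y is (n_samples, n_genes); the design matrices are (n_samples, n_coefficients).
    """
    def residual_sum_of_squares(X):
        # lstsq fits all genes (columns of Y) in one call.
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        residuals = Y - X @ beta
        return (residuals ** 2).sum(axis=0)

    n_samples = Y.shape[0]
    p_full, p_reduced = X_full.shape[1], X_reduced.shape[1]
    rss_full = residual_sum_of_squares(X_full)
    rss_reduced = residual_sum_of_squares(X_reduced)
    df_numerator = p_full - p_reduced    # coefficients added by the full model
    df_denominator = n_samples - p_full  # residual degrees of freedom
    return ((rss_reduced - rss_full) / df_numerator) / (rss_full / df_denominator)


# Simulated data (hypothetical): two groups of six samples, 1000 genes.
rng = np.random.default_rng(0)
n_samples, n_genes = 12, 1000
group = np.repeat([0.0, 1.0], n_samples // 2)
age = rng.uniform(20, 70, size=n_samples)
age_scaled = (age - age.mean()) / age.std()

# Assumption for illustration: only the first 100 genes respond to age.
age_effect = np.zeros(n_genes)
age_effect[:100] = 1.5
Y = rng.normal(size=(n_samples, n_genes)) + np.outer(age_scaled, age_effect)

intercept = np.ones(n_samples)
X_reduced = np.column_stack([intercept, group])
X_full = np.column_stack([intercept, group, age_scaled])

F = nested_f_statistics(Y, X_full, X_reduced)
print("median F, age-affected genes:", np.median(F[:100]))
print("median F, remaining genes:   ", np.median(F[100:]))
```

In this toy setup the age term clearly earns its place for the first block of genes and is dead weight for the rest, yet a single design matrix has to serve all of them. That is why the covariate is usually just kept for every gene, at the cost of one residual degree of freedom per gene, rather than chosen gene by gene.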