Question: How to quantify or decide whether a covariate should be included in the linear model for differential expression analysis?
10 weeks ago by
heyao30 wrote:

Hi everybody,

I am not sure whether this is a stupid or unnecessary question, but it has been bothering me for a long time, ever since I recently learned how to perform differential expression analysis using tools like limma/edgeR/DESeq.

We all know that mainstream DE analysis is built on the (generalized) linear model framework, and that this framework is flexible enough to correct for effects caused by other covariates such as age and gender. In classic linear model analysis, we can evaluate how well a model represents the data by looking at R-squared etc., and do model comparison to choose the "best" model for our data.

However, I haven't seen much discussion of this for DE analysis. No matter which model we choose, we still get a logFC and p-value for each gene in the end. Tutorials always tell us we can include age or gender in the linear model, but they say little about how including or excluding covariates influences the results, or how to quantify that influence so that we can decide which is the "best" model for our data.

Any suggestions or comments would be appreciated. Thanks in advance.

limma edger deseq2
Answer: How to quantify or decide whether a covariate should be included in the linear m
10 weeks ago by
United States
James W. MacDonald49k wrote:

In general, the model used for high-throughput analyses is over-specified, because you don't know for which genes you will need the extra covariates and for which genes you will not. In conventional linear modeling, where you have just a few outcome variables that you are interested in, you can spend the time to craft the 'best' model for your data (although George Box had a pretty good quote about models, even in such circumstances).

For high-throughput analyses you can't really decide on the 'best' model for your data, because usually you don't have enough degrees of freedom to fit much beyond the simplest model, and even if you did have lots of replication, one model might be really good for a particular set of genes, but not good at all for another. And with tens of thousands of simultaneous model fits, how would you ever finish the analysis if you were to examine the model fit for each gene?
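
To make the "good for some genes, not for others" point concrete, here is a minimal sketch of a per-gene nested-model comparison. It is plain Python with ordinary least squares, not the limma/edgeR/DESeq2 interface, and everything in it is an assumption made up for illustration: the two toy genes, the sample values, the helper names (`solve`, `rss`, `covariate_f`) and the hard-coded critical value. For each gene it fits `~ group` and `~ group + age` and computes the F statistic for the extra `age` term.

```python
# Illustrative sketch only: toy data and hypothetical helpers, not a DE package API.

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def rss(X, y):
    """Residual sum of squares after a least-squares fit of y on the columns of X."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def covariate_f(reduced, full, y):
    """F statistic (and residual df) for the extra columns of `full` over nested `reduced`."""
    df_extra = len(full[0]) - len(reduced[0])
    df_resid = len(y) - len(full[0])  # samples minus coefficients in the full model
    rss_red, rss_full = rss(reduced, y), rss(full, y)
    return ((rss_red - rss_full) / df_extra) / (rss_full / df_resid), df_resid


# Toy experiment: 12 samples, two groups, one continuous covariate (age).
group = [0] * 6 + [1] * 6
age = [25, 40, 31, 58, 47, 36, 29, 52, 44, 33, 61, 38]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, -0.05, 0.2, -0.15, 0.1, -0.1, 0.0]

genes = {
    "age_driven": [5 + g + 0.05 * a + e for g, a, e in zip(group, age, noise)],
    "group_only": [5 + g + e for g, e in zip(group, noise)],
}

reduced = [[1.0, g] for g in group]               # ~ group
full = [[1.0, g, a] for g, a in zip(group, age)]  # ~ group + age

F_CRIT = 5.117  # approximate 5% critical value of F(1, 9); hard-coded assumption

results = {}
for name, y in genes.items():
    f_stat, df_resid = covariate_f(reduced, full, y)
    results[name] = f_stat
    print(f"{name}: F = {f_stat:.2f} on (1, {df_resid}) df; "
          f"age term supported: {f_stat > F_CRIT}")
```

In this toy setup the age term is clearly supported for one gene and not for the other, which is the situation described above: a single design has to serve both. Note also that the age column costs one residual degree of freedom for every gene, whether or not that gene needs it. The DE packages have their own machinery for testing coefficients, so treat this only as a picture of the idea.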
