Question

EdgeR: retrieve Log-Likelihood from glmFit

cedric.gobet • 0

@cedricgobet-9745

Hello,

I would like to compute BIC or DIC values for a model selection approach. Unfortunately, I can't retrieve the log-likelihood (LL) from my glmFit fit. Is there a way to derive it from the deviance, given that Residual Deviance = 2(LL(Saturated Model) - LL(Proposed Model)) on df? If so, I would also need the LL of the saturated model, which I can't compute...

Best

Cédric

edger bic mode selection • 1.4k views

ADD COMMENT • link 9.9 years ago cedric.gobet • 0

score 0 · Answer 1 · 2016-02-18

Does it really matter that you can't get the exact log-likelihood? From what I understand, you interpret the change in the BIC between proposed models to decide which one to use. As the log-likelihood of the saturated model does not depend on the proposed model, you should get the same change in the BIC regardless of whether you plug in the log-likelihood of the proposed model or substitute the deviance for -2 times that log-likelihood in the BIC expression.
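To spell out the cancellation, here is a short derivation. The symbols are my own notation rather than anything edgeR returns: \ell_m is the log-likelihood of proposed model m, \ell_sat that of the saturated model, D_m the residual deviance, k_m the number of coefficients and n the number of samples. It assumes \ell_sat is the same for every model being compared (same gene, same counts, same dispersion).

```latex
\begin{align*}
D_m &= 2\,(\ell_{\mathrm{sat}} - \ell_m)
  \quad\Longrightarrow\quad -2\,\ell_m = D_m - 2\,\ell_{\mathrm{sat}} \\
\mathrm{BIC}_m &= k_m \ln n - 2\,\ell_m
  = D_m + k_m \ln n - 2\,\ell_{\mathrm{sat}} \\
\mathrm{BIC}_1 - \mathrm{BIC}_2 &= (D_1 - D_2) + (k_1 - k_2)\ln n
\end{align*}
```

The saturated term drops out of the difference, so ranking models by D_m + k_m ln n gives the same ordering as ranking by the exact BIC.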

That being said, I wonder whether you need to do formal model selection at all. The various information criteria (ICs) are geared towards choosing the best-performing model for prediction (e.g., to predict the expression of each gene in a new sample based on its combination of factors). However, when we use edgeR for data analysis, we are typically trying to explain our results in terms of various experimental factors. To this end, the simplest model that contains all the factors of interest should suffice. In contrast, model selection via the ICs tends to produce larger models that do well at prediction but not at explanation, as their coefficients are not easily interpreted.

(On occasion, additional blocking factors may be required, e.g., to account for batch effects, but you can just try fitting the model with and without them and see if you get a substantially greater number of DE genes. If either model gives good results, i.e., decent numbers of DE genes, then you don't need to be too worried about selecting one or the other. It doesn't have to be "right" as long as it's useful.)

score 0 · Answer 2 · 2016-02-19

cedric.gobet • 0

@cedricgobet-9745

You are right concerning the BIC computation: the log-likelihood of the saturated model is just a constant that does not depend on the proposed model, so I can plug in the deviance directly.

Concerning the model selection, the idea is a little bit different: the goal is to "cluster" genes together. In my case, I have five conditions and I would like to group genes that "behave" in the same way. I generate all the possible combinations of factors (additive) and their respective design matrices (from the null model to the saturated one), leading to 52 "models" (i.e., the Bell number for five conditions). Finally, each gene is associated with one "model", and further enrichment analyses (e.g., GO terms, TFs) are done in each cluster. Confidence in the chosen model is assessed by computing Schwarz weights.
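As a rough illustration of the counting and weighting described above, here is a minimal Python sketch. It assumes one reading of the setup, namely that each "model" is a partition of the five conditions into groups sharing a mean (one block for the null model, five singleton blocks for the saturated one). The condition labels, the function names and the BIC values are all made up for the example; Schwarz weights are taken as exp(-ΔBIC/2) normalised to sum to one.

```python
import math


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def schwarz_weights(bic_values):
    """Convert BIC values into weights proportional to exp(-delta_BIC / 2)."""
    best = min(bic_values)
    relative = [math.exp(-0.5 * (bic - best)) for bic in bic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical condition labels; only their number (five) comes from the post.
conditions = ["c1", "c2", "c3", "c4", "c5"]
models = list(set_partitions(conditions))
print(len(models))  # number of candidate models for five conditions

# Made-up BIC values for three candidate models of a single gene.
example_bic = [100.0, 102.0, 110.0]
weights = schwarz_weights(example_bic)
print([round(w, 3) for w in weights])
```

Because only differences in BIC enter the weights, the same weights come out if the deviance-based criterion is used in place of the exact BIC.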

Best,

Cédric

ADD COMMENT • link 9.9 years ago cedric.gobet • 0

Hmm... well, each group of genes will have the same "optimal model", but that doesn't guarantee that they'll behave the same way. For example, in one group associated with a particular model, some genes might have a large positive coefficient (e.g., for a log-fold change term) while others would have a large negative coefficient. Clearly those genes are not behaving in a similar way across samples - indeed, they're behaving in the opposite way - but you'll still end up putting them in the same group. If you extend this issue to multiple coefficients, then the nature of the groups will become quite difficult to interpret.
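To make the point concrete, here is a toy Python sketch with three conditions and two invented genes. It is not edgeR: a Gaussian BIC on made-up log-expression values stands in for the negative binomial GLM, and every name and number is hypothetical. Both genes end up with the same "optimal" grouping even though one goes up in condition C and the other goes down.

```python
import math


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition


def gaussian_bic(values, labels, partition):
    """BIC of a model with one mean per block (Gaussian stand-in, not NB)."""
    n = len(values)
    rss = 0.0
    for block in partition:
        ys = [y for y, lab in zip(values, labels) if lab in block]
        mean = sum(ys) / len(ys)
        rss += sum((y - mean) ** 2 for y in ys)
    return n * math.log(rss / n) + len(partition) * math.log(n)


def best_partition(values, labels):
    """Return the lowest-BIC partition in a canonical (sorted) form."""
    conditions = sorted(set(labels))
    best = min(set_partitions(conditions),
               key=lambda p: gaussian_bic(values, labels, p))
    return sorted(sorted(block) for block in best)


def group_mean(values, labels, block):
    """Mean of the values whose condition label is in `block`."""
    ys = [y for y, lab in zip(values, labels) if lab in block]
    return sum(ys) / len(ys)


labels = ["A", "A", "B", "B", "C", "C"]     # two replicates per condition
gene_up = [1.0, 1.1, 1.1, 1.0, 3.0, 3.1]    # invented: higher in C
gene_down = [3.0, 3.1, 3.1, 3.0, 1.0, 1.1]  # invented: lower in C

best_up = best_partition(gene_up, labels)
best_down = best_partition(gene_down, labels)
effect_up = group_mean(gene_up, labels, ["C"]) - group_mean(gene_up, labels, ["A", "B"])
effect_down = group_mean(gene_down, labels, ["C"]) - group_mean(gene_down, labels, ["A", "B"])

print(best_up, best_down)      # selected grouping for each gene
print(effect_up, effect_down)  # C-versus-rest difference for each gene
```

Grouping genes by selected model alone would put these two in the same cluster, so the sign (and size) of the coefficients would need to be examined separately.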

ADD REPLY • link 9.9 years ago Aaron Lun ★ 29k