Question

Variance explained (coefficient of determination) in edgeR

Gemma • 0

@2509afca

Austria

I would like to compute the variance explained (i.e. the coefficient of determination, R2) by a model in edgeR. Concretely, I am modelling various gene expression phenotypes using glmFit and determining the significance of a few predictors using glmLRT.

Could you please indicate how to compute R2 from the output of these models?

Many thanks in advance!

edgeR • 1.2k views

ADD COMMENT • link 23 months ago Gemma • 0

score 2 · Accepted Answer · 2022-05-11

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

The concept of "variance explained" does not apply to generalized linear models, but you can compute the proportion of deviance explained.

Create a null design matrix

design.null <- matrix(1, nsamples, 1)

where nsamples is the number of samples in your analysis.

Run glmFit with design.null and with your full design matrix design to get fitted model objects fit.null and fit. Then

R2 <- ( fit.null$deviance - fit$deviance ) / fit.null$deviance
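As a minimal sketch of the steps above, the following runs the whole calculation on simulated counts. The simulated data, the two-group design, and the use of estimateDisp to obtain dispersions are assumptions made for illustration only and are not part of the original answer; both fits share the dispersions stored in y.

```r
library(edgeR)

set.seed(1)
nsamples <- 12   # hypothetical number of samples
ngenes <- 200    # hypothetical number of genes

# Simulated negative binomial counts with no real group effect (illustration only)
counts <- matrix(rnbinom(ngenes * nsamples, mu = 50, size = 10), ngenes, nsamples)
group <- factor(rep(c("A", "B"), each = nsamples / 2))

y <- DGEList(counts = counts)
y <- calcNormFactors(y)

# Full design and intercept-only null design
design <- model.matrix(~group)
design.null <- matrix(1, nsamples, 1)

# Estimate dispersions once so that both fits use the same values
y <- estimateDisp(y, design)

fit <- glmFit(y, design)
fit.null <- glmFit(y, design.null)

# Proportion of deviance explained, one value per gene
R2 <- (fit.null$deviance - fit$deviance) / fit.null$deviance
summary(R2)
```

With real data, replace the simulated counts and design with your own; the last three statements are the part taken from the answer.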

ADD COMMENT • link 23 months ago Gordon Smyth 50k

Many thanks, that seems to work!

However, the estimates tend to be surprisingly high, for many genes R2 being close to 1 (quartiles being 0%: 1.879e-13; 25%: 0.0152; 50%: 0.1009; 75%: 0.5773; 100%: 0.9997). Is this distribution as expected?

I would also like to ask a follow-up question regarding this other post: Variance explained (coefficient of determination) in glmFit / glmLRT

I am actually fitting a model with three predictors and trying to compute the proportion of deviance explained by each of the predictors. I understand that this relies on making a choice on the order in which the predictors are included in the model (which others we are correcting for). A conservative and consistent way of computing R2_pred1 might be by subtracting R2 computed considering only pred2 and pred3 from that of the full model (considering all three predictors). Something like this:

design.null <- matrix(1, nsamples, 1)
design.full <- model.matrix(~x1+x2+x3)
design.nox1 <- model.matrix(~x2+x3)
design.nox2 <- model.matrix(~x1+x3)
design.nox3 <- model.matrix(~x1+x2)

fit.null <- glmFit(y, design.null)
fit.full <- glmFit(y, design.full)
fit.nox1 <- glmFit(y, design.nox1)
fit.nox2 <- glmFit(y, design.nox2)
fit.nox3 <- glmFit(y, design.nox3)

R2_full <- ( fit.null$deviance - fit.full$deviance ) / fit.null$deviance
R2_nox1 <- ( fit.null$deviance - fit.nox1$deviance ) / fit.null$deviance
R2_nox2 <- ( fit.null$deviance - fit.nox2$deviance ) / fit.null$deviance
R2_nox3 <- ( fit.null$deviance - fit.nox3$deviance ) / fit.null$deviance

R2_x1 <- R2_full - R2_nox1
R2_x2 <- R2_full - R2_nox2
R2_x3 <- R2_full - R2_nox3

Would this be correct?

Many thanks again!

ADD REPLY • link 23 months ago Gemma • 0

The R2 values look completely normal. The median R2 is 10%, which seems somewhat low rather than high. With so many genes, you will naturally get some R2 over the whole range from 0 to 1, just by chance variation, which is what you see.

Even if none of the genes are differentially expressed and the data was just random, you would still expect to get R2 values around 3 / (nsamples - 1) on average.
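One way to see where a figure like 3 / (nsamples - 1) comes from is the following rough sketch. It assumes an ordinary normal linear model rather than the negative binomial GLM that edgeR fits, so it is only an approximation in this setting. The symbols p (number of predictors, here 3) and n (nsamples) are introduced for illustration. With no true effects, the classical R2 follows a beta distribution:

```latex
R^2 \sim \mathrm{Beta}\!\left(\frac{p}{2},\ \frac{n-p-1}{2}\right),
\qquad
\mathrm{E}\!\left[R^2\right]
  = \frac{p/2}{p/2 + (n-p-1)/2}
  = \frac{p}{n-1}
  = \frac{3}{n-1}.
```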

Regarding the predictor specific R2, I don't know what you're trying to do. Your R2_nox1 is the proportion of the deviance that x1 contributes over and above nox2 and nox3, but I don't know why you are computing R2_x1 etc. There is no right or wrong here. You're just computing descriptive statistics.

ADD REPLY • link 23 months ago Gordon Smyth 50k

Many thanks for your answer.

To clarify the last point, R2_nox1 is the proportion of the deviance explained when considering design.nox1 = model.matrix(~x2+x3), i.e. without x1. So to compute the deviance contributed by x1, I am subtracting R2_nox1 from the proportion of deviance explained when considering all the predictors (R2_full): R2_x1 = R2_full - R2_nox1.

The idea is to compute the deviance contributed only by x1, by removing the contributions of x2 and x3 from the total explained deviance. Is this one possible way of computing the deviance contributed by one variable when accounting for that contributed by the rest? Otherwise, could you please propose an alternative?

Thanks again!

ADD REPLY • link 23 months ago Gemma • 0

No, you have it the wrong way around.

If you want deviance explained by x1, then you need

# deviance without x1 in the model minus deviance with x1 in the model
R2_x1 <- ( fit.nox1$deviance - fit.full$deviance ) / fit.null$deviance

You do not need to subtract one R2 from another.

ADD REPLY • link 23 months ago Gordon Smyth 50k

Both strategies actually lead to the same results. Resolved, thanks!
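The agreement follows from simple algebra, sketched here with D standing for the per-gene deviance of each fit (the D notation is introduced for illustration; the subscripts match the fit names used earlier in the thread). The null deviance cancels in the numerator when one R2 is subtracted from the other:

```latex
\begin{aligned}
R^2_{\text{full}} - R^2_{\text{nox1}}
  &= \frac{D_{\text{null}} - D_{\text{full}}}{D_{\text{null}}}
   - \frac{D_{\text{null}} - D_{\text{nox1}}}{D_{\text{null}}} \\
  &= \frac{D_{\text{nox1}} - D_{\text{full}}}{D_{\text{null}}}
\end{aligned}
```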

ADD REPLY • link 23 months ago Gemma • 0