Question

EdgeR full model vs. reduced model - p-value for covariate

0

Entering edit mode

Lucy ▴ 60

@lucy-17014

Last seen 10 days ago

United Kingdom

Hi,

I am trying to decide which covariates I should include in my model for EdgeR. To do this, I have read that I should compare a full model including the covariate and a reduced model that lacks the covariate using a likelihood ratio test. How would I go about this in EdgeR and what metrics should I use to decide whether to include the covariate (e.g. patient age) or not.

Many thanks!

Lucy

EdgeR differential expression analysis covariates • 2.0k views

ADD COMMENT • link updated 3.8 years ago by Gordon Smyth 50k • written 3.8 years ago by Lucy ▴ 60

score 2 · Answer 1 · 2020-07-07

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

In conventional linear modeling you would do what you say; fit a full and reduced model and then compare the two to see which fits the data better. But when you are planning to fit maybe 20,000 models, it's not so simple, because the covariate may be significant for some subset of the genes, but not the others. In which case what do you do?

You could just fit a possibly over-parameterized model to all the genes, and accept the fact that for some of the genes you are losing degrees of freedom to fit a coefficient that isn't necessary, and doesn't improve the model. This is what people usually do, so long as they have the degrees of freedom to spare.

An alternative would be to fit the models including (for example) age, and then test the age coefficient. If < N genes have a significant age coefficient, where N is some ad hoc number that you choose, then you could say it's not useful for enough genes and omit it.

And these days people use a quasi-likelihood F test with edgeR, rather than the likelihood ratio test, and you should too.

ADD COMMENT • link 3.8 years ago James W. MacDonald 65k

0

Entering edit mode

Great thank you. And how would I find out whether the coefficient for age is significant or not? Is this stored somewhere in the output of glmQLFTest if my model is ~age + disease. I couldn't locate the p-values for the coefficients?

ADD REPLY • link 3.8 years ago Lucy ▴ 60

0

Entering edit mode

Have you read the edgeR User's Guide? There are multiple examples that should answer your question.

ADD REPLY • link 3.8 years ago James W. MacDonald 65k

0

Entering edit mode

Thank you, I don't think I understood correctly when I was reading it. I know how to test for differential expression based on a condition of interest but I always get confused when it comes to the covariates - I couldn't find anything specifically relating to testing the significance of covariates in the EdgeR manual (unless I am missing it?). If I just set my model as ~disease + age, would this provide me with the correct info?

ADD REPLY • link 3.8 years ago Lucy ▴ 60

2

Entering edit mode

In a linear model, a 'condition of interest' and a 'covariate' or 'nuisance variable' are not different. They are all just independent variables that we think might explain something about the observed gene expression. It so happens that age in your model is a nuisance variable, that you think might affect the gene expression, but isn't of interest otherwise. But you test it the same way you test a condition of interest.

Presuming you have two disease states, and age is continuous (rather than a factor, which wouldn't make sense), then your third coefficient captures the slope for age. And using topTags to select that coefficient will give you the genes that are significantly affected by age, just like using topTags to select the disease coefficient tells you the genes that vary between the two disease states.

ADD REPLY • link 3.8 years ago James W. MacDonald 65k

0

Entering edit mode

Thank you, that makes sense - I always struggle to get my head around these things.

ADD REPLY • link 3.8 years ago Lucy ▴ 60

0

Entering edit mode

But do note that your model forces the slope to be the same for both disease states, which might not actually be the case for any number of genes. If you use ~ disease*age you will allow separate intercepts and slopes for the (assumed by me) two disease states, and you would test the disease:age coefficient to see if there is evidence that the gene expression is affected by a combination of disease and age.

ADD REPLY • link 3.8 years ago James W. MacDonald 65k

score 1 · Answer 2 · 2020-07-07

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

We recommend that you use FDRs to assess whether a variable in significant, and it is exactly the same for a covariate as for any other variable. Generally speaking, a covariate should be included if it is highly significant for a large number of genes or can be ignored if it has only moderate significance for a limited number of genes, especially if those genes are not of primary interest to you.

To find out whether the coefficient for age is significant:

design <- model.matrix(~age + disease)
qltest <- glmQLFTest(fit, design, coef="age")
topTags(qltest)

To find the number of genes for which age is significant:

results <- decideTests(qltest)
summary(results)

To find out what is contained in the qltest object:

help("DGELRT-class")

It would also be relevant to examine the size and distribution of the age coefficients visually by

plotMD(qltest, status=results)

ADD COMMENT • link 3.8 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you, this is really helpful. How do you deal with covariates such as sequencing pool that are factors. Say I had 4 different sequencing pools, would I have to do pairwise comparisons between each of these, or how would I test whether pool is having an effect?

ADD REPLY • link 3.8 years ago Lucy ▴ 60

0

Entering edit mode

Follow the documentation for factors and F-tests. Just use coef=... where ... is a vector of all the coefficients for that factor.

BTW, I always use the term "covariate" to mean a continuous variable rather than a factor.

ADD REPLY • link 3.8 years ago Gordon Smyth 50k