DESeq LRT reduced model
Entering edit mode
bio_inc • 0
Last seen 18 months ago


I'm currently exploring the functions in DESeq2 and try to replicate some findings of DOI: 10.1126/science.aan3456. In this paper, they test several models to identify clusters of differentially expressed genes between 3 species (H[uman]/C[himp]/M[acaque]). Some of their models they test are as followed: Model 0: count ~ Batch Model H: count ~ species + batch (2 levels of species: human, chimpanzee/macaque) Model A: count ~ species + batch (3 levels of species: human, chimpanzee, macaque)

To identify clusters of genes where H > C > M, they first test Model H vs reduced model 0. This also works fine in my code, as I can easily drop the complete factor species. However, I can't figure out how to combine chimpanzee/macaque, ultimately having something like:

dds <- DESeqDataSetFromMatrix(countData=counts, colData=metaData, design=~Batch + ModelA)
dds <- DESeq(dds, test="LRT", reduced = ~Batch + ModelH)

As this gives the error:

Error in checkLRT(full, reduced) : 
  the following variables in the reduced formula not in the full formula: ModelH

Does anyone have an idea how to reduce the factor specie from 3 to 2 for the reduced model, so I can compare these 2 models?

Thanks in advance!

deseq2 LRT • 388 views
Entering edit mode
Last seen 15 hours ago
United States

This approach should work:

> species1 <- factor(rep(1:2,c(5,10)))
> species1
 [1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
> species2 <- factor(rep(1:2,c(10,5)))
> species2
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2

Then just using ~species1 + species2 gives the following design matrix:

> model.matrix(~ species1 + species2)
   (Intercept) species12 species22
1            1         0         0
2            1         0         0
3            1         0         0
4            1         0         0
5            1         0         0
6            1         1         0
7            1         1         0
8            1         1         0
9            1         1         0
10           1         1         0
11           1         1         1
12           1         1         1
13           1         1         1
14           1         1         1
15           1         1         1
Entering edit mode

Thank you for your fast comment, and excuse me for my belated reply!

I have been working on this for the last few weeks and came across some errors in my own code. I am still having doubts on my final comparison, as I want to include batch effects in my models as well.

My first design is: ~Batch+Species (3 levels, human/chimp/mac), vs reduced ~Batch. - This is to identify any gene showing differential expression between the 3 species.

On these identified genes I do post-hoc testing. To do so I added the 2 factors in my metadata as you suggested [species 1, species2], such that human = [1 1], chimp = [1 2], macaque = [2 2].

My first test is to differentially expressed genes in humans versus the combined group. Full model = ~Batch + species1; reduced = ~Batch.

My second post hoc comparison however is to identify the non significant genes between the 3 species comparison controlled for batch effects, and the human vs 2 species. model.matrix(~Batch + Species1 + species2) however returns that the model is not full rank,

Therefore I used the following approach: - Full model: ~Batch + species1 + species2 - Reduced: ~Batch + species1 and extract genes > FDR padj value. However, I am in doubt if I am using this comparison as you have illustrated it.

I'm looking forward to your reply and want to thank you again for your help!

Entering edit mode

If you want to assess if a particular LFC is close to zero, we have a framework for a minimum effect size using lfcThreshold and altHypothesis (see documentation). Here you have to define “close” — you have to come up with an LFC that is close enough to zero for your application.

Entering edit mode


Again thanks for the quick reply. It's not the LFC per se, but the models used in the second post hoc comparison:

using an LRT test i do: Full model: ~Batch + species1 + species2 vs Reduced: ~Batch + species1

and results are res <- results(dds,contrast = c("species2","1","2"))

From these results I extract > padj. It questions me if this is the correct way for a comparison of the 3 species (as represented in the full model by the combination of species 1 & species2) vs a 2 species comparison in the reduced model.

Entering edit mode

This is just an LFC (coefficient) though. Adjusted p-value for this coefficient being large doesn't really represent anything. However, we do have a framework for evidence that this coefficient is between -theta and theta, with FDR bound etc.

Entering edit mode

Interesting! Another approach thus could also be to:

  • Conduct the models as described above
  • Extract genes based on mean counts per group (I found a line of code by you in another post): i.e. all genes with base mean counts of human > chimp & macaques (as LFC is calculated of these values, right?)
  • of these genes extract (non)significant genes (i.e. padj < fdr or > fdr)

Or do I misunderstand you?

Entering edit mode

I don't recommend this -- it's best to avoid interpreting p > alpha or adjusted p > alpha as evidence for something.

Entering edit mode

Thank you for your answer!

As I am trying to replicate the original findings of the paper, they state that their thresholds are the FDR and the 'mean'. In further analyses/illustrations they highlight 2-fold changes. So the approach as described seems to be replicating their approach I believe.

I do understand your argument on the p-values. Will definitely take that into account on further analyses!

Glad to hear that the 3-group approach with species1 + species2 + batch effects seems the right one for the 3 species comparison. Thank you for your help!


Login before adding your answer.

Traffic: 245 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6