DESeq LRT reduced model
1
0
Entering edit mode
bio_inc • 0
@bio_inc-23071
Last seen 18 months ago

Hi,

I'm currently exploring the functions in DESeq2 and try to replicate some findings of DOI: 10.1126/science.aan3456. In this paper, they test several models to identify clusters of differentially expressed genes between 3 species (H[uman]/C[himp]/M[acaque]). Some of their models they test are as followed: Model 0: count ~ Batch Model H: count ~ species + batch (2 levels of species: human, chimpanzee/macaque) Model A: count ~ species + batch (3 levels of species: human, chimpanzee, macaque)

To identify clusters of genes where H > C > M, they first test Model H vs reduced model 0. This also works fine in my code, as I can easily drop the complete factor species. However, I can't figure out how to combine chimpanzee/macaque, ultimately having something like:

dds <- DESeqDataSetFromMatrix(countData=counts, colData=metaData, design=~Batch + ModelA)
dds <- DESeq(dds, test="LRT", reduced = ~Batch + ModelH)


As this gives the error:

Error in checkLRT(full, reduced) :
the following variables in the reduced formula not in the full formula: ModelH


Does anyone have an idea how to reduce the factor specie from 3 to 2 for the reduced model, so I can compare these 2 models?

deseq2 LRT • 388 views
0
Entering edit mode
@mikelove
Last seen 15 hours ago
United States

This approach should work:

> species1 <- factor(rep(1:2,c(5,10)))
> species1
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
> species2 <- factor(rep(1:2,c(10,5)))
> species2
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2


Then just using ~species1 + species2 gives the following design matrix:

> model.matrix(~ species1 + species2)
(Intercept) species12 species22
1            1         0         0
2            1         0         0
3            1         0         0
4            1         0         0
5            1         0         0
6            1         1         0
7            1         1         0
8            1         1         0
9            1         1         0
10           1         1         0
11           1         1         1
12           1         1         1
13           1         1         1
14           1         1         1
15           1         1         1

0
Entering edit mode

Thank you for your fast comment, and excuse me for my belated reply!

I have been working on this for the last few weeks and came across some errors in my own code. I am still having doubts on my final comparison, as I want to include batch effects in my models as well.

My first design is: ~Batch+Species (3 levels, human/chimp/mac), vs reduced ~Batch. - This is to identify any gene showing differential expression between the 3 species.

On these identified genes I do post-hoc testing. To do so I added the 2 factors in my metadata as you suggested [species 1, species2], such that human = [1 1], chimp = [1 2], macaque = [2 2].

My first test is to differentially expressed genes in humans versus the combined group. Full model = ~Batch + species1; reduced = ~Batch.

My second post hoc comparison however is to identify the non significant genes between the 3 species comparison controlled for batch effects, and the human vs 2 species. model.matrix(~Batch + Species1 + species2) however returns that the model is not full rank,

Therefore I used the following approach: - Full model: ~Batch + species1 + species2 - Reduced: ~Batch + species1 and extract genes > FDR padj value. However, I am in doubt if I am using this comparison as you have illustrated it.

0
Entering edit mode

If you want to assess if a particular LFC is close to zero, we have a framework for a minimum effect size using lfcThreshold and altHypothesis (see documentation). Here you have to define “close” — you have to come up with an LFC that is close enough to zero for your application.

0
Entering edit mode

Hi,

Again thanks for the quick reply. It's not the LFC per se, but the models used in the second post hoc comparison:

using an LRT test i do: Full model: ~Batch + species1 + species2 vs Reduced: ~Batch + species1

and results are res <- results(dds,contrast = c("species2","1","2"))

From these results I extract > padj. It questions me if this is the correct way for a comparison of the 3 species (as represented in the full model by the combination of species 1 & species2) vs a 2 species comparison in the reduced model.

0
Entering edit mode

This is just an LFC (coefficient) though. Adjusted p-value for this coefficient being large doesn't really represent anything. However, we do have a framework for evidence that this coefficient is between -theta and theta, with FDR bound etc.

0
Entering edit mode

Interesting! Another approach thus could also be to:

• Conduct the models as described above
• Extract genes based on mean counts per group (I found a line of code by you in another post): i.e. all genes with base mean counts of human > chimp & macaques (as LFC is calculated of these values, right?)
• of these genes extract (non)significant genes (i.e. padj < fdr or > fdr)

Or do I misunderstand you?

0
Entering edit mode

I don't recommend this -- it's best to avoid interpreting p > alpha or adjusted p > alpha as evidence for something.

0
Entering edit mode

As I am trying to replicate the original findings of the paper, they state that their thresholds are the FDR and the 'mean'. In further analyses/illustrations they highlight 2-fold changes. So the approach as described seems to be replicating their approach I believe.

I do understand your argument on the p-values. Will definitely take that into account on further analyses!

Glad to hear that the 3-group approach with species1 + species2 + batch effects seems the right one for the 3 species comparison. Thank you for your help!