I read the code of DESeq2 roughly and realized that for each gene, one dispersion estimate is used across all samples. My work is to compare the gene expression profiles of mice from 2 different genetic backgrounds (WT and KO). Certain genes are expected to be highly expressed in one genotype but not expressed in the other. I'm curious about how DESeq2 would estimate 1 dispersion value that would fit both genotypes.
I read from the DESeq2 vignette that my design formula will affect the way dispersion is calculated, but no details were given. I couldn't find details regarding how DESeq2 handles this from the reference manual either. I'm not very familiar with bayesian models so I decided to reach out here instead of digging through the code.
The type of sample I receive looks like this:
df <- data.frame(sample = paste0("s", 1:6), genotype = c("WT", "KO", "WT", "KO","WT", "WT" ), batch = paste0("batch", c(1,1,2,2,3,3))) > df sample genotype batch 1 s1 WT batch1 2 s2 KO batch1 3 s3 WT batch2 4 s4 KO batch2 5 s5 WT batch3 6 s6 WT batch3
The current "flaw" is that in the
batch3 both are WT, because the experiment is not done yet and I'll have more batches coming.
To construct a generalized linear model, it seems reasonable to have the design formula like this:
~ batch + genotype. From my understanding of glm, the order of
genotype in the design formula shouldn't matter. However, DESeq2 vignette suggests putting the most important variable in the last place. I assume this is for the purpose of dispersion estimation?
Also I think with the design formula
~batch + genotype,
batch3 would be useless in the estimation of differential expression between genotypes since the differences are regressed out in the "batch" part, is that correct?
I would highly appreciate any input on this! Thanks~