So I have a biological question that seems best understood as a change in ratio of two sequenced products between two conditions. So given a design of counts ~ condition
the null hypothesis is that for each gene A
and a fixed reference B
:
log(mu_A_condition1) - log(mu_B_condition1) >= log(mu_A_condition2) - log(mu_B_condition2)
. This is partially derived from the need to treat the data as compositional, but retain the neat noise model of DESeq2, because just taking the log ratios and doing a linear regression (as in CoDA) seems to underestimate the noise and discard the fact that low counts are more noisy than high counts.
My understanding is that this null hypothesis can actually be tested with DESeq2 with a little tweak, because with dummy coding it translates to beta_A_condition2 - beta_B_condition2 <= 0
, so I can get the maximum likelihood estimate by just subtracting the two beta
values given by DESeq2, get the SE by taking sqrt(beta_A_SE^2 + beta_B_SE_^2)
(as the betas should have 0 covariance, right?) which is all I need to compute the Wald statistic and then I can once again carry on the same way DESeq2 does (p-adjustment, shrinkage, ...)
The question is:
- Does this seem sensible?
- What are some reasons this might be a bad idea?
- Are you aware of some work already using this approach?
- Is there some better alternative to this?
The only place I found this question asked previously is at https://www.biostars.org/p/470418/ where it got no answers.
Note that despite also being related to RIP-seq, this is a different question than DESeq2 testing ratio of ratios (RIP-Seq, CLIP-Seq, ribosomal profiling) as I do not want to compare ratio of ratios (IP/input) for the same gene between two conditions but rather want to get ratio of ratios (gene A / gene B) between IP and input.
Thanks for developing and maintaining DESeq2 - it is a great and robust tool!
Thanks for looking into this!
I agree this might be the case in the real world, but the DESeq2 model seems to assume they are not correlated (if I understand it correctly) - I definitely couldn't find a place where you would compute the Hessian or anything similar for the DESeq2 fit. The only place where I can see a non-zero covariance between betas (i.e. non-zero Hessian element for the two betas) could arise in the model is due to the dispersion fit and that should intuitively be small-ish as it does not affect the means of the expression directly. So if I trust the DESeq2 model to be sensible, then I implicitly also trust that the covariance is zero/small - or am I missing something?
Basically keeping the reference fixed. For different conditions, we have 1 or 2 genes that could act as a sensible reference, but even when there are two references, we still want to compute all genes against both references.
DESeq2 in the end (after moderating dispersion values) tests each gene independently and applies BH correction to p-values, but we don't make assumptions about the covariance of LFC across genes because we don't compare LFC across genes. Here you would have to worry about that correlation. I don't see how you can assume that it is going to be zero/small.
If you want to use 1 or 2 genes as a reference, this sounds like you could simply use
controlGenes
when runningestimateSizeFactors
in front ofDESeq()
. ThenDESeq()
will be testing for differences in the counts, compared the expected values given the counts the samples had for the control gene(s). For the control genes, do all samples have a positive (non-zero) count?Oh, using
controlGenes
seems like a very reasonable approach, thanks for the hint. Indeed it yields very similar results to my custom method which is reassuring. All the control genes are quite highly expressed in all conditions.If I get it correctly the difference is the using
controlGenes
should let me handle the between-gene correlation, at the cost of taking the reference gene as a basically noise-free measurement, right? (I understand this is very much beyond the normal support for DESeq2, so feel free to not respond, if you don't have the time)Thanks very much for the help!
Yes, it's true that we take the reference gene(s) as a noise free measurement of the reference variation with this approach, and control against it. So that is a downside.
I'm not sure which approach I prefer, I can see the simplicity of your approach as well, and that it doesn't assume the reference gene is measured without noise.
Another thought would be: can you find other genes that would help to build up the reference profile, by looking for those genes that are highly expressed and correlated with your current set of 1-2 reference genes?