Idea: Using DESeq2 to test a change in ratio of two genes
1
2
Entering edit mode
@49b96c40
Last seen 6 weeks ago
Czechia

So I have a biological question that seems best understood as a change in ratio of two sequenced products between two conditions. So given a design of counts ~ condition the null hypothesis is that for each gene A and a fixed reference B: log(mu_A_condition1) - log(mu_B_condition1) >= log(mu_A_condition2) - log(mu_B_condition2). This is partially derived from the need to treat the data as compositional, but retain the neat noise model of DESeq2, because just taking the log ratios and doing a linear regression (as in CoDA) seems to underestimate the noise and discard the fact that low counts are more noisy than high counts.

My understanding is that this null hypothesis can actually be tested with DESeq2 with a little tweak, because with dummy coding it translates to beta_A_condition2 - beta_B_condition2 <= 0, so I can get the maximum likelihood estimate by just subtracting the two beta values given by DESeq2, get the SE by taking sqrt(beta_A_SE^2 + beta_B_SE_^2) (as the betas should have 0 covariance, right?) which is all I need to compute the Wald statistic and then I can once again carry on the same way DESeq2 does (p-adjustment, shrinkage, ...)

The question is:

• Does this seem sensible?
• What are some reasons this might be a bad idea?
• Are you aware of some work already using this approach?
• Is there some better alternative to this?

The only place I found this question asked previously is at https://www.biostars.org/p/470418/ where it got no answers.

Note that despite also being related to RIP-seq, this is a different question than DESeq2 testing ratio of ratios (RIP-Seq, CLIP-Seq, ribosomal profiling) as I do not want to compare ratio of ratios (IP/input) for the same gene between two conditions but rather want to get ratio of ratios (gene A / gene B) between IP and input.

Thanks for developing and maintaining DESeq2 - it is a great and robust tool!

DESeq2 • 277 views
1
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

I don't think the betas have zero covariance, as gene expression tends to be correlated.

Do you want to do this for a single gene as reference for all other genes, or will you change B across different A?

0
Entering edit mode

Thanks for looking into this!

I don't think the betas have zero covariance, as gene expression tends to be correlated.

I agree this might be the case in the real world, but the DESeq2 model seems to assume they are not correlated (if I understand it correctly) - I definitely couldn't find a place where you would compute the Hessian or anything similar for the DESeq2 fit. The only place where I can see a non-zero covariance between betas (i.e. non-zero Hessian element for the two betas) could arise in the model is due to the dispersion fit and that should intuitively be small-ish as it does not affect the means of the expression directly. So if I trust the DESeq2 model to be sensible, then I implicitly also trust that the covariance is zero/small - or am I missing something?

Do you want to do this for a single gene as reference for all other genes, or will you change B across different A?

Basically keeping the reference fixed. For different conditions, we have 1 or 2 genes that could act as a sensible reference, but even when there are two references, we still want to compute all genes against both references.

1
Entering edit mode

DESeq2 in the end (after moderating dispersion values) tests each gene independently and applies BH correction to p-values, but we don't make assumptions about the covariance of LFC across genes because we don't compare LFC across genes. Here you would have to worry about that correlation. I don't see how you can assume that it is going to be zero/small.

If you want to use 1 or 2 genes as a reference, this sounds like you could simply use controlGenes when running estimateSizeFactors in front of DESeq(). Then DESeq() will be testing for differences in the counts, compared the expected values given the counts the samples had for the control gene(s). For the control genes, do all samples have a positive (non-zero) count?

0
Entering edit mode

Oh, using controlGenes seems like a very reasonable approach, thanks for the hint. Indeed it yields very similar results to my custom method which is reassuring. All the control genes are quite highly expressed in all conditions.

If I get it correctly the difference is the using controlGenes should let me handle the between-gene correlation, at the cost of taking the reference gene as a basically noise-free measurement, right? (I understand this is very much beyond the normal support for DESeq2, so feel free to not respond, if you don't have the time)

Thanks very much for the help!

1
Entering edit mode

Yes, it's true that we take the reference gene(s) as a noise free measurement of the reference variation with this approach, and control against it. So that is a downside.

I'm not sure which approach I prefer, I can see the simplicity of your approach as well, and that it doesn't assume the reference gene is measured without noise.

Another thought would be: can you find other genes that would help to build up the reference profile, by looking for those genes that are highly expressed and correlated with your current set of 1-2 reference genes?