Question

DESeq2 with continuous variables - predictors and/or outcomes

0

Entering edit mode

stoyo_karamihalev • 0

@stoyo_karamihalev-13297

Last seen 7.8 years ago

Dear all,

I have an RNAseq dataset with 32 biological replicates, where the points of interest are continuous (trait) variables in the phenotype data and their relationship to gene expression. So far I have been using the DESeq2 and Sleuth modelling options and treating the LRT p-values per gene as the only cutoff for what constitutes "robust" relationships (since the betas are difficult to interpret).

The models look as follows:

gene ~ intercept + batch.covariate1 + batch.covariate2 + trait (continuous)

Several questions on this topic:

1. If I extract the norm counts from deseq and run the same regressions (negative binomial) myself, the results don't match the deseq LRT results. I understand that LRT is not the same test, but wouldn't some overlap be expected? I'm not really clear on how deseq performs its normalizations prior to the modelling step when there are no conditions, or if that's even relevant here. Either way, are there any issues with using deseq in this way?

2. Another option for approaching, which seems at this point rather unpopular, is treat the gene count data as a predictor and the phenotype trait as the outcome (intuitively this makes a lot more sense to me), as in:

trait ~ intercept + batch.covariates + gene

I've seen one study which took this approach: (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4829873/)

My issue here, though, is that using their approach - robust regression - I what seem like way too many significant "hits" (in the hundreds to thousands, out of ca. 15K genes being tested). Seems unlikely, though, looking at the scatterplots per gene, the relationships do appear to be there.

If anyone has dealt with these questions before, I would appreciate any help!

Kind regards,

Stoyo

deseq2 negative binomial model robust regression sleuth • 6.4k views

ADD COMMENT • link updated 7.8 years ago by Michael Love 43k • written 7.8 years ago by stoyo_karamihalev • 0

score 1 · Answer 1 · 2017-06-21

The main differences between taking normalized counts and running a NB regression per gene and what DESeq2 returns is: (1) DESeq2 uses a max posterior dispersion estimate, borrowing information from all genes, so it has much less error compared to what the NB GLM will estimate by looking at a single gene, (2) DESeq2 includes the normalization as part of the offset, and works on raw counts, which gives a boost to the model in terms of knowing the precision of each measurement.

My advice would be to use a design of ~batch1 + batch2 + trait with reduced design of ~batch1 + batch2, then look at the top genes by padj using plotCounts. You may decide from looking at the plotCounts results that linear is not the best relationship (assuming linear between log counts and trait), in which case you could also consider binning the trait into 3 or 4 groups with the cut() function and selecting relevant breakpoints. Or you could use smooth functions of trait, e.g. the splines package, and include a spline term in the full design.