I have an RNA-seq dataset with 32 biological replicates, where the variables of interest are continuous traits in the phenotype data and their relationship to gene expression. So far I have been using the DESeq2 and sleuth modelling options and treating the per-gene LRT p-values as the only cutoff for what constitutes a "robust" relationship (since the betas are difficult to interpret).
The models look as follows:
gene ~ intercept + batch.covariate1 + batch.covariate2 + trait (continuous)
Several questions on this topic:
1. If I extract the normalized counts from DESeq2 and run the same regressions (negative binomial) myself, the results don't match the DESeq2 LRT results. I understand that the LRT is not the same test, but wouldn't some overlap be expected? I'm also not clear on how DESeq2 performs its normalization prior to the modelling step when there are no conditions, or whether that's even relevant here. Either way, are there any issues with using DESeq2 this way?
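For what it's worth, my understanding is that DESeq2's size factors (median-of-ratios, `estimateSizeFactors`) are computed from the count matrix alone, not from the design, which is why they are well defined even without a condition variable. A NumPy sketch of that computation, on a simulated placeholder matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(50, size=(1000, 32)).astype(float)  # genes x samples (toy data)

# 1. log geometric mean per gene across samples
with np.errstate(divide="ignore"):
    log_counts = np.log(counts)
geo_mean = log_counts.mean(axis=1)
finite = np.isfinite(geo_mean)          # drop genes with a zero in any sample

# 2. per-sample size factor = median ratio of counts to the geometric mean
ratios = log_counts[finite] - geo_mean[finite, None]
size_factors = np.exp(np.median(ratios, axis=0))

# dividing by the size factors gives the normalized counts
norm_counts = counts / size_factors
```

Note that the GLM itself is still fit to the raw counts with the size factors as offsets, not to the normalized counts, which may be part of why re-running regressions on normalized counts doesn't reproduce the LRT results.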
2. Another option, which seems rather unpopular at this point, is to treat the gene count data as a predictor and the phenotype trait as the outcome (intuitively this makes a lot more sense to me), as in:
trait ~ intercept + batch.covariates + gene
I've seen one study which took this approach: (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4829873/)
My issue here, though, is that using their approach - robust regression - I get what seem like way too many significant "hits" (in the hundreds to thousands, out of ca. 15K genes being tested). That seems unlikely, but then again, looking at the per-gene scatterplots, the relationships do appear to be there.
If anyone has dealt with these questions before, I would appreciate any help!