Question

DESeq2 - couple of clarifications

Federico Marini ▴ 180

Hey Mike,

a couple of questions on DESeq2, but first of all, some code to make my questions reproducible:

library(airway)
library(DESeq2)
library(magrittr)
# library(airway) only attaches the package; the dataset itself must be loaded
data("airway")
dds_airway <- DESeq2::DESeqDataSetFromMatrix(assay(airway),
                                              colData = colData(airway),
                                              design = ~ cell + dex)
dds_airway <- DESeq(dds_airway)

1. alpha & independentFiltering. Could it be a tiny bug that when I set independentFiltering to FALSE, the alpha is somehow not "set" in the DESeqResults object? Please compare the outcomes of these commands:

results(dds_airway, contrast = c("dex", "trt", "untrt"), alpha = 0.05, independentFiltering = TRUE) %>% summary
results(dds_airway, contrast = c("dex", "trt", "untrt"), alpha = 0.05, independentFiltering = FALSE) %>% summary
results(dds_airway, contrast = c("dex", "trt", "untrt"), alpha = 0.05, independentFiltering = FALSE) %>% summary(alpha = 0.05)

2. For an app I am developing, I am trying to cover "automatically" the cases where the covariate is a factor, where it is continuous, and where it has more than two levels. A quick check that I am doing it right, according to the documentation:
   factor -> contrast = the 3-element vector
   numeric -> name = the character name of the numeric
   more than 2 levels -> rerun DESeq with "LRT" as test and then use the full & reduced model to specify the contrast

3. Moreover, are you by chance aware of a dataset where there was a (possibly meaningful) use of a continuous covariate? As a toy case I am using airway with the read length, and I am (correctly) getting very few hits. If not, do you know a robust way of simulating such a dataset?

4. I have seen you recommending the salmon path now for generating the counts, especially after the DTE/DGE/DTU paper by you and Charlotte. I found it a little harder to explain to cooperation partners because of the extra modeling step already at the counting level, and this is kind of keeping me in the "old and safe" featureCounts-based approach. Do you have a suggestion on how best to sell the advantages of the new method, apart from linking to your paper?

Thank you in advance!

Federico


updated 8.7 years ago by Michael Love • written 8.7 years ago by Federico Marini

score 1 · Accepted Answer · 2016-10-28

Regarding 'alpha' in results() and summary(): when you have independentFiltering=TRUE, alpha is used by results() to optimize the independent filtering, and it is then used again as the significance threshold by summary() when alpha is not explicitly provided to summary(). If you have independentFiltering=FALSE, then alpha is ignored by results() and not passed on to summary(), so summary() needs its own alpha argument. I've clarified this just now in the help page.
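A minimal sketch of the workaround, assuming DESeq2 is installed and behaves as described above (later versions may store alpha differently). To keep it self-contained it uses a small simulated dataset from makeExampleDESeqDataSet() rather than airway; the object names are just for illustration.

```r
library(DESeq2)

set.seed(1)
# Small simulated dataset (not airway): 1000 genes, 6 samples, conditions A and B
dds <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1)
dds <- DESeq(dds, quiet = TRUE)

# Independent filtering switched off
res_nofilt <- results(dds,
                      contrast = c("condition", "B", "A"),
                      alpha = 0.05,
                      independentFiltering = FALSE)

# Pass alpha to summary() explicitly instead of relying on the stored value
summary(res_nofilt, alpha = 0.05)

# The same threshold applied by hand to the adjusted p-values
n_sig <- sum(res_nofilt$padj < 0.05, na.rm = TRUE)
```

Passing alpha to summary() every time costs nothing and makes the threshold explicit regardless of the filtering setting.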

The second question sounds right, although for a factor with more than two levels, sometimes users want to do 2-3 pairwise comparisons (B vs A, C vs A, sometimes C vs B), and sometimes they want an LRT.
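The two options for a three-level factor could look roughly like this. It is a sketch on simulated data, assuming DESeq2 is installed; the levels A, B, C and the object names are made up for illustration.

```r
library(DESeq2)

set.seed(1)
# Simulated dataset with 9 samples; overwrite the condition with three levels
dds <- makeExampleDESeqDataSet(n = 500, m = 9)
dds$condition <- factor(rep(c("A", "B", "C"), each = 3))
design(dds) <- ~ condition

# Option 1: Wald tests, one results() call per pairwise comparison
dds_wald <- DESeq(dds, quiet = TRUE)
res_BvsA <- results(dds_wald, contrast = c("condition", "B", "A"))
res_CvsA <- results(dds_wald, contrast = c("condition", "C", "A"))
res_CvsB <- results(dds_wald, contrast = c("condition", "C", "B"))

# Option 2: likelihood ratio test for any difference across the three levels,
# comparing the full design against a reduced design without the factor
dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ 1, quiet = TRUE)
res_lrt <- results(dds_lrt)
```

The LRT p-value answers "does condition matter at all for this gene", while the log2 fold change column in res_lrt still refers to a single coefficient, so the pairwise tables remain the place to look for effect sizes.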

When other developers have worked on wrappers for DESeq2 (for example, ReportingTools), they've encountered a number of headaches by trying to call results() internally to their software, because it takes a lot of effort to provide all the functionality that results() provides. This is why I've often recommended that, if possible, developers let users interface with DESeq2::results() directly, and then operate on the DESeqResults table instead. But it's up to you.

I don't have a publicly available, processed dataset in mind with a numeric covariate, but I'm sure many exist. The trick is that you first need to do some exploration to make sure that a linear relationship between the covariate and log counts makes sense, i.e. to rule out the possibility of saturation, or convex or concave patterns.
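One possible way to do that exploration, sketched in base R on made-up data. The covariate x, the three simulated genes and the dispersion value are all assumptions for illustration, and the linear-versus-quadratic comparison on log2 counts is only a rough screen, not something DESeq2 does itself.

```r
set.seed(1)
n_samples <- 24
x <- seq(0, 5, length.out = n_samples)  # hypothetical numeric covariate

# Negative binomial counts whose mean is given on the log2 scale
sim_counts <- function(log2_mu, dispersion = 0.05) {
  rnbinom(length(log2_mu), mu = 2^log2_mu, size = 1 / dispersion)
}

counts <- rbind(
  linear    = sim_counts(6 + 0.8 * x),            # log-linear in x
  saturated = sim_counts(6 + 4 * x / (0.5 + x)),  # plateaus at high x
  flat      = sim_counts(rep(8, n_samples))       # no dependence on x
)

log_counts <- log2(counts + 1)

# For each gene, test whether adding a quadratic term improves on a straight line;
# a small p-value hints at saturation or a convex/concave pattern
curvature_p <- apply(log_counts, 1, function(y) {
  fit_lin  <- lm(y ~ x)
  fit_quad <- lm(y ~ x + I(x^2))
  anova(fit_lin, fit_quad)[2, "Pr(>F)"]
})
print(signif(curvature_p, 3))
```

Plotting log_counts against x for a handful of genes is the more direct check. Genes flagged here are ones where a single slope in the design may be a poor summary, and binning the covariate into a factor might be the safer choice.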

Re: selling the new methods, it's good to keep in mind that the estimated counts are highly correlated with the unique counts. The bonuses are: much faster and more efficient generation of these matrices, the possibility of recovering multi-mapping reads through probabilistic assignment, and avoiding any potential issues with DTU, which could throw off inference from gene-level unique counts.