Question: DESeq2 outlier flagging with 2 replicates
gravatar for stuart
4 months ago by
stuart10 wrote:

Hi all,

We have a 2-factor, 2 levels per factor experimental design with 2 replicates per covariate combination (so 8 samples total, n=2 per cell) that we are analysing using DESeq2.

Normally DESeq2 generates a Cook's distance for each replicate within a cell (by "cell" I mean unique combination of covariates) and flags those observations with a Cook's distance above a certain threshold. Now, I understand that with n=2 you can't tell which replicate is an outlier when you see a high Cook's distance, but this doesn't really matter for us since we are usually flagging and removing entire genes (not just individual observations) from the analysis, so we just need to detect whether one of the 2 reps is an outlier. (My understanding is that if n >= 7 the default behaviour changes to a more precise replacement of the individual observation with the high Cook's distance, but we are not aiming for this.) I see that DESeq2 still generates Cook's distances when n=2, and these roughly correlate with gene-wise dispersion. My question is can we legitimately use these to filter out genes containing an outlier?

If so, what threshold to use? From the vignette: The default Cook’s distance cutoff for the two behaviors described above depends on the sample size and number of parameters to be estimated. The default is to use the 99% quantile of the F(p,m-p) distribution (with p the number of parameters including the intercept and m number of samples). But I would not know how to calculate that threshold value.

If not, is there any other acceptable strategy for flagging genes with a likely outlier observation when n=2?

deseq2 • 101 views
ADD COMMENTlink modified 4 months ago by Michael Love25k • written 4 months ago by stuart10
Answer: DESeq2 outlier flagging with 2 replicates
gravatar for Michael Love
4 months ago by
Michael Love25k
United States
Michael Love25k wrote:

Sure, you can use the Cook's distances to flag genes. You should have access to the mcols(dds)$maxCooks and the cutoff is:

cooksCutoff <- qf(.99, p, m - p)

Where m is the number of samples, and p is the number of coefficients, e.g. length of resultsNames(dds).

ADD COMMENTlink written 4 months ago by Michael Love25k

Thank you Michael. However, I found that mcols(dds)$maxCooks was all NAs by default when n=2, so I added it with:

mcols(dds)$maxCooks <- apply(assays(dds)[["cooks"]],1,max)

I hope this is equivalent to the normal calculation.



ADD REPLYlink written 4 months ago by stuart10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 326 users visited in the last hour