Question: DESeq2 outlier flagging with 2 replicates
1
27 days ago by
stuart10
stuart10 wrote:

Hi all,

We have a 2-factor, 2 levels per factor experimental design with 2 replicates per covariate combination (so 8 samples total, n=2 per cell) that we are analysing using DESeq2.

Normally DESeq2 generates a Cook's distance for each replicate within a cell (by "cell" I mean unique combination of covariates) and flags those observations with a Cook's distance above a certain threshold. Now, I understand that with n=2 you can't tell which replicate is an outlier when you see a high Cook's distance, but this doesn't really matter for us since we are usually flagging and removing entire genes (not just individual observations) from the analysis, so we just need to detect whether one of the 2 reps is an outlier. (My understanding is that if n >= 7 the default behaviour changes to a more precise replacement of the individual observation with the high Cook's distance, but we are not aiming for this.) I see that DESeq2 still generates Cook's distances when n=2, and these roughly correlate with gene-wise dispersion. My question is can we legitimately use these to filter out genes containing an outlier?

If so, what threshold to use? From the vignette: The default Cook’s distance cutoff for the two behaviors described above depends on the sample size and number of parameters to be estimated. The default is to use the 99% quantile of the F(p,m-p) distribution (with p the number of parameters including the intercept and m number of samples). But I would not know how to calculate that threshold value.

If not, is there any other acceptable strategy for flagging genes with a likely outlier observation when n=2?

deseq2 • 65 views
modified 26 days ago by Michael Love24k • written 27 days ago by stuart10
Answer: DESeq2 outlier flagging with 2 replicates
1
26 days ago by
Michael Love24k
United States
Michael Love24k wrote:

Sure, you can use the Cook's distances to flag genes. You should have access to the mcols(dds)$maxCooks and the cutoff is: cooksCutoff <- qf(.99, p, m - p)  Where m is the number of samples, and p is the number of coefficients, e.g. length of resultsNames(dds). ADD COMMENTlink written 26 days ago by Michael Love24k Thank you Michael. However, I found that mcols(dds)$maxCooks was all NAs by default when n=2, so I added it with:

mcols(dds)\$maxCooks <- apply(assays(dds)[["cooks"]],1,max)


I hope this is equivalent to the normal calculation.

Best,

Stuart