Question

DESeq2 outlier flagging with 2 replicates

1

Entering edit mode

stuart ▴ 10

@stuart-9561

Last seen 4.6 years ago

Australia

Hi all,

We have a 2-factor, 2 levels per factor experimental design with 2 replicates per covariate combination (so 8 samples total, n=2 per cell) that we are analysing using DESeq2.

Normally DESeq2 generates a Cook's distance for each replicate within a cell (by "cell" I mean unique combination of covariates) and flags those observations with a Cook's distance above a certain threshold. Now, I understand that with n=2 you can't tell which replicate is an outlier when you see a high Cook's distance, but this doesn't really matter for us since we are usually flagging and removing entire genes (not just individual observations) from the analysis, so we just need to detect whether one of the 2 reps is an outlier. (My understanding is that if n >= 7 the default behaviour changes to a more precise replacement of the individual observation with the high Cook's distance, but we are not aiming for this.) I see that DESeq2 still generates Cook's distances when n=2, and these roughly correlate with gene-wise dispersion. My question is can we legitimately use these to filter out genes containing an outlier?

If so, what threshold to use? From the vignette: The default Cook’s distance cutoff for the two behaviors described above depends on the sample size and number of parameters to be estimated. The default is to use the 99% quantile of the F(p,m-p) distribution (with p the number of parameters including the intercept and m number of samples). But I would not know how to calculate that threshold value.

If not, is there any other acceptable strategy for flagging genes with a likely outlier observation when n=2?

deseq2 • 1.5k views

ADD COMMENT • link updated 6.6 years ago by Michael Love 43k • written 6.6 years ago by stuart ▴ 10

score 1 · Answer 1 · 2019-06-20

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 16 hours ago

United States

Sure, you can use the Cook's distances to flag genes. You should have access to the mcols(dds)$maxCooks and the cutoff is:

cooksCutoff <- qf(.99, p, m - p)

Where m is the number of samples, and p is the number of coefficients, e.g. length of resultsNames(dds).

ADD COMMENT • link 6.6 years ago Michael Love 43k

0

Entering edit mode

Thank you Michael. However, I found that mcols(dds)$maxCooks was all NAs by default when n=2, so I added it with:

mcols(dds)$maxCooks <- apply(assays(dds)[["cooks"]],1,max)

I hope this is equivalent to the normal calculation.

Best,

Stuart

ADD REPLY • link 6.6 years ago stuart ▴ 10