DESeq2 outlier flagging with 2 replicates
1
1
Entering edit mode
stuart ▴ 10
@stuart-9561
Last seen 3.4 years ago
Australia

Hi all,

We have a 2-factor, 2 levels per factor experimental design with 2 replicates per covariate combination (so 8 samples total, n=2 per cell) that we are analysing using DESeq2.

Normally DESeq2 generates a Cook's distance for each replicate within a cell (by "cell" I mean unique combination of covariates) and flags those observations with a Cook's distance above a certain threshold. Now, I understand that with n=2 you can't tell which replicate is an outlier when you see a high Cook's distance, but this doesn't really matter for us since we are usually flagging and removing entire genes (not just individual observations) from the analysis, so we just need to detect whether one of the 2 reps is an outlier. (My understanding is that if n >= 7 the default behaviour changes to a more precise replacement of the individual observation with the high Cook's distance, but we are not aiming for this.) I see that DESeq2 still generates Cook's distances when n=2, and these roughly correlate with gene-wise dispersion. My question is can we legitimately use these to filter out genes containing an outlier?

If so, what threshold to use? From the vignette: The default Cook’s distance cutoff for the two behaviors described above depends on the sample size and number of parameters to be estimated. The default is to use the 99% quantile of the F(p,m-p) distribution (with p the number of parameters including the intercept and m number of samples). But I would not know how to calculate that threshold value.

If not, is there any other acceptable strategy for flagging genes with a likely outlier observation when n=2?

deseq2 • 1.2k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 4 days ago
United States

Sure, you can use the Cook's distances to flag genes. You should have access to the mcols(dds)$maxCooks and the cutoff is:

cooksCutoff <- qf(.99, p, m - p)

Where m is the number of samples, and p is the number of coefficients, e.g. length of resultsNames(dds).

ADD COMMENT
0
Entering edit mode

Thank you Michael. However, I found that mcols(dds)$maxCooks was all NAs by default when n=2, so I added it with:

mcols(dds)$maxCooks <- apply(assays(dds)[["cooks"]],1,max)

I hope this is equivalent to the normal calculation.

Best,

Stuart

ADD REPLY

Login before adding your answer.

Traffic: 603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6