DESeq2 cooksCutoff ON/OFF: pairwise contrasts for many groups, some with only 2 reps
Zach Roe ▴ 10
@zach-roe-11189

Hello,

I am fairly new to DESeq2; I have used it before, but only for experiments with more replicates. I am hoping to get advice on how to handle DE contrasts for an experiment with 15 groups: 11 groups have 3 biological replicates each, but 4 groups are rarer and we could only obtain 2 samples for each.

I am doing 2 types of contrasts: (i) one group vs. the rest, and (ii) pairwise group vs. group. Afterwards, I would like to compare the set differences and set intersections of the DE genes across certain groups.

Because some groups have only 2 samples, I would like advice on how to handle the inconsistent flagging of genes based on Cook's distance: flagging works when comparing groups with 3 samples, but not in contrasts involving groups with 2 samples. (The DESeq2 user guide states: "The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA. At least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates. This filtering can be turned off with results(dds,cooksCutoff=FALSE).")

As a result, in contrasts between groups with 3 samples, genes with detected count outliers get an adjusted p-value of NA and are ignored, which is ideal; but in contrasts involving groups with only 2 samples, no flagging occurs. I do see genes called DE that have large variance within a group with only 2 samples, so this is an issue (I would like to exclude these genes). Because I am also interested in comparing genes that are commonly DE across certain groups, this seems to be a problem as well, since the DE selection differs between contrasts.

Should I turn off this Cook's filtering (results(dds,cooksCutoff=FALSE)) for all contrasts and apply my own filter afterwards to maintain consistency? Could you advise how to apply this and how to find a threshold? I don't have experience with this. I had thought of leaving Cook's filtering on for the contrasts between groups with 3 samples and using those outliers as a reference, but that is limited to genes with outliers in those groups.

I have searched through several previous questions, but have not been able to find an answer that fits my situation. Please excuse me if there is a suitable response that I missed; if so, could you kindly direct me to it?

Roez

deseq2 cookscutoff • 6.7k views
@mikelove

hi Roez,

I'll start with just a quick explanation of Cook's distance: it measures within each gene, for each sample, how removing that sample would change the LFCs (all of the coefficients implied by the design and estimated by DESeq2).

So if you have e.g. 3 samples vs 2 samples, and the counts for a gene are [10,10,10] vs [15, 1000], you can see how the Cook's distance will be high for the two samples. Removing either one changes the LFC for the comparison of the two groups. However, if it were [10,10,10] vs [50,50], the two samples "support" each other, such that removing one doesn't change the LFC at all. Hence, we find Cook's to be useful for identifying outliers.
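
To make the "removing a sample changes the LFC" intuition concrete, here is a minimal base-R sketch using the counts from the example above. It uses a crude log2 ratio of group means, which is an assumption for illustration and not the GLM estimate DESeq2 computes; the helper names `lfc` and `loo_lfc` are hypothetical.

```r
# Counts taken from the example above.
group_a  <- c(10, 10, 10)
outlier  <- c(15, 1000)   # the two samples disagree
agreeing <- c(50, 50)     # the two samples "support" each other

# Crude log2 fold change of group means (NOT DESeq2's GLM estimate).
lfc <- function(b, a) log2(mean(b) / mean(a))

# LFC recomputed after dropping each sample of group b in turn.
loo_lfc <- function(b, a) {
  sapply(seq_along(b), function(i) lfc(b[-i], a))
}

full_outlier  <- lfc(outlier, group_a)
loo_outlier   <- loo_lfc(outlier, group_a)
full_agreeing <- lfc(agreeing, group_a)
loo_agreeing  <- loo_lfc(agreeing, group_a)

# Dropping either sample of [15, 1000] moves the LFC a lot ...
print(round(c(full = full_outlier,
              drop_15 = loo_outlier[1],
              drop_1000 = loo_outlier[2]), 2))

# ... while dropping either sample of [50, 50] leaves it unchanged.
print(round(c(full = full_agreeing,
              drop_1st = loo_agreeing[1],
              drop_2nd = loo_agreeing[2]), 2))
```

Note that in the [15, 1000] case both leave-one-out values differ from the full estimate, so this view alone does not say which of the two samples is the odd one.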

However, having only 2 samples makes it really problematic to identify outliers. In particular, there's really no way to say if one or the other sample is an "outlier", or if it's just a gene with high dispersion (in addition to increased expression, e.g. in the above example). 3 samples is really the bare minimum, but nevertheless we do enable filtering of genes which may contain extreme count outliers.
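
The ambiguity with 2 samples can be seen in a small base-R sketch. It fits an ordinary linear model to log2 counts and uses `cooks.distance()` from the stats package; this is an assumption made for illustration, since DESeq2 computes Cook's distance within its negative binomial GLM. The sample labels are hypothetical.

```r
# Toy data from the example above: 3 samples in group A, 2 in group B.
counts <- c(10, 10, 10, 15, 1000)
group  <- factor(c("A", "A", "A", "B", "B"))

# Ordinary least squares on log2 counts; for illustration only,
# DESeq2 fits a negative binomial GLM instead.
fit <- lm(log2(counts) ~ group)

# One Cook's distance per sample.
cooks <- cooks.distance(fit)
names(cooks) <- c("A1", "A2", "A3", "B1", "B2")
print(cooks)
```

In this simple model the two samples in group B have the same leverage and residuals of equal size and opposite sign, so they receive the same Cook's distance: the statistic flags the pair, but cannot point to either sample as the outlier.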

There's not much to do here to mitigate the effect of extreme count outliers. For a given comparison, you can look at the Cook's distances for the samples involved in the comparison (the Cook's distances for every gene and every sample are in assays(dds)[["cooks"]]). But we don't have any automatic methods in place, because I'm skeptical of the utility of any method that would flag or filter based on outliers in a group of 2 samples. I think you have to examine these distances by eye and if necessary come up with a threshold that makes sense for you.
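
A minimal sketch of that manual inspection, assuming the DESeq2 package is installed. The data are simulated with `makeExampleDESeqDataSet()` as a stand-in for the real experiment, and the group labels, the 0.1 adjusted p-value cutoff and the object names are assumptions for illustration. No Cook's threshold is suggested here; the table is only meant to be looked at by eye.

```r
suppressPackageStartupMessages(library(DESeq2))

# Simulated stand-in for the real data: groups A and B have 3 samples,
# group C has only 2 (labels are hypothetical).
set.seed(1)
cts <- counts(makeExampleDESeqDataSet(n = 500, m = 8))
coldata <- data.frame(
  condition = factor(c("A", "A", "A", "B", "B", "B", "C", "C")),
  row.names = colnames(cts)
)
dds <- DESeqDataSetFromMatrix(cts, coldata, design = ~ condition)
dds <- DESeq(dds, quiet = TRUE)

# Turn off automatic flagging so every contrast is treated the same way.
res <- results(dds, contrast = c("condition", "C", "A"), cooksCutoff = FALSE)

# Cook's distances (genes x samples), restricted to this contrast's samples.
cooks <- assays(dds)[["cooks"]]
in_contrast <- dds$condition %in% c("A", "C")
max_cooks <- apply(cooks[, in_contrast, drop = FALSE], 1, max)

# List significant genes, largest Cook's distance first, for inspection.
sig <- which(res$padj < 0.1)
inspect <- data.frame(
  gene      = rownames(res)[sig],
  padj      = res$padj[sig],
  max_cooks = max_cooks[sig]
)
inspect <- inspect[order(-inspect$max_cooks), ]
print(head(inspect))
```

With simulated data like this there may be few or no significant genes, in which case the table is simply empty.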

Thank you Michael. I am wondering if I could run all the contrasts as is (ignoring the lack of flagging in the 2-sample cases), set my adj. p-val and LFC cutoffs, and then, after this DE analysis, apply a criterion that filters the DE genes further... possibly using the mean vs. difference pairwise plots of all biological replicates for each gene to find some threshold, and removing a gene from all contrasts if it exceeds this threshold. Mostly I just want to make sure to exclude genes if the within sample variance is higher than the between sample variance in those groups with only 2 samples.

hmm, I don't really follow your proposed filtering rule, and it sounds like it could potentially result in loss of control of false positives. In particular, you can't do this: "exclude genes if the within [group] variance is higher than between [group] variance". This will certainly result in loss of control of false positives.

Sorry for my confusion, thank you for pointing that out.  Then I should apply any filtering criteria before controlling for FDR and not after?

Roez 
