Question

DESeq2 cooksCutoff ON/OFF: pairwise contrasts for many groups, some with only 2 reps

Zach Roe ▴ 10

@zach-roe-11189

Hello,

I am fairly new to DESeq2. I have used it before, but only for experiments with more replicates. I am hoping to get advice on how to handle DE contrasts for an experiment with 15 groups: 11 groups have 3 biological replicates/samples, but 4 groups are rarer and we could only obtain 2 samples for each.

I am doing 2 types of contrasts: (i) one group vs rest, (ii) pairwise group vs group. Afterwards, I would like to compare the set differences and set intersections of the DE genes across certain groups.

Because some of my groups have only 2 samples, I would like advice on how to deal with the inconsistent flagging of genes based on Cook's distance: flagging is applied when comparing groups with 3 samples, but not in contrasts involving groups with 2 samples. (DESeq2 user guide states: "The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA. At least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates. This filtering can be turned off with results(dds,cooksCutoff=FALSE).")

For contrasts between groups with 3 samples, genes in which a count outlier is detected get an adjusted p-value of NA and are ignored, which is ideal. For contrasts involving groups with only 2 samples, however, no flagging occurs. I do see genes called DE that have large variance within a 2-sample group, which is an issue because I would like to exclude these genes. It is also a problem for comparing genes that are commonly DE across certain groups, because the DE selection differs between contrasts.

Should I turn off this Cook's filtering (results(dds,cooksCutoff=FALSE)) for all contrasts and apply my own filter afterwards to maintain consistency? Could you advise how to apply this and how to find a threshold? I don't have experience with this. I had thought of leaving Cook's filtering on for the contrasts between groups with 3 samples and using those outliers as a reference, but that is limited to genes with outliers in those groups.

I have searched through several previous questions, but have not been able to find an answer that fits my situation. Please excuse me if I missed a suitable response; if so, could you kindly direct me to it?

Roez

deseq2 cookscutoff • 6.7k views

updated 7.2 years ago by Michael Love 41k • written 7.2 years ago by Zach Roe ▴ 10

score 3 · Answer 1 · 2017-02-13

hi Roez,

I'll start with just a quick explanation of Cook's distance: it measures, within each gene and for each sample, how much removing that sample would change the LFCs (all of the coefficients implied by the design and estimated by DESeq2).

So if you have e.g. 3 samples vs 2 samples, and the counts for a gene are [10,10,10] vs [15, 1000], you can see how the Cook's distance will be high for both samples in the second group. Removing either one changes the LFC for the comparison of the two groups. However, if it were [10,10,10] vs [50,50], the two samples "support" each other, such that removing one doesn't change the LFC at all. Hence, we find Cook's to be useful for identifying outliers.
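To make the two count patterns above concrete, here is a minimal sketch in base R. It is only an illustration under a loud assumption: it fits a Poisson GLM with `glm()` and uses `cooks.distance()` from the stats package, which is a simplified stand-in for the negative binomial model that DESeq2 fits, so the numbers will not match what DESeq2 reports. The helper name `cooks_for` is hypothetical.

```r
# Toy illustration only: a Poisson GLM from base R stands in for DESeq2's
# negative binomial GLM (an assumption made to keep the sketch dependency-free).
group <- factor(c("A", "A", "A", "B", "B"))

# Hypothetical helper: fit counts ~ group and return one Cook's distance
# per sample, labelled by group and replicate.
cooks_for <- function(counts) {
  fit <- glm(counts ~ group, family = poisson())
  setNames(cooks.distance(fit), paste0(group, c(1:3, 1:2)))
}

# Group B disagrees with itself: dropping either B sample moves the LFC.
outlier_case <- cooks_for(c(10, 10, 10, 15, 1000))

# Group B samples "support" each other: dropping one leaves the LFC unchanged.
supported_case <- cooks_for(c(10, 10, 10, 50, 50))

print(round(outlier_case, 2))
print(round(supported_case, 2))
```

In the first case both B samples get the same large value, which is the point made in the answer: with 2 replicates the distance cannot tell you which of the two samples is the odd one.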

However, having only 2 samples makes it really problematic to identify outliers. In particular, there's really no way to say whether one or the other sample is an "outlier", or whether it's just a gene with high dispersion (in addition to increased expression, e.g. in the above example). 3 samples is really the bare minimum, but nevertheless we do enable filtering of genes which may contain extreme count outliers in that case.

There's not much to do here to mitigate the effect of extreme count outliers. For a given comparison, you can look at the Cook's distances for the samples involved in the comparison (the Cook's distances for every gene and every sample are in assays(dds)[["cooks"]]). But we don't have any automatic methods in place, because I'm skeptical of the utility of any method that would flag or filter based on outliers in a group of 2 samples. I think you have to examine these distances by eye and if necessary come up with a threshold that makes sense for you.
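As a rough sketch of what that manual inspection could look like, the snippet below turns off the automatic filter, pulls the per-sample Cook's distances from assays(dds)[["cooks"]], and takes the per-gene maximum over the samples in one contrast. Several things here are assumptions for illustration: the data are simulated with `makeExampleDESeqDataSet()`, the column names `maxCooks` and `flagged` and the variable `my_cut` are hypothetical, and the starting threshold simply reuses the 0.99 quantile of the F(p, m - p) distribution that `results()` documents as its default cutoff. It is a starting point to look at by eye, not a recommended filter for groups of 2.

```r
library(DESeq2)

set.seed(1)
# Simulated data: with m = 5 the example dataset has an unbalanced
# two-level 'condition' factor (levels A and B).
dds <- makeExampleDESeqDataSet(n = 500, m = 5)
dds <- DESeq(dds, quiet = TRUE)

# Genes x samples matrix of Cook's distances computed by DESeq()
cooks <- assays(dds)[["cooks"]]

# Keep only the samples that take part in the contrast of interest
in_contrast <- dds$condition %in% c("A", "B")
max_cooks <- apply(cooks[, in_contrast, drop = FALSE], 1, max)

# Assumed starting threshold: F(p, m - p) 0.99 quantile, where p is the
# number of model coefficients and m the number of samples
p <- length(resultsNames(dds))
m <- ncol(dds)
my_cut <- qf(0.99, p, m - p)

# Results without automatic Cook's filtering, annotated for manual review
res <- results(dds, contrast = c("condition", "B", "A"), cooksCutoff = FALSE)
res$maxCooks <- max_cooks
res$flagged <- !is.na(max_cooks) & max_cooks > my_cut

table(res$flagged)
head(res[order(-res$maxCooks),
         c("baseMean", "log2FoldChange", "padj", "maxCooks")])
```

Sorting by `maxCooks` and plotting the counts of the top genes is probably more informative than trusting any single cutoff, since a high value in a 2-sample group may reflect high dispersion rather than an outlier.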