Hi,
In DESeq2, is the Cook's distance-based flagging of p-values performed when using a continuous variable?
I don't see how the condition:
"At least 3 replicates are required for flagging, as it is difficult to judge which sample
might be an outlier with only 2 replicates."
can realistically be met for continuous variables.
If that's the case, how can I avoid genes where one or more samples have high Cook's distances (i.e.: count outliers) coming up as significant?
Please let me know if this needs clarification.
Thanks!
Hi Michael,
Thanks for replying so quickly.
Does this mean that setting a different
cooksCutoff
argument inresults()
does not affect the p-values for continuous variables?Can you provide an example and/or code for the plot(s) you're talking about?
What do you think could be the best way to determine a Cook's distance threshold that I can use to manually flag those genes in the
DeSeqResults
object?Thanks!
That's correct, it won't affect p-values with a continuous variable in the design. Here's some code for making diagnostic plots:
After looking at examples and upon deciding you want to filter on Cook's distance per gene, you can specify cooksCutoff:
res$pvalue[mcols(dds)$maxCooks > cooksCutoff] <- NA # optionally, also mean filtering: res$pvalue[res$baseMean < meanFilter] <- NA res$padj <- p.adjust(res$pvalue, method="BH")
Michael, thanks, that's really helpful.
Is there a reason why I can't just use the 99% percentile of the F(p, m-p) distribution as the Cook's cutoff and apply it across all counts regardless of the treatment group?
I computed this with
qf(0.99, p, m-p)
and the result is about 10. If you take a look at the plots below you can see that this is going to throw out nearly all of my genes. Based on the plots, what do you think would be a reasonable cutoff and why?Thank you!
I was just going to suggest looking for outliers (large mean or fold change and large Cook's distance). It looks like there's a problem with the scaling here. Are you using the latest version of DESeq2 (v1.8)?
Yes, it's the latest version (DESeq2_1.8.1). Note that the y-axis is log-scaled.
Other than that, I guess it's just that data is noisy, with high variability, hence the large Cook's distances.
Still, some of the genes that I get out of the differential expression analysis do display nice trends. Some others are clearly flagged as significant just because there is one sample that is a count outlier. These are the genes I'm trying to remove.
I'm leaning towards a cutoff of 1e9 or 1e10 to get rid of the biggest outliers, does that make sense?
I was unable to recreate these large values using continuous variables on a simulated dataset on my end, but I'll keep trying. My recommendation is to use the examples that you've identified as outliers to come up with a threshold. If that doesn't look reasonable (if the Cook's distance for your examples is in the middle of this distribution here), then you can filter a dataset with many large outliers using another statistic, for example, the percent of the row sum which is in a single sample (divide the row max by the row sum of normalized counts).
I've investigated this further and noticed that the same sample was almost always the responsible for the extraordinarily high Cook's distance.
This is what those diagnostic plots look like after removing that sample:
http://i.imgur.com/XajDiSP.png
http://i.imgur.com/STjemVn.png
This is very interesting considering that the same sample didn't show anything odd in the PCA and hierarchical clustering. However, it turns out this sample was sequenced under different conditions and had shorter read length but much higher sequencing depth.
I'm probably going to remove this sample but I'm not entirely sure what's going on, do you understand? I would expect such a big outlier to be easily identified during PCA/clustering, no?
You said earlier "Some [genes] are clearly flagged as significant just because there is one sample that is a count outlier."
Is it this one sample that is the consistent count outlier, or is it different samples for different genes?
I wish I could answer this with a simple yes/no.
That sample is a count outlier most of the time, but not always. There are also other 3 samples that are sometimes outliers when looking at the counts, albeit less frequently.
If that sample was often a count outlier, in addition to the different sequencing conditions, I would agree you should remove it.