Question

[DESeq2] Cook's distance flagging for continuous variables

0

Entering edit mode

enricoferrero ▴ 660

@enricoferrero-6037

Last seen 3.1 years ago

Switzerland

Hi,

In DESeq2, is the Cook's distance-based flagging of p-values performed when using a continuous variable?

I don't see how the condition:

"At least 3 replicates are required for flagging, as it is difficult to judge which sample
might be an outlier with only 2 replicates."

can realistically be met for continuous variables.

If that's the case, how can I avoid genes where one or more samples have high Cook's distances (i.e.: count outliers) coming up as significant?

Please let me know if this needs clarification.

Thanks!

deseq2 rnaseq • 7.0k views

ADD COMMENT • link updated 9.6 years ago by Michael Love 43k • written 9.6 years ago by enricoferrero ▴ 660

score 3 · Answer 1 · 2015-06-02

3

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

hi Enrico,

Thanks for bringing this up. In earlier versions, I had filtering across the board but then over time came up with improvements which use robust within-group variance estimates. So currently for designs with continuous variables there is not automatic filtering on Cook's distance although Cook's distance is calculated for every gene x sample and stored in assays(dds)[["cooks"]]. So I would recommend for now if you want to examine for outliers they you do so through exploratory plots (plotting the maximum Cook's for each gene over the base mean or over the log fold change for example to identify outlier genes). This is missing in the documentation, and I need to fix this (I'll provide example code for making this plot as well).

maxCooks <- apply(assays(dds)[["cooks"]], 1, max)

ADD COMMENT • link 9.6 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

Thanks for replying so quickly.

Does this mean that setting a different cooksCutoff argument in results() does not affect the p-values for continuous variables?

Can you provide an example and/or code for the plot(s) you're talking about?

What do you think could be the best way to determine a Cook's distance threshold that I can use to manually flag those genes in the DeSeqResults object?

Thanks!

ADD REPLY • link 9.6 years ago enricoferrero ▴ 660

1

Entering edit mode

That's correct, it won't affect p-values with a continuous variable in the design. Here's some code for making diagnostic plots:

mcols(dds)$maxCooks <- apply(assays(dds)[["cooks"]], 1, max)
plot(mcols(dds)$baseMean, mcols(dds)$maxCooks)
# this requires you not filter or order the 'res' object
stopifnot(all.equal(rownames(dds), rownames(res)))
plot(res$log2FoldChange, mcols(dds)$maxCooks)

After looking at examples and upon deciding you want to filter on Cook's distance per gene, you can specify cooksCutoff:

res$pvalue[mcols(dds)$maxCooks > cooksCutoff] <- NA
# optionally, also mean filtering:
res$pvalue[res$baseMean < meanFilter] <- NA
res$padj <- p.adjust(res$pvalue, method="BH")

ADD REPLY • link 9.6 years ago Michael Love 43k

0

Entering edit mode

Michael, thanks, that's really helpful.

Is there a reason why I can't just use the 99% percentile of the F(p, m-p) distribution as the Cook's cutoff and apply it across all counts regardless of the treatment group?

I computed this with qf(0.99, p, m-p) and the result is about 10. If you take a look at the plots below you can see that this is going to throw out nearly all of my genes. Based on the plots, what do you think would be a reasonable cutoff and why?

Thank you!

ADD REPLY • link 9.5 years ago enricoferrero ▴ 660

1

Entering edit mode

I was just going to suggest looking for outliers (large mean or fold change and large Cook's distance). It looks like there's a problem with the scaling here. Are you using the latest version of DESeq2 (v1.8)?

ADD REPLY • link 9.5 years ago Michael Love 43k

0

Entering edit mode

Yes, it's the latest version (DESeq2_1.8.1). Note that the y-axis is log-scaled.

Other than that, I guess it's just that data is noisy, with high variability, hence the large Cook's distances.

Still, some of the genes that I get out of the differential expression analysis do display nice trends. Some others are clearly flagged as significant just because there is one sample that is a count outlier. These are the genes I'm trying to remove.

I'm leaning towards a cutoff of 1e9 or 1e10 to get rid of the biggest outliers, does that make sense?

ADD REPLY • link 9.5 years ago enricoferrero ▴ 660

1

Entering edit mode

I was unable to recreate these large values using continuous variables on a simulated dataset on my end, but I'll keep trying. My recommendation is to use the examples that you've identified as outliers to come up with a threshold. If that doesn't look reasonable (if the Cook's distance for your examples is in the middle of this distribution here), then you can filter a dataset with many large outliers using another statistic, for example, the percent of the row sum which is in a single sample (divide the row max by the row sum of normalized counts).

ADD REPLY • link 9.5 years ago Michael Love 43k

0

Entering edit mode

I've investigated this further and noticed that the same sample was almost always the responsible for the extraordinarily high Cook's distance.

This is what those diagnostic plots look like after removing that sample:

http://i.imgur.com/XajDiSP.png

http://i.imgur.com/STjemVn.png

This is very interesting considering that the same sample didn't show anything odd in the PCA and hierarchical clustering. However, it turns out this sample was sequenced under different conditions and had shorter read length but much higher sequencing depth.

I'm probably going to remove this sample but I'm not entirely sure what's going on, do you understand? I would expect such a big outlier to be easily identified during PCA/clustering, no?

ADD REPLY • link 9.5 years ago enricoferrero ▴ 660

1

Entering edit mode

You said earlier "Some [genes] are clearly flagged as significant just because there is one sample that is a count outlier."

Is it this one sample that is the consistent count outlier, or is it different samples for different genes?

ADD REPLY • link 9.5 years ago Michael Love 43k

0

Entering edit mode

I wish I could answer this with a simple yes/no.

That sample is a count outlier most of the time, but not always. There are also other 3 samples that are sometimes outliers when looking at the counts, albeit less frequently.

ADD REPLY • link 9.5 years ago enricoferrero ▴ 660

1

Entering edit mode

If that sample was often a count outlier, in addition to the different sequencing conditions, I would agree you should remove it.

ADD REPLY • link 9.5 years ago Michael Love 43k