Deciding on cooks cutoff value DESeq2
Entering edit mode
Last seen 2.9 years ago

I am trying to determine how best to manually set a cooksCutoff value for outlier detection, and have made the following observations:

          0%          25%          50%          75%         90%       95%       99%    100%
  0.00360426   0.08825488   0.21268380   0.73935583 2.970987   5.582727   13.544483   135.17274114

I've made a plot of maxCooks vs p-adj which I'm having trouble including in this post (but will continue trying to get it on here if someone more knowledgeable than I would like to take a look), but the graph essentially demonstrates that large maxCooks values do not necessarily correspond to low/significant p-adj values.

Additionally, if it makes any difference, my MA plot looks much different than what I think is to be expected. The vast majority of points, and colored points, are above the midline (me including this fact might just reveal how truly out of my depth I'm in).

When I inspect the number of significant genes I get, a lot of them seem to be influenced by the presence of a count outlier that seems to always be from the same sample. Is there a quantitative way to determine a cutoff, or am I left to eyeball it? Should I start by setting the cutoff to the top x% of maxCooks values and widdle away from there, or is there a more empirical way of approaching this? I'm pretty stumped, so any and all advice would be much appreciated.

Update: Setting a cooksCutoff value of 5 yields 3 significant genes that still seem to be influenced by a count outlier from the same sample

deseq2 differential gene expression outliers bioconductor • 1.9k views
Entering edit mode
Last seen 1 day ago
United States

Can you describe your experiment, what the column data looks like, and post your sessionInfo()?

When you look at a PCA plot of the samples, is this sample an outlier?

Entering edit mode

Thank you for your quick response. I am still waiting for the rlog transform function to finish running and will get the PCA plot in the morning. My experiment is looking at the difference in the change in gene expression between men and women as a result of a treatment. I have 113 paired samples, and used STAR to align and featureCounts to count genes. My design is ~ sex + sex:nested + sex:condition. To get the results, I first set the model matrix: apply(model, 2, function(x) all(x == 0)), get my dds: dds <- DESeq(dds, full = model, betaPrior = FALSE), and then make the contrast: sex_diff = results(results_source, contrast = list("sexFemale.conditionpost", "sexMale.conditionpost"))

My colData looks something like the following:

sample sex condition nested
sample1_pre M pre 1
sample1_post M post 1
sample70_pre F pre 1
sample70_post F pre 1
... ... ... ...


Any help is much appreciated, and thank you for being so generous with your time.

Entering edit mode

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] IHW_1.2.0                  DESeq2_1.14.1              SummarizedExperiment_1.4.0 Biobase_2.34.0            
[5] GenomicRanges_1.26.2       GenomeInfoDb_1.10.2        IRanges_2.8.1              S4Vectors_0.12.1          
[9] BiocGenerics_0.20.0       

loaded via a namespace (and not attached):
 [1] genefilter_1.56.0    locfit_1.5-9.1       slam_0.1-40          splines_3.3.1        lattice_0.20-34     
 [6] colorspace_1.3-2     htmltools_0.3.5      base64enc_0.1-3      survival_2.40-1      XML_3.98-1.5        
[11] foreign_0.8-67       DBI_0.5-1            BiocParallel_1.8.1   RColorBrewer_1.1-2   plyr_1.8.4          
[16] stringr_1.1.0        zlibbioc_1.20.0      munsell_0.4.3        gtable_0.2.0         htmlwidgets_0.8     
[21] memoise_1.0.0        latticeExtra_0.6-28  knitr_1.15.1         geneplotter_1.52.0   fdrtool_1.2.15      
[26] AnnotationDbi_1.36.2 htmlTable_1.9        Rcpp_0.12.9          acepack_1.4.1        xtable_1.8-2        
[31] backports_1.0.5      scales_0.4.1         checkmate_1.8.2      Hmisc_4.0-2          annotate_1.52.1     
[36] XVector_0.14.0       lpsymphony_1.2.0     gridExtra_2.2.1      ggplot2_2.2.1        digest_0.6.12       
[41] stringi_1.1.2        grid_3.3.1           tools_3.3.1          bitops_1.0-6         magrittr_1.5        
[46] lazyeval_0.2.0       RCurl_1.95-4.8       tibble_1.2           RSQLite_1.1-2        Formula_1.2-1       
[51] cluster_2.0.5        Matrix_1.2-8         data.table_1.10.2    assertthat_0.1       rpart_4.1-10        
[56] nnet_7.3-12 

Entering edit mode
I recommend vst() instead of rlog() when there are many samples. It's much faster.
Entering edit mode

The following links are to the images I have referenced:

boxplot of maxCooks:

maxCooks vs. padj:

example of gene with outlier at cooks ~5:

strange looking MA plot:

PCA plot:

PCA plot with original PCA outleirs removed:


The samples that appear to be outliers in both PCA plots are not the ones that appear to always be the outlier, such as the one in the 3rd image.

Any help is as always much appreciated

Entering edit mode

Regarding the MA plot, this looks fine to me, I would suspect that the female treatment effect is often stronger than the male treatment effect (as in the case of MPP4, which looks like an obvious significant gene to me), and the contrast is simply picking this up. Do you have a reason to suspect this is not the case?

You mentioned it is often a single sample which has outlier counts. Can you identify this sample by looking at the boxplots of assays(dds)[["cooks"]] ? If so, I would exclude a sample if it consistently has outliers. You have plenty of replicates, it won't hurt your power at all. If you do decide to remove this sample, I would remove both the pre and post sample for that one sample.

Because you have a model with more than 113 coefficients (you have a coefficient in the model for each pair of samples), it's easy for a single outlier to pull up the Cook's cutoff, because it is affecting the coefficient related to that pair of samples, but it's not a problem for inference on the treatment:sex terms, which is what you care about. I would just turn off Cook's related functions: cooksCutoff=FALSE, and look by eye at the genes with top mcols(dds)$maxCooks. If you don't see cases where the significance seems to be driven by a single outlier, I wouldn't worry about Cook's for this dataset. Note that, I don't think the outlier is affecting MPP4 at all. That just seems to be a gene with a true difference in treatment effect across sex.

Entering edit mode

Thank you for your thoughtful response. It's definitely reassuring to hear the MA plot is fine because we do suspect the female treatment effect is stronger.

When I remove the one sample (pre and post) in MPP4, it is no longer significant with a padj < 0.05, which makes me think that an outlier is influencing the results, although it sure looks like it ought to be a significant change without the outlier:

Additionally, when I remove the pre and post for that one sample, the number of differentially expressed genes goes from 716 to 18 at a p-adj <= 0.05, which also makes me think that an outlier is influencing the results.

Of these 18, there still appears to be significance driven by one or two samples (

Given all of this, should I stick with these 18 genes, or stick with the original 716 and accept that one point hanging in the upper right corner? I cannot express how grateful I am that you're being so generous with your time in helping me out!


Login before adding your answer.

Traffic: 393 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6