Cook's Distance Outliers: None reported in results table and "cooksCutoff" modification has no effect
1
0
Entering edit mode
@donnenfieldjonah-23963
Last seen 12 months ago

Hello everyone,

I am trying to increase the number of outliers identified and replaced in my use of DESeq2.

I create my DESeq object and get the following result

## Create DESeq2Dataset object
metadata$groups <- factor(metadata$groups, levels = c("CON","INJL","INJS","RCNL","RCNS","REPL","REPS"))
dds <- DESeqDataSetFromMatrix(count_file[,c(2:43)], colData = metadata, design = ~ groups)
## Run analysis
dds <- DESeq(dds, minReplicatesForReplace = 6)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 39 genes
-- DESeq argument 'minReplicatesForReplace' = 6 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

As I understand it, this means that 39 genes (i.e. rows from count_file[,c(2:43)]) were identified as having at least one count above the default cook's distance threshold--which in my case is 0.99*F(7,42-7) = 3.199952. Here, 7 is my number of groups, and 42 is my number of samples.

Upon looking at my final DEG's, I realized that I should try setting this cook's distance threshold lower because some of my top hits are still clearly being driven by outliers. Once again, I want more than 39 genes to be replaced.

As I understand it, I should be able to accomplish this by modifying my results function like the following:

contrast_REPL_RCNL <- c("groups", "REPL", "RCNL")
res_table_REPL_RCNL <- results(dds, contrast = contrast_REPL_RCNL, alpha = 0.05) # original
res_table_REPL_RCNL <- results(dds, contrast = contrast_REPL_RCNL, alpha = 0.05, cooksCutoff=1.0) # modified results function

However, this does nothing to change the results table output. In fact, the results table gives this result regardless of cooksCutoff:

## Number of genes removed due to extreme cook's distance values, where TRUE == removal
table(res_table_REPL_RCNL$baseMean > 0 & is.na(res_table_REPL_RCNL$pvalue))
FALSE 
31887

I'm very lost as to why I can't modify the cooksCutoff. My end goal is to have DESeq replace more than 39 genes. Could someone explain how I could make this happen?

Here is my sessionInfo()

> sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sjPlot_2.8.4                viridis_0.5.1               viridisLite_0.3.0           MASS_7.3-51.6               org.Ss.eg.db_3.11.4        
 [6] clusterProfiler_3.16.0      pathview_1.28.1             DOSE_3.14.0                 gridExtra_2.3               biomaRt_2.44.1             
[11] GenomicFeatures_1.40.1      AnnotationDbi_1.50.3        apeglm_1.10.0               ggrepel_0.8.2               tximport_1.16.1            
[16] DEGreport_1.24.1            pheatmap_1.0.12             RColorBrewer_1.1-2          forcats_0.5.0               stringr_1.4.0              
[21] dplyr_1.0.1                 purrr_0.3.4                 readr_1.3.1                 tidyr_1.1.1                 tibble_3.0.3               
[26] ggplot2_3.3.2               tidyverse_1.3.0             DESeq2_1.28.1               SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
[31] matrixStats_0.56.0          Biobase_2.48.0              GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2             
[36] S4Vectors_0.26.1            BiocGenerics_0.34.0         edgeR_3.30.3                limma_3.44.3               

loaded via a namespace (and not attached):
  [1] utf8_1.1.4                  lme4_1.1-23                 tidyselect_1.1.0            RSQLite_2.2.0               grid_4.0.1                 
  [6] BiocParallel_1.22.0         scatterpie_0.1.4            munsell_0.5.0               effectsize_0.3.2            statmod_1.4.34             
 [11] withr_2.2.0                 colorspace_1.4-1            GOSemSim_2.14.1             knitr_1.29                  rstudioapi_0.11            
 [16] emmeans_1.4.8               labeling_0.3                KEGGgraph_1.48.0            lasso2_1.2-20               urltools_1.7.3             
 [21] bbmle_1.0.23.1              GenomeInfoDbData_1.2.3      mnormt_2.0.1                polyclip_1.10-0             bit64_4.0.2                
 [26] farver_2.0.3                rprojroot_1.3-2             downloader_0.4              coda_0.19-3                 vctrs_0.3.2                
 [31] generics_0.0.2              xfun_0.16                   BiocFileCache_1.12.1        R6_2.4.1                    clue_0.3-57                
 [36] graphlayouts_0.7.0          locfit_1.5-9.4              gridGraphics_0.5-0          bitops_1.0-6                reshape_0.8.8              
 [41] fgsea_1.14.0                assertthat_0.2.1            scales_1.1.1                ggraph_2.0.3                enrichplot_1.8.1           
 [46] gtable_0.3.0                tidygraph_1.2.0             rlang_0.4.7                 genefilter_1.70.0           GlobalOptions_0.1.2        
 [51] splines_4.0.1               rtracklayer_1.48.0          europepmc_0.4               broom_0.7.0                 BiocManager_1.30.10        
 [56] yaml_2.2.1                  reshape2_1.4.4              modelr_0.1.8                backports_1.1.8             qvalue_2.20.0              
 [61] tools_4.0.1                 psych_2.0.7                 ggplotify_0.0.5             logging_0.10-108            ellipsis_0.3.1             
 [66] ggdendro_0.1-20             ggridges_0.5.2              Rcpp_1.0.5                  plyr_1.8.6                  progress_1.2.2             
 [71] zlibbioc_1.34.0             RCurl_1.98-1.2              prettyunits_1.1.1           openssl_1.4.2               GetoptLong_1.0.2           
 [76] cowplot_1.0.0               haven_2.3.1                 cluster_2.1.0               fs_1.5.0                    magrittr_1.5               
 [81] data.table_1.13.0           DO.db_2.9                   circlize_0.4.10             triebeard_0.3.0             reprex_0.3.0               
 [86] tmvnsim_1.0-2               mvtnorm_1.1-1               sjmisc_2.8.5                pkgload_1.1.0               hms_0.5.3                  
 [91] xtable_1.8-4                XML_3.99-0.5                sjstats_0.18.0              emdbook_1.3.12              readxl_1.3.1               
 [96] ggeffects_0.15.1            shape_1.4.4                 testthat_2.3.2              compiler_4.0.1              bdsmatrix_1.3-4            
[101] crayon_1.3.4                minqa_1.2.4                 geneplotter_1.66.0          lubridate_1.7.9             DBI_1.1.0                  
[106] sjlabelled_1.1.6            tweenr_1.0.1                dbplyr_1.4.4                ComplexHeatmap_2.4.3        rappdirs_0.3.1             
[111] boot_1.3-25                 Matrix_1.2-18               cli_2.0.2                   insight_0.9.0               igraph_1.2.5               
[116] pkgconfig_2.0.3             rvcheck_0.1.8               GenomicAlignments_1.24.0    numDeriv_2016.8-1.1         xml2_1.3.2                 
[121] annotate_1.66.0             XVector_0.28.0              estimability_1.3            rvest_0.3.6                 digest_0.6.25              
[126] parameters_0.8.2            ConsensusClusterPlus_1.52.0 graph_1.66.0                Biostrings_2.56.0           cellranger_1.1.0           
[131] fastmatch_1.1-0             curl_4.3                    Rsamtools_2.4.0             nloptr_1.2.2.2              rjson_0.2.20               
[136] lifecycle_0.2.0             nlme_3.1-148                jsonlite_1.7.0              desc_1.2.0                  askpass_1.1                
[141] fansi_0.4.1                 pillar_1.4.6                lattice_0.20-41             Nozzle.R1_1.1-1             KEGGREST_1.28.0            
[146] httr_1.4.2                  survival_3.2-3              GO.db_3.11.4                glue_1.4.1                  bayestestR_0.7.2           
[151] png_0.1-7                   bit_4.0.4                   Rgraphviz_2.32.0            performance_0.4.8           ggforce_0.3.2              
[156] stringi_1.4.6               blob_1.2.1                  org.Hs.eg.db_3.11.4         memoise_1.1.0  
deseq2 outliers cooks cooks distance • 594 views
ADD COMMENT
4
Entering edit mode
@mikelove
Last seen 1 day ago
United States

"Once again, I want more than 39 genes to be replaced."

The outlier replacement occurs in DESeq() but we don't have an option to make the replacement threshold lower. It is set at the recommended level based on sample size and number of covariates.

Instead cooksCutoff is an argument to results() that decides what counts as an outlier to set a p-value to NA.

"...some of my top hits are still clearly being driven by outliers"

How do the counts for these look? Could they be removed with simple filters, e.g. more than x number of samples with a count of 10 or more?

ADD COMMENT
0
Entering edit mode

Hi Michael,

Here is a count plot for one of my top DEG's (IGFBP5 or ENSSSCG00000025924): https://ibb.co/txFS1TZ

The comparison here is between the purple REPL group and the green RCNL group. The Cook's distance for the outlier sample is 1.36522020, whereas all other samples in the comparison are well below 0.08. Here is the code for finding the Cook's distances:

assays(dds)[["cooks"]][which(rownames(dds)=="ENSSSCG00000025924"),grep("RCNL",colnames(dds))]
     RCNL1      RCNL2      RCNL3      RCNL4      RCNL5      RCNL6 
1.36522020 0.07550182 0.02660505 0.07296530 0.05653308 0.04958655

assays(dds)[["cooks"]][which(rownames(dds)=="ENSSSCG00000025924"),grep("REPL",colnames(dds))]
         REPL1          REPL2          REPL3          REPL4          REPL5          REPL6 
0.000995423257 0.000007282814 0.112372588273 0.007455832517 0.007255683703 0.018162923801

The real problem is that setting cooksCutoff to 1.0 doesn't seem to change the results() output:

res_table_REPL_RCNL <- results(dds, contrast = contrast_REPL_RCNL, alpha = 0.05)
as.data.frame(res_table_REPL_RCNL)[which(rownames(dds)=="ENSSSCG00000025924"),]
                   baseMean log2FoldChange     lfcSE      stat       pvalue       padj
ENSSSCG00000025924 6577.533      -2.054216 0.5286512 -3.885767 0.0001020071 0.01633134

res_table_REPL_RCNL <- results(dds, contrast = contrast_REPL_RCNL, alpha = 0.05, cooksCutoff = 1.0)
                   baseMean log2FoldChange     lfcSE      stat       pvalue       padj
ENSSSCG00000025924 6577.533      -2.054216 0.5286512 -3.885767 0.0001020071 0.01633134

Shouldn't this gene be excluded from the results table because one of its Cook's distances is over the cutoff? I could remove this gene manually or with a count cutoff, but I figured results() would remove it based on one of the samples being over cooksCutoff... Am I doing something wrong with the results() function?

ADD REPLY
0
Entering edit mode

What do you have in mcols(dds)$maxCooks for this gene? I am surprised that the last line didn't NA the p-value.

ADD REPLY
0
Entering edit mode

mcols(dds)$maxCooks is NA for every gene (which includes the gene of interest). This might be the source of the issue. Any idea as to why this is?

ADD REPLY
2
Entering edit mode

Ah I am remembering the logic now. The outlier replacement procedure is as follows: filtering will only occur after replacement for groups that were too small for replacement. So filtering here would only be triggered for groups with 3-5 samples. So because the outlier is too small to hit the threshold for replacement it isn't considered for filtering. You could turn off replacement in DESeq() (minRep=Inf), which would allow such a gene to be filtered with results(). Or you could manually pull out these genes using your code above, scanning for large Cook's values.

I've never been satisfied with the outlier procedures for GLM. I am sympathetic to the nonparametric approaches, e.g. SAMseq, which are robust to data that doesn't fit the distribution (e.g. the above count is not really consistent with Negative Binomial).

ADD REPLY
0
Entering edit mode

Ah! That successfully filtered out the gene of interest. Thank you!

Per your last comment, are you suggesting that I utilize SAMseq instead of DESeq because some of my counts don't fit the Negative Binomial? I would say that my data largely does fit the NB distribution with the exception of a few genes.

ADD REPLY
0
Entering edit mode

Ah! That successfully filtered out the gene of interest. Thank you!

Per your last comment, are you suggesting that I utilize SAMseq instead of DESeq because some of my counts don't fit the Negative Binomial? I would say that my data largely does fit the NB distribution with the exception of a few genes.

ADD REPLY
1
Entering edit mode

Re: my last comment, no I think filtering can work. I made that comment more to express my frustration with outlier procedures, which require a lot of hands-on inspection to be trusted (this is ok, this is what you've done and you've found a setting that works through manual inspection), compared to SAMseq, which "just works".

ADD REPLY

Login before adding your answer.

Traffic: 201 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6