Strange Cook's Values with DESeq2
1
0
Entering edit mode
@bebeeb6c
Last seen 2.7 years ago
Germany

Hi, I'm currently trying to assess fold change when comparing two different sample types using DESeq2 package and I'm getting weird Cook's distance values which are causing major problems.

The two different samples have different amounts of replicates (6 replicates vs 5 replicates) which might be the reason for these weird results (since when i remove one sample the results are no longer "weird").

So, the results:

First of all, the condition I'm using is:

   name     Type
total_1 t_24
total_2 t_24
total_3 t_24
total_4 t_24
total_5 t_24
nuc_1   n_24
nuc_2   n_24
nuc_3   n_24
nuc_4   n_24
nuc_5   n_24
nuc_6   n_24

Since the total is the control, they need to be the first for the LFC to be positive when up-regulated in the test.

Now the code I'm using is

df_conditions <- data.frame(
  sample = df_conditions$name,
  condition = df_conditions$Type
)

# The data is only a fraction of the "data_combined" were c(c(9,10,11,12,13,14) corresponds to the nuclear fraction 
DEA_matrix <- DESeqDataSetFromMatrix(data_combined[c(9,10,11,12,13,14,23,24,25,26,27)],          
                                     df_conditions, 
                                     ~condition) 

DEA <- DESeq(DEA_matrix)
DEA_results <- results(DEA)

When i ran my sample I get around 350 genes with NAs (in both p-value and padj) which, after looking both online and on the manual means that there is a problem with cooks distance value. After checking these values for the genes using assays(DEA)[["cooks"]] i saw that, for example: For a given gene which has NAs i got these cook's distance values

    nuc_1     nuc_2     nuc_3     nuc_4     nuc_5     nuc_6  total_1  total_2  total_3  total_4  total_5 
1.479392e-04 1.548888e-02 1.557630e-04 2.008926e-01 1.012255e-01 2.222557e+01 5.412296e-01 1.156913e+00 9.553727e-01 1.107007e+00 1.146971e+00

Which indicates that nuc_6 is clearly the outlier but the normalized reads do not indicate that :

nuc_1   nuc_2   nuc_3   nuc_4   nuc_5   nuc_6   total_1 total_2 total_3 total_4 total_5
400      350    400     450      300     400      50     30      30      30     30

Does anyone have any idea as to why the cook's value is so weird and indicating an outlier when clearly there isnt one? Could it be the difference in the sample number? I've already tried inverting the order of the condition dataframe (put the nuclear first) but the results remain the same so I don't think its wrongly identifying the nuc_6 as part of the total.

Any ideas?

sessionInfo( )
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_1.0.0               scales_1.2.1                AnnotationDbi_1.60.0        enrichplot_1.18.3          
 [5] GGally_2.1.2                clusterProfiler_4.6.0       preprocessCore_1.60.0       DESeq2_1.38.1              
 [9] SummarizedExperiment_1.28.0 Biobase_2.58.0              MatrixGenerics_1.10.0       matrixStats_0.63.0         
[13] GenomicRanges_1.49.0        GenomeInfoDb_1.34.4         IRanges_2.32.0              S4Vectors_0.36.1           
[17] BiocGenerics_0.44.0         EnhancedVolcano_1.16.0      ggrepel_0.9.2               ggplot2_3.4.0              
[21] gplots_3.1.3                WGCNA_1.72-1                fastcluster_1.2.3           dynamicTreeCut_1.63-1      
[25] readxl_1.4.1               

loaded via a namespace (and not attached):
  [1] backports_1.4.1        shadowtext_0.1.2       Hmisc_4.7-2            fastmatch_1.1-3        plyr_1.8.8             igraph_1.3.5          
  [7] lazyeval_0.2.2         splines_4.2.2          BiocParallel_1.32.4    digest_0.6.30          htmltools_0.5.4        foreach_1.5.2         
 [13] yulab.utils_0.0.6      GOSemSim_2.24.0        viridis_0.6.2          GO.db_3.16.0           fansi_1.0.3            magrittr_2.0.3        
 [19] checkmate_2.1.0        memoise_2.0.1          cluster_2.1.4          doParallel_1.0.17      Biostrings_2.66.0      annotate_1.76.0       
 [25] graphlayouts_0.8.4     jpeg_0.1-10            colorspace_2.0-3       blob_1.2.3             xfun_0.35              dplyr_1.0.10          
 [31] crayon_1.5.2           RCurl_1.98-1.9         jsonlite_1.8.4         scatterpie_0.1.8       impute_1.72.3          survival_3.4-0        
 [37] iterators_1.0.14       ape_5.6-2              glue_1.6.2             polyclip_1.10-4        gtable_0.3.1           zlibbioc_1.44.0       
 [43] XVector_0.38.0         DelayedArray_0.23.2    DOSE_3.24.2            DBI_1.1.3              Rcpp_1.0.9             viridisLite_0.4.1     
 [49] xtable_1.8-4           htmlTable_2.4.1        gridGraphics_0.5-1     tidytree_0.4.2         foreign_0.8-83         bit_4.0.5             
 [55] Formula_1.2-4          htmlwidgets_1.6.1      httr_1.4.4             fgsea_1.24.0           RColorBrewer_1.1-3     reshape_0.8.9         
 [61] pkgconfig_2.0.3        XML_3.99-0.13          farver_2.1.1           nnet_7.3-18            deldir_1.0-6           locfit_1.5-9.6        
 [67] utf8_1.2.2             ggplotify_0.1.0        tidyselect_1.2.0       rlang_1.0.6            reshape2_1.4.4         munsell_0.5.0         
 [73] cellranger_1.1.0       tools_4.2.2            cachem_1.0.6           downloader_0.4         cli_3.4.1              generics_0.1.3        
 [79] RSQLite_2.2.19         gson_0.0.9             stringr_1.5.0          fastmap_1.1.0          ggtree_3.6.2           knitr_1.42            
 [85] bit64_4.0.5            tidygraph_1.2.2        caTools_1.18.2         purrr_0.3.5            KEGGREST_1.38.0        ggraph_2.1.0          
 [91] nlme_3.1-160           aplot_0.1.9            compiler_4.2.2         rstudioapi_0.14        png_0.1-8              treeio_1.22.0         
 [97] tibble_3.1.8           tweenr_2.0.2           geneplotter_1.76.0     stringi_1.7.8          lattice_0.20-45        Matrix_1.5-1          
[103] vctrs_0.5.1            pillar_1.8.1           lifecycle_1.0.3        data.table_1.14.6      cowplot_1.1.1          bitops_1.0-7          
[109] patchwork_1.1.2        qvalue_2.30.0          R6_2.5.1               latticeExtra_0.6-30    KernSmooth_2.23-20     gridExtra_2.3         
[115] codetools_0.2-18       gtools_3.9.4           MASS_7.3-58.1          assertthat_0.2.1       withr_2.5.0            GenomeInfoDbData_1.2.9
[121] parallel_4.2.2         grid_4.2.2             rpart_4.1.19           ggfun_0.0.9            tidyr_1.2.1            HDO.db_0.99.1         
[127] ggforce_0.4.1          base64enc_0.1-3        interp_1.1-3
DESeq2 • 623 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

Is this a two group analysis? What do you get with:

results(dds)
ADD COMMENT

Login before adding your answer.

Traffic: 695 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6