Hi, I'm currently trying to assess fold change when comparing two different sample types using DESeq2 package and I'm getting weird Cook's distance values which are causing major problems.
The two different samples have different amounts of replicates (6 replicates vs 5 replicates) which might be the reason for these weird results (since when i remove one sample the results are no longer "weird").
So, the results:
First of all, the condition I'm using is:
name Type
total_1 t_24
total_2 t_24
total_3 t_24
total_4 t_24
total_5 t_24
nuc_1 n_24
nuc_2 n_24
nuc_3 n_24
nuc_4 n_24
nuc_5 n_24
nuc_6 n_24
Since the total is the control, they need to be the first for the LFC to be positive when up-regulated in the test.
Now the code I'm using is
df_conditions <- data.frame(
sample = df_conditions$name,
condition = df_conditions$Type
)
# The data is only a fraction of the "data_combined" were c(c(9,10,11,12,13,14) corresponds to the nuclear fraction
DEA_matrix <- DESeqDataSetFromMatrix(data_combined[c(9,10,11,12,13,14,23,24,25,26,27)],
df_conditions,
~condition)
DEA <- DESeq(DEA_matrix)
DEA_results <- results(DEA)
When i ran my sample I get around 350 genes with NAs (in both p-value and padj) which, after looking both online and on the manual means that there is a problem with cooks distance value.
After checking these values for the genes using assays(DEA)[["cooks"]] i saw that, for example:
For a given gene which has NAs i got these cook's distance values
nuc_1 nuc_2 nuc_3 nuc_4 nuc_5 nuc_6 total_1 total_2 total_3 total_4 total_5
1.479392e-04 1.548888e-02 1.557630e-04 2.008926e-01 1.012255e-01 2.222557e+01 5.412296e-01 1.156913e+00 9.553727e-01 1.107007e+00 1.146971e+00
Which indicates that nuc_6 is clearly the outlier but the normalized reads do not indicate that :
nuc_1 nuc_2 nuc_3 nuc_4 nuc_5 nuc_6 total_1 total_2 total_3 total_4 total_5
400 350 400 450 300 400 50 30 30 30 30
Does anyone have any idea as to why the cook's value is so weird and indicating an outlier when clearly there isnt one? Could it be the difference in the sample number? I've already tried inverting the order of the condition dataframe (put the nuclear first) but the results remain the same so I don't think its wrongly identifying the nuc_6 as part of the total.
Any ideas?
sessionInfo( )
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_1.0.0 scales_1.2.1 AnnotationDbi_1.60.0 enrichplot_1.18.3
[5] GGally_2.1.2 clusterProfiler_4.6.0 preprocessCore_1.60.0 DESeq2_1.38.1
[9] SummarizedExperiment_1.28.0 Biobase_2.58.0 MatrixGenerics_1.10.0 matrixStats_0.63.0
[13] GenomicRanges_1.49.0 GenomeInfoDb_1.34.4 IRanges_2.32.0 S4Vectors_0.36.1
[17] BiocGenerics_0.44.0 EnhancedVolcano_1.16.0 ggrepel_0.9.2 ggplot2_3.4.0
[21] gplots_3.1.3 WGCNA_1.72-1 fastcluster_1.2.3 dynamicTreeCut_1.63-1
[25] readxl_1.4.1
loaded via a namespace (and not attached):
[1] backports_1.4.1 shadowtext_0.1.2 Hmisc_4.7-2 fastmatch_1.1-3 plyr_1.8.8 igraph_1.3.5
[7] lazyeval_0.2.2 splines_4.2.2 BiocParallel_1.32.4 digest_0.6.30 htmltools_0.5.4 foreach_1.5.2
[13] yulab.utils_0.0.6 GOSemSim_2.24.0 viridis_0.6.2 GO.db_3.16.0 fansi_1.0.3 magrittr_2.0.3
[19] checkmate_2.1.0 memoise_2.0.1 cluster_2.1.4 doParallel_1.0.17 Biostrings_2.66.0 annotate_1.76.0
[25] graphlayouts_0.8.4 jpeg_0.1-10 colorspace_2.0-3 blob_1.2.3 xfun_0.35 dplyr_1.0.10
[31] crayon_1.5.2 RCurl_1.98-1.9 jsonlite_1.8.4 scatterpie_0.1.8 impute_1.72.3 survival_3.4-0
[37] iterators_1.0.14 ape_5.6-2 glue_1.6.2 polyclip_1.10-4 gtable_0.3.1 zlibbioc_1.44.0
[43] XVector_0.38.0 DelayedArray_0.23.2 DOSE_3.24.2 DBI_1.1.3 Rcpp_1.0.9 viridisLite_0.4.1
[49] xtable_1.8-4 htmlTable_2.4.1 gridGraphics_0.5-1 tidytree_0.4.2 foreign_0.8-83 bit_4.0.5
[55] Formula_1.2-4 htmlwidgets_1.6.1 httr_1.4.4 fgsea_1.24.0 RColorBrewer_1.1-3 reshape_0.8.9
[61] pkgconfig_2.0.3 XML_3.99-0.13 farver_2.1.1 nnet_7.3-18 deldir_1.0-6 locfit_1.5-9.6
[67] utf8_1.2.2 ggplotify_0.1.0 tidyselect_1.2.0 rlang_1.0.6 reshape2_1.4.4 munsell_0.5.0
[73] cellranger_1.1.0 tools_4.2.2 cachem_1.0.6 downloader_0.4 cli_3.4.1 generics_0.1.3
[79] RSQLite_2.2.19 gson_0.0.9 stringr_1.5.0 fastmap_1.1.0 ggtree_3.6.2 knitr_1.42
[85] bit64_4.0.5 tidygraph_1.2.2 caTools_1.18.2 purrr_0.3.5 KEGGREST_1.38.0 ggraph_2.1.0
[91] nlme_3.1-160 aplot_0.1.9 compiler_4.2.2 rstudioapi_0.14 png_0.1-8 treeio_1.22.0
[97] tibble_3.1.8 tweenr_2.0.2 geneplotter_1.76.0 stringi_1.7.8 lattice_0.20-45 Matrix_1.5-1
[103] vctrs_0.5.1 pillar_1.8.1 lifecycle_1.0.3 data.table_1.14.6 cowplot_1.1.1 bitops_1.0-7
[109] patchwork_1.1.2 qvalue_2.30.0 R6_2.5.1 latticeExtra_0.6-30 KernSmooth_2.23-20 gridExtra_2.3
[115] codetools_0.2-18 gtools_3.9.4 MASS_7.3-58.1 assertthat_0.2.1 withr_2.5.0 GenomeInfoDbData_1.2.9
[121] parallel_4.2.2 grid_4.2.2 rpart_4.1.19 ggfun_0.0.9 tidyr_1.2.1 HDO.db_0.99.1
[127] ggforce_0.4.1 base64enc_0.1-3 interp_1.1-3
