Hello all,
I am experiencing strange results from the output of DESeq2. I am getting many genes that have low counts in the conditions I am comparing. This is one striking example where a gene that is significantly differentially expressed has 0 counts in both conditions! I have also seen cases where one of the replicates has a very large amount of counts compared to the others where I would expect this to be an outlier, but is not.
I have read over these posts where there seems to be a similar situation:
DESeq2 : DE genes witth high log2FC and few or null counts in some samples
too many genes with low read-counts as differential expressed in DESeq2
High number of significant DE genes with low baseMean, is this abnormal?
Here is my command history:
# condition cond = rep(c("18_11C", "18_11R"), each = 5) cond = c(cond, rep(c("18_12C"), each = 4)) cond = c(cond, rep(c("18_12R", "18_Sib"), each = 5)) sTable = data.frame(sampleName = sampNum, fileName = files, condition = cond) dds18 <- DESeqDataSetFromHTSeqCount(sampleTable = sTable, directory = "", design = ~condition) dds18$condition <- relevel(dds18$condition, ref="18_Sib") dds18 <- dds18[rowMeans(counts(dds18)) >= 5,] # row average is at least 5 counts dds18 <- DESeq(dds18) res <- results(dds18, alpha = 0.05, contrast = c("condition", "18_12C", "18_Sib")) # determine if Xelaev18000166m|Xelaev18000166m is in the DEG (padj < 0.05) resOrdered <- res[order(res$padj),] resOrdered <- resOrdered[!is.na(resOrdered$padj),] resOrderedSDE <- resOrdered[resOrdered$padj < 0.05,] any(row.names(resOrderedSDE) == "Xelaev18000166m|Xelaev18000166m") [1] TRUE plotCounts(dds18, "Xelaev18000166m|Xelaev18000166m", intgroup = "condition")
sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: OS X El Capitan 10.11.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] genefilter_1.58.1 FactoMineR_1.39 factoextra_1.0.5 ggrepel_0.7.0 ggplot2_2.2.1.9000 hexbin_1.27.1 vsn_3.44.0 [8] affy_1.54.0 RColorBrewer_1.1-2 edgeR_3.18.1 limma_3.32.10 DESeq2_1.16.1 SummarizedExperiment_1.6.5 DelayedArray_0.2.7 [15] matrixStats_0.52.2 Biobase_2.36.2 GenomicRanges_1.28.6 GenomeInfoDb_1.12.3 IRanges_2.10.5 S4Vectors_0.14.7 BiocGenerics_0.22.1 loaded via a namespace (and not attached): [1] bitops_1.0-6 bit64_0.9-7 rprojroot_1.3-2 tools_3.4.0 backports_1.1.2 R6_2.2.2 affyio_1.46.0 rpart_4.1-11 [9] KernSmooth_2.23-15 Hmisc_4.1-1 DBI_0.7 lazyeval_0.2.1 colorspace_1.3-2 nnet_7.3-12 gridExtra_2.3 bit_1.1-12 [17] compiler_3.4.0 preprocessCore_1.38.1 htmlTable_1.11.1 flashClust_1.01-2 labeling_0.3 scales_0.5.0 checkmate_1.8.5 stringr_1.2.0 [25] digest_0.6.13 foreign_0.8-69 rmarkdown_1.8 XVector_0.16.0 pkgconfig_2.0.1 base64enc_0.1-3 htmltools_0.3.6 htmlwidgets_0.9 [33] rlang_0.1.6 rstudioapi_0.7 RSQLite_2.0 BiocInstaller_1.26.1 bindr_0.1 BiocParallel_1.10.1 acepack_1.4.1 dplyr_0.7.4 [41] RCurl_1.95-4.10 magrittr_1.5 GenomeInfoDbData_0.99.0 Formula_1.2-2 leaps_3.0 Matrix_1.2-12 Rcpp_0.12.14 munsell_0.4.3 [49] scatterplot3d_0.3-40 stringi_1.1.6 yaml_2.1.16 MASS_7.3-48 zlibbioc_1.22.0 plyr_1.8.4 grid_3.4.0 blob_1.1.0 [57] lattice_0.20-35 splines_3.4.0 annotate_1.54.0 locfit_1.5-9.1 knitr_1.18 pillar_1.0.1 ggpubr_0.1.6 geneplotter_1.54.0 [65] glue_1.2.0 XML_3.98-1.9 evaluate_0.10.1 latticeExtra_0.6-28 data.table_1.10.4-3 purrr_0.2.4 gtable_0.2.0 assertthat_0.2.0 [73] xtable_1.8-2 survival_2.41-3 tibble_1.4.1 AnnotationDbi_1.38.2 memoise_1.1.0 bindrcpp_0.2 cluster_2.0.6
Hi Michael,
Using the LRT I was only able to load my samples of interest (can't take advantage of dispersion shrinkage). The results of using a LRT with the same comparison in my first post has 1598 DE genes while the first only had 385 DE genes. Although, glancing over the count values, the LRT test seems to have eliminated many of the low counts. I am still concerned about the difference between the Wald and LRT.
-R
I wouldn't recommend the Wald here, and so don't compare the DE gene counts to it. Just use the LRT. The LRT also shrinks the dispersion estimates.
Just to clarify, the reason the Wald shouldn't be used here is because there are many groups which suppresses count outliers?
Many groups, count outliers but outlier filtering isn’t triggered because they aren’t high enough I suppose.
Hi Michael,
I've encountered the same issue rrcutler described above. I do not think the LRT will be a feasible workaround for me. Are there other alternatives that have been developed in the few years since this was originally posted?
As another workaround, what do you think about assigning NAs in the results table to any gene that doesn't meet a minimum sum of counts within just the samples that are part of the contrast. Usually I filter genes by using a global counts cutoff prior to analysis, but I think in this case, some of the genes that have many counts across the data set (and thus pass a global filter) turn out to not have very many (or any) counts within the samples being contrasted. In my case, the full data set includes many different tissues, and this issue seems to arise when a gene is expressed highly in some of them (e.g., lung and kidney), but is expressed at low levels or not at all in a tissue within which I am trying to do a contrast (e.g., blood cell type a vs blood cell type b).
Thanks!
That is perfectly fine as a post-hoc filter. Additionally you can use
lfcShrink()
which I have found in nearly every case will shrink these LFC to very small values, so you can simply filter out any DE genes with a small moderated LFC (you can just filter on the absolute value of the shrunken LFC, or you can uselfcThreshold
in that function which will produce s-values for that threshold, see vignette for details).