Question

Low Counts in top DE genes in DESeq2

1

Entering edit mode

rrcutler ▴ 70

@rrcutler-10667

Last seen 4.7 years ago

Hello all,

I am experiencing strange results from the output of DESeq2. I am getting many genes that have low counts in the conditions I am comparing. This is one striking example where a gene that is significantly differentially expressed has 0 counts in both conditions! I have also seen cases where one of the replicates has a very large amount of counts compared to the others where I would expect this to be an outlier, but is not.

I have read over these posts where there seems to be a similar situation:

DESeq2 : DE genes witth high log2FC and few or null counts in some samples

too many genes with low read-counts as differential expressed in DESeq2

High number of significant DE genes with low baseMean, is this abnormal?

Here is my command history:

# condition
cond = rep(c("18_11C", "18_11R"), each = 5)
cond = c(cond, rep(c("18_12C"), each = 4))
cond = c(cond, rep(c("18_12R", "18_Sib"), each = 5))

sTable = data.frame(sampleName = sampNum, fileName = files, condition = cond)

dds18 <- DESeqDataSetFromHTSeqCount(sampleTable = sTable, directory = "", design = ~condition)

dds18$condition <- relevel(dds18$condition, ref="18_Sib")

dds18 <- dds18[rowMeans(counts(dds18)) >= 5,] # row average is at least 5 counts

dds18 <- DESeq(dds18)

res <- results(dds18, alpha = 0.05, contrast = c("condition", "18_12C", "18_Sib"))

# determine if Xelaev18000166m|Xelaev18000166m is in the DEG (padj < 0.05)

resOrdered <- res[order(res$padj),]

resOrdered <- resOrdered[!is.na(resOrdered$padj),]

resOrderedSDE <- resOrdered[resOrdered$padj < 0.05,]

any(row.names(resOrderedSDE) == "Xelaev18000166m|Xelaev18000166m")
[1] TRUE

plotCounts(dds18, "Xelaev18000166m|Xelaev18000166m", intgroup = "condition")

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] genefilter_1.58.1          FactoMineR_1.39            factoextra_1.0.5           ggrepel_0.7.0              ggplot2_2.2.1.9000         hexbin_1.27.1              vsn_3.44.0                
 [8] affy_1.54.0                RColorBrewer_1.1-2         edgeR_3.18.1               limma_3.32.10              DESeq2_1.16.1              SummarizedExperiment_1.6.5 DelayedArray_0.2.7        
[15] matrixStats_0.52.2         Biobase_2.36.2             GenomicRanges_1.28.6       GenomeInfoDb_1.12.3        IRanges_2.10.5             S4Vectors_0.14.7           BiocGenerics_0.22.1       

loaded via a namespace (and not attached):
 [1] bitops_1.0-6            bit64_0.9-7             rprojroot_1.3-2         tools_3.4.0             backports_1.1.2         R6_2.2.2                affyio_1.46.0           rpart_4.1-11           
 [9] KernSmooth_2.23-15      Hmisc_4.1-1             DBI_0.7                 lazyeval_0.2.1          colorspace_1.3-2        nnet_7.3-12             gridExtra_2.3           bit_1.1-12             
[17] compiler_3.4.0          preprocessCore_1.38.1   htmlTable_1.11.1        flashClust_1.01-2       labeling_0.3            scales_0.5.0            checkmate_1.8.5         stringr_1.2.0          
[25] digest_0.6.13           foreign_0.8-69          rmarkdown_1.8           XVector_0.16.0          pkgconfig_2.0.1         base64enc_0.1-3         htmltools_0.3.6         htmlwidgets_0.9        
[33] rlang_0.1.6             rstudioapi_0.7          RSQLite_2.0             BiocInstaller_1.26.1    bindr_0.1               BiocParallel_1.10.1     acepack_1.4.1           dplyr_0.7.4            
[41] RCurl_1.95-4.10         magrittr_1.5            GenomeInfoDbData_0.99.0 Formula_1.2-2           leaps_3.0               Matrix_1.2-12           Rcpp_0.12.14            munsell_0.4.3          
[49] scatterplot3d_0.3-40    stringi_1.1.6           yaml_2.1.16             MASS_7.3-48             zlibbioc_1.22.0         plyr_1.8.4              grid_3.4.0              blob_1.1.0             
[57] lattice_0.20-35         splines_3.4.0           annotate_1.54.0         locfit_1.5-9.1          knitr_1.18              pillar_1.0.1            ggpubr_0.1.6            geneplotter_1.54.0     
[65] glue_1.2.0              XML_3.98-1.9            evaluate_0.10.1         latticeExtra_0.6-28     data.table_1.10.4-3     purrr_0.2.4             gtable_0.2.0            assertthat_0.2.0       
[73] xtable_1.8-2            survival_2.41-3         tibble_1.4.1            AnnotationDbi_1.38.2    memoise_1.1.0           bindrcpp_0.2            cluster_2.0.6

deseq2 • 2.2k views

ADD COMMENT • link updated 6.8 years ago by Michael Love 42k • written 6.8 years ago by rrcutler ▴ 70

score 0 · Answer 1 · 2018-01-28

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 8 hours ago

United States

Can you use the LRT here instead and report your results? The Wald statistic seems to have this issue with count outliers (which nevertheless are not strong enough to trigger outlier procedures) and the multigroup designs. The LRT p-values should perform better. I’m focused now on developing new shrinkage estimators which would shrink these to zero, so they wouldn’t appear in the results table.

ADD COMMENT • link 6.8 years ago Michael Love 42k

0

Entering edit mode

Hi Michael,

Using the LRT I was only able to load my samples of interest (can't take advantage of dispersion shrinkage). The results of using a LRT with the same comparison in my first post has 1598 DE genes while the first only had 385 DE genes. Although, glancing over the count values, the LRT test seems to have eliminated many of the low counts. I am still concerned about the difference between the Wald and LRT.

-R

ADD REPLY • link 6.8 years ago rrcutler ▴ 70

0

Entering edit mode

I wouldn't recommend the Wald here, and so don't compare the DE gene counts to it. Just use the LRT. The LRT also shrinks the dispersion estimates.

ADD REPLY • link 6.8 years ago Michael Love 42k

0

Entering edit mode

Just to clarify, the reason the Wald shouldn't be used here is because there are many groups which suppresses count outliers?

ADD REPLY • link 6.8 years ago rrcutler ▴ 70

0

Entering edit mode

Many groups, count outliers but outlier filtering isn’t triggered because they aren’t high enough I suppose.

ADD REPLY • link 6.8 years ago Michael Love 42k

0

Entering edit mode

Hi Michael,

I've encountered the same issue rrcutler described above. I do not think the LRT will be a feasible workaround for me. Are there other alternatives that have been developed in the few years since this was originally posted?

As another workaround, what do you think about assigning NAs in the results table to any gene that doesn't meet a minimum sum of counts within just the samples that are part of the contrast. Usually I filter genes by using a global counts cutoff prior to analysis, but I think in this case, some of the genes that have many counts across the data set (and thus pass a global filter) turn out to not have very many (or any) counts within the samples being contrasted. In my case, the full data set includes many different tissues, and this issue seems to arise when a gene is expressed highly in some of them (e.g., lung and kidney), but is expressed at low levels or not at all in a tissue within which I am trying to do a contrast (e.g., blood cell type a vs blood cell type b).

Thanks!

ADD REPLY • link 4.1 years ago penny.lane • 0

0

Entering edit mode

That is perfectly fine as a post-hoc filter. Additionally you can use lfcShrink() which I have found in nearly every case will shrink these LFC to very small values, so you can simply filter out any DE genes with a small moderated LFC (you can just filter on the absolute value of the shrunken LFC, or you can use lfcThreshold in that function which will produce s-values for that threshold, see vignette for details).

ADD REPLY • link 4.1 years ago Michael Love 42k