Hi, I am analysing my gene expression data with DESeq2 (v1.16.1).
I have 12 samples in each of the two groups I am comparing.
Some genes have been set to 0/NA by independent filtering, however my top 2 most differentially expressed genes have read counts of 0 in every sample except one (where the raw read count is 6 in each case).
I don't understand why these have not been filtered out. Please could you advise?
Complete code and session info below:
> rm(list=ls(all=TRUE))
> library(DESeq2)
> directory<-'~/Google Drive/R_projects/doxy_deseq2/T1_A0vA1'
> sampleFiles <- grep("R-",list.files(directory),value=TRUE)
> sampleCondition <- gsub("R.*","",sampleFiles)
> sampleTable <- data.frame(sampleName = sampleFiles,fileName = sampleFiles,condition = sampleCondition)
> head(sampleTable)
sampleName fileName condition
1 A_0_T1_R-1-0303.txt A_0_T1_R-1-0303.txt A_0_T1_
2 A_0_T1_R-1-0349.txt A_0_T1_R-1-0349.txt A_0_T1_
3 A_0_T1_R-1-0376.txt A_0_T1_R-1-0376.txt A_0_T1_
4 A_0_T1_R-1-0391.txt A_0_T1_R-1-0391.txt A_0_T1_
5 A_0_T1_R-1-0456.txt A_0_T1_R-1-0456.txt A_0_T1_
6 A_0_T1_R-1-0464.txt A_0_T1_R-1-0464.txt A_0_T1_
> ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design= ~ condition)
> dds <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
> dds$condition <- relevel(dds$condition, ref="A_0_T1_")
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 543 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
> res <- results(dds)
> resOrdered <- res[order(res$padj),]
> resOrdered
log2 fold change (MLE): condition A 1 T1 vs A 0 T1
Wald test p-value: condition A 1 T1 vs A 0 T1
DataFrame with 26689 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000224721 1.024156 17.9432461 2.6726772 6.713585 1.898996e-11 2.523576e-07
ENSG00000257986 1.024156 17.9432461 2.6726772 6.713585 1.898996e-11 2.523576e-07
ENSG00000196159 4.798811 5.8913645 1.1092384 5.311180 1.089179e-07 9.649402e-04
ENSG00000179820 991.787350 1.2533887 0.2704511 4.634437 3.579100e-06 2.378133e-02
ENSG00000231389 379.860911 -0.7549993 0.1720601 -4.387998 1.143988e-05 6.080983e-02
... ... ... ... ... ... ...
ENSG00000278153 0 0 0 0 1 NA
ENSG00000278642 0 0 0 0 1 NA
ENSG00000278840 0 0 0 0 1 NA
ENSG00000279047 0 0 0 0 1 NA
ENSG00000280365 0 0 0 0 1 NA
> plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
> plotCounts(dds, gene="ENSG00000257986", intgroup="condition")
> use <- res$baseMean > metadata(res)$filterThreshold
> h1 <- hist(res$pvalue[!use], breaks=0:50/50, plot=FALSE)
> h2 <- hist(res$pvalue[use], breaks=0:50/50, plot=FALSE)
> colori <- c(`do not pass`="khaki", `pass`="powderblue")
> barplot(height = rbind(h1$counts, h2$counts), beside = FALSE,
+ col = colori, space = 0, main = "", ylab="frequency")
> text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)),
+ adj = c(0.5,1.7), xpd=NA)
> legend("topleft", fill=rev(colori), legend=rev(names(colori)))
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] DESeq2_1.16.1 BiocInstaller_1.26.0 SummarizedExperiment_1.6.1
[4] DelayedArray_0.2.2 matrixStats_0.52.2 Biobase_2.36.2
[7] GenomicRanges_1.28.1 GenomeInfoDb_1.12.0 IRanges_2.10.0
[10] S4Vectors_0.14.0 BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] genefilter_1.58.1 locfit_1.5-9.1 splines_3.4.0
[4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6
[7] base64enc_0.1-3 survival_2.41-3 XML_3.98-1.7
[10] foreign_0.8-68 DBI_0.6-1 BiocParallel_1.10.1
[13] RColorBrewer_1.1-2 GenomeInfoDbData_0.99.0 plyr_1.8.4
[16] stringr_1.2.0 zlibbioc_1.22.0 munsell_0.4.3
[19] gtable_0.2.0 htmlwidgets_0.8 memoise_1.1.0
[22] latticeExtra_0.6-28 knitr_1.15.1 geneplotter_1.54.0
[25] AnnotationDbi_1.38.0 htmlTable_1.9 Rcpp_0.12.10
[28] acepack_1.4.1 xtable_1.8-2 scales_0.4.1
[31] backports_1.0.5 checkmate_1.8.2 Hmisc_4.0-3
[34] annotate_1.54.0 XVector_0.16.0 gridExtra_2.2.1
[37] ggplot2_2.2.1 digest_0.6.12 stringi_1.1.5
[40] grid_3.4.0 tools_3.4.0 bitops_1.0-6
[43] magrittr_1.5 RSQLite_1.1-2 lazyeval_0.2.0
[46] RCurl_1.95-4.8 tibble_1.3.0 Formula_1.2-1
[49] cluster_2.0.6 Matrix_1.2-10 data.table_1.10.4
[52] rpart_4.1-11 nnet_7.3-12 compiler_3.4.0
> as.data.frame(colData(dds))
condition sizeFactor replaceable
A_0_T1_R-1-0303.txt A_0_T1_ 0.4740729 TRUE
A_0_T1_R-1-0349.txt A_0_T1_ 1.0197968 TRUE
A_0_T1_R-1-0376.txt A_0_T1_ 1.4085827 TRUE
A_0_T1_R-1-0391.txt A_0_T1_ 1.3654126 TRUE
A_0_T1_R-1-0456.txt A_0_T1_ 1.2789439 TRUE
A_0_T1_R-1-0464.txt A_0_T1_ 1.4638590 TRUE
A_0_T1_R-1-0538.txt A_0_T1_ 1.3184049 TRUE
A_0_T1_R-1-0567.txt A_0_T1_ 1.3049172 TRUE
A_0_T1_R-1-0715.txt A_0_T1_ 0.7848035 TRUE
A_0_T1_R-1-0818.txt A_0_T1_ 0.8989433 TRUE
A_0_T1_R-1-0877.txt A_0_T1_ 1.7172302 TRUE
A_0_T1_R-1-0951.txt A_0_T1_ 1.3992738 TRUE
A_1_T1_R-1-0088.txt A_1_T1_ 0.2441034 TRUE
A_1_T1_R-1-0121.txt A_1_T1_ 1.4030008 TRUE
A_1_T1_R-1-0156.txt A_1_T1_ 1.1589636 TRUE
A_1_T1_R-1-0162.txt A_1_T1_ 1.3313694 TRUE
A_1_T1_R-1-0179.txt A_1_T1_ 0.5823990 TRUE
A_1_T1_R-1-0189.txt A_1_T1_ 0.9231308 TRUE
A_1_T1_R-1-0193.txt A_1_T1_ 0.8648280 TRUE
A_1_T1_R-1-0203.txt A_1_T1_ 1.1520682 TRUE
A_1_T1_R-1-0329.txt A_1_T1_ 0.9307242 TRUE
A_1_T1_R-1-0399.txt A_1_T1_ 1.1902978 TRUE
A_1_T1_R-1-0448.txt A_1_T1_ 0.9559443 TRUE
A_1_T1_R-1-0451.txt A_1_T1_ 1.3852975 TRUE
Links to plots:
Histogram of P values:
https://www.dropbox.com/s/e8rda8qo6a0dlnz/T1_A0vA1.jpeg?dl=0
Plot counts for top 2 DE genes:
https://www.dropbox.com/s/t4ra5ry0c7lt9xu/T1_A0vA1%20721%20plot.jpeg?dl=0 https://www.dropbox.com/s/wfew64ydvtgn5tt/T1_A0vA1%20986%20plot.jpeg?dl=0
Thank you!

Dear Michael,
Thanks very much for your quick response.
Before analysing these data myself in R I used an online platform to perform the same analysis, which uses DESeq2 behind the scenes. The results from the online platform were quite different and the two top hits from my analysis were set to 0/NA.
I got the script from the online platform and the commands were the same. Hence why I was confused that in the same dataset, with the same commands, these genes were filtered out in one analysis but not the other.
Would different versions of DESeq2 result in these differences?
If so is it possible to install an older version of DESeq2 that might be more stringent on letting these genes through?
Thanks again
Possibly. I would just recommend using the latest version of DESeq2 (1.16) and using the filtering step I recommended below.