I've tried to find whether anyone else has seen this, but nothing I've found describes what I'm seeing. I have a pretty simple/small data set: 6 groups, 3 reps/group (2 control and 4 experimental). Reads were mapped to a Trinity assembly with Bowtie2, and the expected counts generated by RSEM (gene results) were used to build the count matrix.
I like to bring over the normalized counts and group-average FPKMs (calculated from the RSEM FPKMs) to use for downstream filtering and decisions about which genes to pursue. In this dataset there are usually one or two genes with a high log2 fold change and low FDR, but with little or no read evidence. I'd like to know why this is happening and how to prevent it. Perhaps there is something to add to my code that will prevent it? I have run the analysis with and without pre-filtering the dds and I see the same thing. I have also checked (several times) the import of the normalized count and average FPKM values, so there is no mistake there.
Finally, exploratory analysis of this dataset shows a HUGE amount of variance, and the PCA looks terrible... so there's that.
If I'm missing something or if anyone has ideas or suggestions as to how to mitigate this, I'd love to know.
My core DESeq2 code is shown below these outputs.
| gene_id | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | CTL_FPKM_AVG | PA_FPKM_AVG | CTL1 | CTL2 | CTL3 | PA1 | PA2 | PA3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRINITY_DN79696_c1_g4 | 43.9 | 28.7 | 5.3 | 5.4 | 5.31E-08 | 2.08E-04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| TRINITY_DN78128_c3_g1 | 260.8 | 25.8 | 4.1 | 6.3 | 3.47E-10 | 1.92E-06 | 30.7 | 0.0 | 673.2 | 643.0 | 886.2 | 0.0 | 0.0 | 0.0 |
| gene_id | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | CTL_FPKM_AVG | KS_FPKM_AVG | CTL1 | CTL2 | CTL3 | KS1 | KS2 | KS3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRINITY_DN71439_c1_g5 | 32.6 | 29.0 | 5.3 | 5.5 | 3.86E-08 | 1.10E-04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| TRINITY_DN68768_c2_g1 | 19.7 | 26.8 | 5.3 | 5.1 | 3.69E-07 | 7.17E-04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| TRINITY_DN85457_c12_g6 | 22.3 | 25.0 | 5.3 | 4.8 | 2.02E-06 | 2.62E-03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| TRINITY_DN69591_c4_g1 | 137.2 | 24.5 | 5.1 | 4.8 | 1.75E-06 | 2.41E-03 | 30.5 | 0.0 | 468.2 | 0.0 | 578.2 | 0.0 | 0.0 | 0.0 |
DESeq2 code is as follows:
library(DESeq2)

# RSEM expected counts are not integers, so round and coerce for DESeq2
COUNTS <- as.matrix(read.table(file="count_matrix.txt", sep="\t", header=TRUE, row.names=1))
COUNTS_rd <- round(COUNTS)
mode(COUNTS_rd) <- "integer"

# Sample metadata; the grouping factor is 'Comparison'
META <- read.table(file="metadata.txt", sep="\t", header=TRUE, row.names=1)
META
| sample_id | Comparison |
| --- | --- |
| CTL1 | CTL |
| CTL2 | CTL |
| CTL3 | CTL |
| CTLAG1 | CTLAG |
| CTLAG2 | CTLAG |
| CTLAG3 | CTLAG |
| PA1 | PA |
| PA2 | PA |
| PA3 | PA |
| PAAG1 | PAAG |
| PAAG2 | PAAG |
| PAAG3 | PAAG |
| KS1 | KS |
| KS2 | KS |
| KS3 | KS |
| KSAG1 | KSAG |
| KSAG2 | KSAG |
| KSAG3 | KSAG |
dds <- DESeqDataSetFromMatrix(countData=COUNTS_rd, colData=META, design=~Comparison)
nrow(dds) # 136177 rows
dds <- estimateSizeFactors(dds) # size factors are needed before requesting normalized counts
idx <- rowSums(counts(dds, normalized=TRUE) >= 5) >= 3 # pre-filter: normalized count >= 5 in at least 3 samples
dds <- dds[idx, ]
nrow(dds) # 74283 rows
dds <- DESeq(dds)
resultsNames(dds)
counts_norm <- counts(dds, normalized=TRUE)
write.table(counts_norm, file="normed_counts.txt", row.names=TRUE, col.names=TRUE, quote=FALSE, sep="\t")
res1 <- results(dds, contrast=c("Comparison","CTL","PA")) # CTL vs PA, the first of four comparisons
Not sure if it's needed, but here's my session info:
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] vsn_3.44.0 ggplot2_2.2.1
[3] dplyr_0.7.2 amap_0.8-14
[5] ape_4.1 genefilter_1.58.1
[7] PoiClaClu_1.0.2 RColorBrewer_1.1-2
[9] pheatmap_1.0.8 DESeq2_1.16.1
[11] SummarizedExperiment_1.6.3 DelayedArray_0.2.7
[13] matrixStats_0.52.2 Biobase_2.36.2
[15] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2
[17] IRanges_2.10.2 S4Vectors_0.14.3
[19] BiocGenerics_0.22.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.4.2 Formula_1.2-2
[4] assertthat_0.2.0 affy_1.54.0 latticeExtra_0.6-28
[7] blob_1.1.0 GenomeInfoDbData_0.99.0 RSQLite_2.0
[10] backports_1.1.0 lattice_0.20-35 limma_3.32.3
[13] glue_1.1.1 digest_0.6.12 XVector_0.16.0
[16] checkmate_1.8.3 colorspace_1.3-2 preprocessCore_1.38.1
[19] htmltools_0.3.6 Matrix_1.2-10 plyr_1.8.4
[22] pkgconfig_2.0.1 XML_3.98-1.9 zlibbioc_1.22.0
[25] xtable_1.8-2 scales_0.4.1 affyio_1.46.0
[28] BiocParallel_1.10.1 htmlTable_1.9 tibble_1.3.3
[31] annotate_1.54.0 nnet_7.3-12 lazyeval_0.2.0
[34] survival_2.41-3 magrittr_1.5 memoise_1.1.0
[37] nlme_3.1-131 foreign_0.8-69 BiocInstaller_1.26.1
[40] tools_3.4.2 data.table_1.10.4 stringr_1.2.0
[43] munsell_0.4.3 locfit_1.5-9.1 cluster_2.0.6
[46] AnnotationDbi_1.38.1 bindrcpp_0.2 compiler_3.4.2
[49] rlang_0.1.1 grid_3.4.2 RCurl_1.95-4.8
[52] htmlwidgets_0.9 labeling_0.3 bitops_1.0-6
[55] base64enc_0.1-3 gtable_0.2.0 DBI_0.7
[58] R6_2.2.2 gridExtra_2.2.1 knitr_1.16
[61] bit_1.1-12 bindr_0.1 Hmisc_4.0-3
[64] stringi_1.1.5 Rcpp_0.12.12 geneplotter_1.54.0
[67] rpart_4.1-11 acepack_1.4.1
Thanks, Michael. I just sent you an e-mail with a Dropbox link to the data.
I took a look, and for these genes there are large counts combined with 0s within a single group. That is enough of a problem for the negative binomial fit, and there are not enough samples to allow the outlier-removal techniques to work. The bad fit is throwing off the Wald statistic.
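One quick way to see this for yourself is to plot the per-sample counts for one of the affected genes (plotCounts is a standard DESeq2 helper; the gene id here is just one example taken from the tables above):

plotCounts(dds, gene="TRINITY_DN78128_c3_g1", intgroup="Comparison") # normalized counts per sample, grouped by Comparison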
The problem can be avoided here if you use the likelihood ratio test (LRT) instead of the Wald test.
So you can run the following code one time, then re-use 'dds' for multiple comparisons:
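A minimal sketch of that one-time setup (these are the standard DESeq2 estimation steps; the nbinomLRT calls below re-use the dispersions estimated here):

dds <- estimateSizeFactors(dds) # library-size normalization, run once
dds <- estimateDispersions(dds) # gene-wise dispersion estimates, run once and re-used for every LRT below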
Then, for example, to contrast PA vs CTL, you would use the following code:
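A minimal sketch of one way to do this, using nbinomLRT with user-supplied model matrices: with CTL as the reference level, dropping the ComparisonPA column from the full model matrix makes the LRT test exactly the PA vs CTL coefficient (column names follow model.matrix's factor-name/level convention; check resultsNames for the exact names in your object):

dds$Comparison <- relevel(dds$Comparison, ref="CTL") # CTL as the reference level
full <- model.matrix(~ Comparison, as.data.frame(colData(dds))) # full design matrix
reduced <- full[, colnames(full) != "ComparisonPA"] # drop the PA vs CTL term
dds <- nbinomLRT(dds, full=full, reduced=reduced) # LRT in place of the Wald test; dispersions are re-used
resultsNames(dds) # confirm the coefficient names
res1 <- results(dds, name="ComparisonPA") # LRT p-values with the PA vs CTL LFC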
The LFC will still be large, but the p-values will be appropriate.
For comparing against CTLAG, you just re-level (you don't need to re-run dispersion estimation):
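Again as a sketch, the same pattern with CTLAG re-leveled to be the reference (here testing, for example, PAAG vs CTLAG):

dds$Comparison <- relevel(dds$Comparison, ref="CTLAG") # new reference level
full <- model.matrix(~ Comparison, as.data.frame(colData(dds)))
reduced <- full[, colnames(full) != "ComparisonPAAG"] # drop the PAAG vs CTLAG term
dds <- nbinomLRT(dds, full=full, reduced=reduced) # only the GLM fits are redone, not the dispersions
res2 <- results(dds, name="ComparisonPAAG")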
Thank you for the code; I'll try it.
So, if I understand what you're saying (maybe not): the first gene in my example has other groups in the matrix (not being compared) with normalized count values of 0/1/300, 0/57/0, and 0/133/1. Is it these groups, and their lack of sufficient replication (is this impacting or preventing proper Cook's filtering?), that force a bad fit for this particular gene in the compared groups? And when I see this occur, should I handle the affected comparisons independently/differently as above and re-level for all the others?
The mix of zeros and large counts within groups, together with the small replicate number, means I would use the LRT for all comparisons in this dataset. I would also recommend the LRT for similarly noisy datasets with small replicate numbers in the future.