I started this question over at Biostars (https://www.biostars.org/p/341052/) , and I did get help and applied some suggestions from the community over there. However, I still have some remaining doubts, and maybe I should have come here first since it seems it is indeed an issue of how I'm using DESeq2 rather than previous points in the RNAseq analysis.
Quick summary: We are working on a non-model organism, with no genome or transcriptome databases. We are analyzing 4 tissues under 2 treatments, each one in triplicate. The pipeline goes as such: trinity -> salmon -> tximport -> deseq2. We want to use a common pipeline for both transcript and 'gene' level, so that's the reason behind the choice of DESeq2. Below is creating the object for transcript level comparisons.
txi.salmon <- tximport(files, type= 'salmon', countsFromAbundance= 'scaledTPM', txOut=TRUE, dropInfReps=TRUE) colDesign <- data.frame(row.names=samples$sample, tissue=samples$tissue, diet=samples$diet, replicate=samples$replicate ) dds <- DESeqDataSetFromTximport(txi.salmon, colData=colDesign, design= ~tissue+diet) dds$group <- factor(paste0(dds$diet,dds$tissue)) design(dds) <- ~group
The odd thing about the results comes when looking at the fold changes and right-skewed hill p-value distributions of the contrasts. Before any correction I was getting plots that showed that no small log2 fold change, usually around -2 to 2, was flagged as significant, and also getting some truly remarkable fold changes of 50 or so. By inspection of the PCA plot, one of the suggestions was to rerun without samples from one of the tissues as it clusters far from the other three tissues. The new PCA shows the two treatments do not have the same dispersion, with C clustering tightly and B more spread out.
I read that DESeq2's single dispersion parameter might not be adequate in this situation, and to use the fdrtool package to re estimate the null model. This did flatten out the expected p-value distributions (my call of contrast does not use the lfcThreshold and altHypothesis arguments), but it drastically elevated the number of transcripts identified as significant (sometimes from a few hundreds to tens of thousands). Thinking the low count non-informative transcripts from the assembly are contributing to these changes, I wanted to apply the shrinkage to the log folds. The TPM of the vast majority of transcripts in the assembly is indeed below 10. I ended up using betaPrior=TRUE when running DESeq because for some reason using lfcshrink() on the extracted results never finished. I understand this changes how the p-values are computed.
In essence, I want to know if it is statistically sound to use these different modifications together:
- shrinkage to control for the wide changes in low count transcripts
- alternative hypothesis to test for biologically relevant fold change rather than the small, but significant ones with the default null hypothesis
- re-estimating the null model due to differences in dispersion between the two treatments.
I don't know if it would be overcorrecting too much, or the type of results I'm seeing is expected with DESeq2 given the type of dataset I'm working with.
Also, the output of sessionInfo()
R version 3.4.4 (2018-03-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.1 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8  LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  parallel stats4 stats graphics grDevices utils datasets methods base other attached packages:  fdrtool_1.2.15 geneplotter_1.56.0 annotate_1.56.2 XML_3.98-1.16 AnnotationDbi_1.40.0  lattice_0.20-35 ggplot2_3.0.0 BiocParallel_1.12.0 readr_1.1.1 tximport_1.6.0  DESeq2_1.18.1 SummarizedExperiment_1.8.1 DelayedArray_0.4.1 matrixStats_0.54.0 Biobase_2.38.0  GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0 loaded via a namespace (and not attached):  bit64_0.9-7 splines_3.4.4 Formula_1.2-3 assertthat_0.2.0 latticeExtra_0.6-28 blob_1.1.1  GenomeInfoDbData_1.0.0 yaml_2.2.0 pillar_1.3.0 RSQLite_2.1.1 backports_1.1.2 glue_1.3.0  digest_0.6.17 RColorBrewer_1.1-2 XVector_0.18.0 checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6  Matrix_1.2-14 plyr_1.8.4 pkgconfig_2.0.2 genefilter_1.60.0 zlibbioc_1.24.0 purrr_0.2.5  xtable_1.8-3 scales_1.0.0 htmlTable_1.12 tibble_1.4.2 withr_2.1.2 nnet_7.3-12  lazyeval_0.2.1 survival_2.42-6 magrittr_1.5 crayon_1.3.4 memoise_1.1.0 foreign_0.8-71  tools_3.4.4 data.table_1.11.8 hms_0.4.2 stringr_1.3.1 locfit_1.5-9.1 munsell_0.5.0  cluster_2.0.7-1 bindrcpp_0.2.2 compiler_3.4.4 rlang_0.2.2 grid_3.4.4 RCurl_1.95-4.11  rstudioapi_0.8 htmlwidgets_1.3 bitops_1.0-6 base64enc_0.1-3 gtable_0.2.0 DBI_1.0.0  R6_2.3.0 gridExtra_2.3 knitr_1.20 dplyr_0.7.6 bit_1.1-14 bindr_0.1.1  Hmisc_4.1-1 stringi_1.2.4 Rcpp_0.12.19 rpart_4.1-13 acepack_1.4.1 tidyselect_0.2.4