Hello,
I am running the DTU analysis with DRIMseq.
I was wondering what could be the reasons why I get so many NAs in the results()
(in the lr, pvalue and adj_pvalue columns)?
Out of 161750 features, I get 128038 with NAs (running results with the level="feature"
option). At the gene level, I get 33272 NAs out of 40110 genes).
Does it seem like a "reasonable" proportion?
Can I make it a little less stringent in some way?
Note that I was first filtering with dmFilter()
, which I then removed to see if that made a difference, but it doesn't.
On this data set I first performed the differential expression analysis at the gene-level using DESeq2, and then was asked to perform the differential transcript usage, so I gave DRIMSeq a try.
What worries me is that about half the genes that appear significantly differentially expressed between 2 experimental groups (padj < 0.05 in DESeq2) show NA in the lr, pvalue and adj_pvalue fields, in the DRIMSeq gene-level results table.
I know these are different algorithms, and one should not expect the same results, but this is dramatically different!
Do you have any suggestion on how I can "improve" this, or comments in general about this issue?
These are the steps I run:
# import counts (from SALMON)
txi <- tximport(files, type="salmon", txOut=TRUE,
countsFromAbundance="scaledTPM")
# export count-scale data
cts <- txi$counts
cts <- cts[rowSums(cts) > 0,]
# Build a data frame with the gene ID, transcript ID, and columns for each sample (counts). txdf was created from a gencode gtf
counts <- data.frame(gene_id=txdf$GENEID,
feature_id=txdf$TXNAME,
cts)
# create a dmDSdata object
d <- dmDSdata(counts=counts, samples=sample_table)
# build design matrix
design_full <- model.matrix(~0+Group, data=DRIMSeq::samples(d))
# Estimate precision
set.seed(1)
d <- dmPrecision(d, design=design_full)
# Fit model
d <- dmFit(d, design=design_full, verbose=1)
# hypothesis testing
dtest <- dmTest(d, contrast=mycontrast)
# gene-level results
res <- results(dtest)
# feature/transcript-level results
res.txp <- results(dtest, level = "feature")
Session Info:
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.2 (Nitrogen)
Matrix products: default
BLAS/LAPACK: /nfs/software/as/el7.2/EasyBuild/CRG/software/OpenBLAS/0.3.9-GCC-9.3.0/lib/libopenblas_sandybridgep-r0.3.9.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] stageR_1.12.0 SummarizedExperiment_1.20.0
[3] MatrixGenerics_1.2.1 matrixStats_0.59.0
[5] GenomicFeatures_1.42.3 AnnotationDbi_1.52.0
[7] Biobase_2.50.0 GenomicRanges_1.42.0
[9] GenomeInfoDb_1.26.7 IRanges_2.24.1
[11] S4Vectors_0.28.1 BiocGenerics_0.36.1
[13] tximport_1.18.0 DRIMSeq_1.18.0
loaded via a namespace (and not attached):
[1] httr_1.4.2 edgeR_3.32.1 jsonlite_1.7.2
[4] bit64_4.0.5 assertthat_0.2.1 askpass_1.1
[7] BiocFileCache_1.14.0 blob_1.2.1 GenomeInfoDbData_1.2.4
[10] Rsamtools_2.6.0 progress_1.2.2 pillar_1.6.1
[13] RSQLite_2.2.7 lattice_0.20-44 glue_1.4.2
[16] limma_3.46.0 XVector_0.30.0 colorspace_2.0-2
[19] Matrix_1.3-4 plyr_1.8.6 XML_3.99-0.6
[22] pkgconfig_2.0.3 biomaRt_2.46.3 zlibbioc_1.36.0
[25] purrr_0.3.4 scales_1.1.1 BiocParallel_1.24.1
[28] tibble_3.1.2 openssl_1.4.4 generics_0.1.0
[31] ggplot2_3.3.5 ellipsis_0.3.2 cachem_1.0.5
[34] magrittr_2.0.1 crayon_1.4.1 memoise_2.0.0
[37] fansi_0.5.0 MASS_7.3-54 xml2_1.3.2
[40] tools_4.0.0 prettyunits_1.1.1 hms_1.1.0
[43] lifecycle_1.0.0 stringr_1.4.0 munsell_0.5.0
[46] locfit_1.5-9.4 DelayedArray_0.16.3 Biostrings_2.58.0
[49] compiler_4.0.0 rlang_0.4.11 grid_4.0.0
[52] RCurl_1.98-1.3 rstudioapi_0.13 rappdirs_0.3.3
[55] bitops_1.0-7 gtable_0.3.0 DBI_1.1.1
[58] curl_4.3.2 reshape2_1.4.4 R6_2.5.0
[61] GenomicAlignments_1.26.0 dplyr_1.0.7 rtracklayer_1.50.0
[64] fastmap_1.1.0 bit_4.0.4 utf8_1.2.1
[67] readr_1.4.0 stringi_1.6.2 Rcpp_1.0.6
[70] vctrs_0.3.8 dbplyr_2.1.1 tidyselect_1.1.1
Thank you!
Dear Michael,
Thank you for your reply!
I think I got quite confused with the gene-level results: the gene-level results in DRIMSeq do not assess the differential gene expression , but the genes for which there is a differential transcript usage between experimental conditions. And then with
results(dtest, level = "feature")
, you would actually get the details of those transcripts, correct?I of course got also a little confused with DTE versus DTU...
If now I want to assess the differential transcript expression (on Salmon raw counts per transcript), I could use package Swish, as you suggest here ?
Thank you so much.
Yes, that would give transcript level results, and in the rnaseqDTU workflow we recommend use of stageR to control for OFDR within stage-wise testing.
Yes, we do recommend Swish for DTE. It is very straightforward to run after generating inferential replicates from Salmon (with
--numGibbsSamples 20
and I recommend also--thinningFactor 100
to reduce autocorrelation in the Gibbs samples). We've tested Swish across many datasets and I'm happy with its performance on isoform level data, one of the few drawbacks is that it's not very sensitive with minimal samples size experiments (3 vs 3).Great, I will give Swish a try. Thank you again for the prompt replies!!