DRIMSeq: many NAs in results compared to DESeq2 ?
1
0
Entering edit mode
sbcn ▴ 80
@sbcn-4752
Last seen 20 months ago
Spain

Hello,

I am running the DTU analysis with DRIMseq.

I was wondering what could be the reasons why I get so many NAs in the results() (in the lr, pvalue and adj_pvalue columns)?

Out of 161750 features, I get 128038 with NAs (running results with the level="feature" option). At the gene level, I get 33272 NAs out of 40110 genes). Does it seem like a "reasonable" proportion?

Can I make it a little less stringent in some way?

Note that I was first filtering with dmFilter(), which I then removed to see if that made a difference, but it doesn't.

On this data set I first performed the differential expression analysis at the gene-level using DESeq2, and then was asked to perform the differential transcript usage, so I gave DRIMSeq a try.

What worries me is that about half the genes that appear significantly differentially expressed between 2 experimental groups (padj < 0.05 in DESeq2) show NA in the lr, pvalue and adj_pvalue fields, in the DRIMSeq gene-level results table.

I know these are different algorithms, and one should not expect the same results, but this is dramatically different!

Do you have any suggestion on how I can "improve" this, or comments in general about this issue?

These are the steps I run:

# import counts (from SALMON)
txi <- tximport(files, type="salmon", txOut=TRUE,
                countsFromAbundance="scaledTPM")

# export count-scale data
cts <- txi$counts
cts <- cts[rowSums(cts) > 0,]

# Build a data frame with the gene ID, transcript ID, and columns for each sample (counts). txdf was created from a gencode gtf
counts <- data.frame(gene_id=txdf$GENEID,
                     feature_id=txdf$TXNAME,
                     cts)

# create a dmDSdata object
d <- dmDSdata(counts=counts, samples=sample_table)

# build design matrix
design_full <- model.matrix(~0+Group, data=DRIMSeq::samples(d))

# Estimate precision
set.seed(1)
d <- dmPrecision(d, design=design_full)

# Fit model
d <- dmFit(d, design=design_full, verbose=1)

# hypothesis testing
 dtest <- dmTest(d, contrast=mycontrast)

# gene-level results
res <- results(dtest)
# feature/transcript-level results
res.txp <- results(dtest, level = "feature")

Session Info:

R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.2 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /nfs/software/as/el7.2/EasyBuild/CRG/software/OpenBLAS/0.3.9-GCC-9.3.0/lib/libopenblas_sandybridgep-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] stageR_1.12.0               SummarizedExperiment_1.20.0
 [3] MatrixGenerics_1.2.1        matrixStats_0.59.0         
 [5] GenomicFeatures_1.42.3      AnnotationDbi_1.52.0       
 [7] Biobase_2.50.0              GenomicRanges_1.42.0       
 [9] GenomeInfoDb_1.26.7         IRanges_2.24.1             
[11] S4Vectors_0.28.1            BiocGenerics_0.36.1        
[13] tximport_1.18.0             DRIMSeq_1.18.0             

loaded via a namespace (and not attached):
 [1] httr_1.4.2               edgeR_3.32.1             jsonlite_1.7.2          
 [4] bit64_4.0.5              assertthat_0.2.1         askpass_1.1             
 [7] BiocFileCache_1.14.0     blob_1.2.1               GenomeInfoDbData_1.2.4  
[10] Rsamtools_2.6.0          progress_1.2.2           pillar_1.6.1            
[13] RSQLite_2.2.7            lattice_0.20-44          glue_1.4.2              
[16] limma_3.46.0             XVector_0.30.0           colorspace_2.0-2        
[19] Matrix_1.3-4             plyr_1.8.6               XML_3.99-0.6            
[22] pkgconfig_2.0.3          biomaRt_2.46.3           zlibbioc_1.36.0         
[25] purrr_0.3.4              scales_1.1.1             BiocParallel_1.24.1     
[28] tibble_3.1.2             openssl_1.4.4            generics_0.1.0          
[31] ggplot2_3.3.5            ellipsis_0.3.2           cachem_1.0.5            
[34] magrittr_2.0.1           crayon_1.4.1             memoise_2.0.0           
[37] fansi_0.5.0              MASS_7.3-54              xml2_1.3.2              
[40] tools_4.0.0              prettyunits_1.1.1        hms_1.1.0               
[43] lifecycle_1.0.0          stringr_1.4.0            munsell_0.5.0           
[46] locfit_1.5-9.4           DelayedArray_0.16.3      Biostrings_2.58.0       
[49] compiler_4.0.0           rlang_0.4.11             grid_4.0.0              
[52] RCurl_1.98-1.3           rstudioapi_0.13          rappdirs_0.3.3          
[55] bitops_1.0-7             gtable_0.3.0             DBI_1.1.1               
[58] curl_4.3.2               reshape2_1.4.4           R6_2.5.0                
[61] GenomicAlignments_1.26.0 dplyr_1.0.7              rtracklayer_1.50.0      
[64] fastmap_1.1.0            bit_4.0.4                utf8_1.2.1              
[67] readr_1.4.0              stringi_1.6.2            Rcpp_1.0.6              
[70] vctrs_0.3.8              dbplyr_2.1.1             tidyselect_1.1.1

Thank you!

DESeq2 DRIMSeq dtu RNASeq • 1.1k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 34 minutes ago
United States

It's not just that these are different algorithms, e.g. DESeq2 vs edgeR or limma.

These are different scientific questions. A gene can be DE and not DTU. And a gene can be DTU but not DE. Or it could be both. But I wouldn't be surprised if these are not the same number for a given dataset.

See this graphic from "RNA sequencing data: hitchhiker's guide to expression analysis":

ADD COMMENT
0
Entering edit mode

Dear Michael,

Thank you for your reply!

I think I got quite confused with the gene-level results: the gene-level results in DRIMSeq do not assess the differential gene expression , but the genes for which there is a differential transcript usage between experimental conditions. And then with results(dtest, level = "feature") , you would actually get the details of those transcripts, correct?

I of course got also a little confused with DTE versus DTU...

If now I want to assess the differential transcript expression (on Salmon raw counts per transcript), I could use package Swish, as you suggest here ?

Thank you so much.

ADD REPLY
1
Entering edit mode

Yes, that would give transcript level results, and in the rnaseqDTU workflow we recommend use of stageR to control for OFDR within stage-wise testing.

Yes, we do recommend Swish for DTE. It is very straightforward to run after generating inferential replicates from Salmon (with --numGibbsSamples 20 and I recommend also --thinningFactor 100 to reduce autocorrelation in the Gibbs samples). We've tested Swish across many datasets and I'm happy with its performance on isoform level data, one of the few drawbacks is that it's not very sensitive with minimal samples size experiments (3 vs 3).

ADD REPLY
0
Entering edit mode

Great, I will give Swish a try. Thank you again for the prompt replies!!

ADD REPLY

Login before adding your answer.

Traffic: 659 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6