Hi,
I am very new to RNAseq, bioconductor, and R. So far, I have successfully established my workflow for DGE of my data, using Salmon, to create transcript-level count tables; tximport, and DESeq2 to aggregate and perform gene level analysis. The following code works as expected:
samples <- read.table(file.path("somepath","samples.txt"), header=TRUE)
rownames(samples) <- samples$run
files <- file.path("somepath", paste(samples$run,"quant",sep="_"), "quant.sf")
names(files) <- samples$run
tx2gene <- read_csv(file.path("somepath", "tx2gene.csv")) 
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = samples,
                                   design = ~ condition)
dds <- ddsTxi[rowSums(counts(ddsTxi)) >= 1,]
dds <- DESeq(dds)
I am now trying to run a DTU analysis. Following a published workflow rnaseqDTU, I used
txi <- tximport(files, type="salmon", txOut=TRUE,
            countsFromAbundance="scaledTPM")
to import the Salmon data instead. This also seems to work fine. However, in the documentation for tximport, I found a reference to using the option countsFromAbundance="dtuScaledTPM", which requires also passing the tx2gene dataframe. When I use
txi <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="dtuScaledTPM",tx2gene=tx2gene)
instead, I get the following output:
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 Error in medianLengthOverIsoform(length4CFA, tx2gene, ignoreTxVersion, :
all(txId %in% tx2gene$tx) is not TRUE
I'd appreciate some help figuring out what is going on here. It sounds like there is a set of transcripts (txId) that is not fully contained in the set that I provided in tx2gene. I suspect that txId is derived from the quant.sf files generated using Salmon, but it would help to be certain. It would be even better if there was a way to identify the missing transcripts.
On a more general level, I did not find much information regarding when to use scaledTPM and when to use dtuScaledTPM as options (for DTU analysis). Any suggestions are welcome.
Thank you very much,
Derk
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS  10.14.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Users/derk/anaconda3/envs/salmon/lib/R/lib/libRblas.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] fission_1.2.0               apeglm_1.4.2                PoiClaClu_1.0.2.1          
 [4] RColorBrewer_1.1-2          pheatmap_1.0.12             hexbin_1.27.2              
 [7] ggplot2_3.1.0               dplyr_0.7.8                 tximportData_1.10.0        
[10] readr_1.3.1                 tximport_1.10.1             DESeq2_1.22.2              
[13] SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
[16] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0       
[19] GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1           
[22] BiocGenerics_0.28.0        
loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            numDeriv_2016.8-1      tools_3.5.1           
 [5] backports_1.1.3        utf8_1.1.4             R6_2.3.0               rpart_4.1-13          
 [9] Hmisc_4.2-0            DBI_1.0.0              lazyeval_0.2.1         colorspace_1.4-0      
[13] nnet_7.3-12            withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3         
[17] bit_1.1-14             compiler_3.5.1         cli_1.0.1              htmlTable_1.13.1      
[21] labeling_0.3           scales_1.0.0           checkmate_1.9.1        genefilter_1.64.0     
[25] stringr_1.3.1          digest_0.6.18          foreign_0.8-71         XVector_0.22.0        
[29] base64enc_0.1-3        pkgconfig_2.0.2        htmltools_0.3.6        bbmle_1.0.20          
[33] htmlwidgets_1.3        rlang_0.3.1            rstudioapi_0.9.0       RSQLite_2.1.1         
[37] bindr_0.1.1            jsonlite_1.6           acepack_1.4.1          RCurl_1.95-4.11       
[41] magrittr_1.5           GenomeInfoDbData_1.2.0 Formula_1.2-3          Matrix_1.2-15         
[45] Rcpp_1.0.0             munsell_0.5.0          fansi_0.4.0            stringi_1.2.4         
[49] yaml_2.2.0             MASS_7.3-51.1          zlibbioc_1.28.0        plyr_1.8.4            
[53] grid_3.5.1             blob_1.1.1             crayon_1.3.4           lattice_0.20-38       
[57] splines_3.5.1          annotate_1.60.0        hms_0.4.2              locfit_1.5-9.1        
[61] knitr_1.21             pillar_1.3.1           geneplotter_1.60.0     reshape2_1.4.3        
[65] XML_3.98-1.16          glue_1.3.0             latticeExtra_0.6-28    data.table_1.12.0     
[69] BiocManager_1.30.4     gtable_0.2.0           purrr_0.3.0            assertthat_0.2.0      
[73] emdbook_1.3.10         xfun_0.4               xtable_1.8-3           coda_0.19-2           
[77] survival_2.43-3        tibble_2.0.1           AnnotationDbi_1.44.0   memoise_1.1.0         
[81] bindrcpp_0.2.2         cluster_2.0.7-1

Michael,
thanks for your quick response. I used Salmon to create an index of the reference transcriptome for S. purpuratus hosted on Echinobase (link. My tx2name dataframe looks like this:
I poked around my files in the meantime and realized that for some reason 3 transcripts went missing from my csv file. I added these back in and things started working again. I'd still be glad if you could comment on the use case for
countsFromAbundance="dtuScaledTPM"vscountsFromAbundance="scaledTPM".In the meantime, I encountered another issue further downstream, with stageR. I'll put this in a different question.
Thank you again, Derk
I didn’t benchmark dtuScaledTPM but added to tximport to allow for later benchmarking. I think the technical description is in ?tximport. dtuScaledTPM should be closer to the original counts than scaledTPM which is desirable for information about the precision to propagate to statistical methods.