Question

tximport: Error in medianLengthOverIsoform

0

Entering edit mode

d-joester ▴ 20

@d-joester-19665

Last seen 6.9 years ago

Hi,

I am very new to RNAseq, bioconductor, and R. So far, I have successfully established my workflow for DGE of my data, using Salmon, to create transcript-level count tables; tximport, and DESeq2 to aggregate and perform gene level analysis. The following code works as expected:

samples <- read.table(file.path("somepath","samples.txt"), header=TRUE)
rownames(samples) <- samples$run

files <- file.path("somepath", paste(samples$run,"quant",sep="_"), "quant.sf")
names(files) <- samples$run

tx2gene <- read_csv(file.path("somepath", "tx2gene.csv")) 

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)

ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = samples,
                                   design = ~ condition)

dds <- ddsTxi[rowSums(counts(ddsTxi)) >= 1,]
dds <- DESeq(dds)

I am now trying to run a DTU analysis. Following a published workflow rnaseqDTU, I used

txi <- tximport(files, type="salmon", txOut=TRUE,
            countsFromAbundance="scaledTPM")

to import the Salmon data instead. This also seems to work fine. However, in the documentation for tximport, I found a reference to using the option countsFromAbundance="dtuScaledTPM", which requires also passing the tx2gene dataframe. When I use

txi <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="dtuScaledTPM",tx2gene=tx2gene)

instead, I get the following output:

reading in files with read_tsv

1 2 3 4 5 6 7 8 9 10 11 12 Error in medianLengthOverIsoform(length4CFA, tx2gene, ignoreTxVersion, :
all(txId %in% tx2gene$tx) is not TRUE

I'd appreciate some help figuring out what is going on here. It sounds like there is a set of transcripts (txId) that is not fully contained in the set that I provided in tx2gene. I suspect that txId is derived from the quant.sf files generated using Salmon, but it would help to be certain. It would be even better if there was a way to identify the missing transcripts.

On a more general level, I did not find much information regarding when to use scaledTPM and when to use dtuScaledTPM as options (for DTU analysis). Any suggestions are welcome.

Thank you very much,

Derk

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS  10.14.2

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Users/derk/anaconda3/envs/salmon/lib/R/lib/libRblas.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fission_1.2.0               apeglm_1.4.2                PoiClaClu_1.0.2.1          
 [4] RColorBrewer_1.1-2          pheatmap_1.0.12             hexbin_1.27.2              
 [7] ggplot2_3.1.0               dplyr_0.7.8                 tximportData_1.10.0        
[10] readr_1.3.1                 tximport_1.10.1             DESeq2_1.22.2              
[13] SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
[16] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0       
[19] GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1           
[22] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            numDeriv_2016.8-1      tools_3.5.1           
 [5] backports_1.1.3        utf8_1.1.4             R6_2.3.0               rpart_4.1-13          
 [9] Hmisc_4.2-0            DBI_1.0.0              lazyeval_0.2.1         colorspace_1.4-0      
[13] nnet_7.3-12            withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3         
[17] bit_1.1-14             compiler_3.5.1         cli_1.0.1              htmlTable_1.13.1      
[21] labeling_0.3           scales_1.0.0           checkmate_1.9.1        genefilter_1.64.0     
[25] stringr_1.3.1          digest_0.6.18          foreign_0.8-71         XVector_0.22.0        
[29] base64enc_0.1-3        pkgconfig_2.0.2        htmltools_0.3.6        bbmle_1.0.20          
[33] htmlwidgets_1.3        rlang_0.3.1            rstudioapi_0.9.0       RSQLite_2.1.1         
[37] bindr_0.1.1            jsonlite_1.6           acepack_1.4.1          RCurl_1.95-4.11       
[41] magrittr_1.5           GenomeInfoDbData_1.2.0 Formula_1.2-3          Matrix_1.2-15         
[45] Rcpp_1.0.0             munsell_0.5.0          fansi_0.4.0            stringi_1.2.4         
[49] yaml_2.2.0             MASS_7.3-51.1          zlibbioc_1.28.0        plyr_1.8.4            
[53] grid_3.5.1             blob_1.1.1             crayon_1.3.4           lattice_0.20-38       
[57] splines_3.5.1          annotate_1.60.0        hms_0.4.2              locfit_1.5-9.1        
[61] knitr_1.21             pillar_1.3.1           geneplotter_1.60.0     reshape2_1.4.3        
[65] XML_3.98-1.16          glue_1.3.0             latticeExtra_0.6-28    data.table_1.12.0     
[69] BiocManager_1.30.4     gtable_0.2.0           purrr_0.3.0            assertthat_0.2.0      
[73] emdbook_1.3.10         xfun_0.4               xtable_1.8-3           coda_0.19-2           
[77] survival_2.43-3        tibble_2.0.1           AnnotationDbi_1.44.0   memoise_1.1.0         
[81] bindrcpp_0.2.2         cluster_2.0.7-1

tximport deseq2 dexseq • 2.5k views

ADD COMMENT • link 7.0 years ago d-joester ▴ 20

score 0 · Answer 1 · 2019-01-30

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 5 hours ago

United States

It's required that you provide the gene ID for all of the transcripts in the Salmon quants. And it has to be exact, or you have to use some of the arguments that specify what should be stripped from the Salmon transcript ID to make it match your tx2gene table. Can you show tx2gene? And what transcripts did you use with Salmon?

ADD COMMENT • link 7.0 years ago Michael Love 43k

0

Entering edit mode

Michael,

thanks for your quick response. I used Salmon to create an index of the reference transcriptome for S. purpuratus hosted on Echinobase (link. My tx2name dataframe looks like this:

> head(tx2gene)

# A tibble: 6 x 3
  TXNAME         GENEID       ntx        
  <chr>          <chr>        <S3: table>
1 WHL22.324130.0 WHL22.324130 1          
2 WHL22.324131.1 WHL22.324131 2          
3 WHL22.224727.1 WHL22.224727 2          
4 WHL22.100119.0 WHL22.100119 1          
5 WHL22.100252.0 WHL22.100252 1          
6 WHL22.100256.0 WHL22.100256 1

I poked around my files in the meantime and realized that for some reason 3 transcripts went missing from my csv file. I added these back in and things started working again. I'd still be glad if you could comment on the use case for countsFromAbundance="dtuScaledTPM" vs countsFromAbundance="scaledTPM".

In the meantime, I encountered another issue further downstream, with stageR. I'll put this in a different question.

Thank you again, Derk

ADD REPLY • link 7.0 years ago d-joester ▴ 20

0

Entering edit mode

I didn’t benchmark dtuScaledTPM but added to tximport to allow for later benchmarking. I think the technical description is in ?tximport. dtuScaledTPM should be closer to the original counts than scaledTPM which is desirable for information about the precision to propagate to statistical methods.

ADD REPLY • link 7.0 years ago Michael Love 43k

score 0 · Answer 2 · 2019-01-30

0

Entering edit mode

d-joester ▴ 20

@d-joester-19665

Last seen 6.9 years ago

Poking around, I found that

setdiff(rownames(cts),tx2gene$TXNAME)

revealed that tx2gene was missing three transcripts. I added these to my csv file and the issue disappeared.

ADD COMMENT • link 7.0 years ago d-joester ▴ 20