Question: Using tximport on rsem files produces error while trying to read in files
0
19 months ago by
stephan.emmrich0 wrote:

Hi,

I am trying to import RSEM output files into tximport. Using this code

sTPM_hsa <- tximport(files = rsem_files, type = "rsem", txIn = TRUE, txOut = FALSE, countsFromAbundance = "scaledTPM")

with

rsem_files <- list.files(sTPM_basedir, pattern = "genes", full.names = TRUE)
rsem_files <- rsem_files[file.info(rsem_files)$isdir] rsem_files <- paste0(rsem_files, "/sTPM.sf") rsem_files <- rsem_files[file.exists(rsem_files)] names(rsem_files) <- basename(gsub("/sTPM.sf", "", rsem_files)) I get this: Error in if (is.null(importer) & !kallisto.h5) { : missing value where TRUE/FALSE needed That points to line 168 in tximport.R, but I don"t get the problem?! I have the readr package installed, also even if not it should use another function to read in the files... I could not find this issue in any related post, hopefully its not a too stupid mistake... Appreciate any help! Thank You, Stephan rnaseq R tpm tximport rsem • 1.0k views ADD COMMENTlink modified 19 months ago • written 19 months ago by stephan.emmrich0 I have 1.6. just downloaded today from Bioconductor ADD REPLYlink written 19 months ago by stephan.emmrich0 Thanks! When I do this, then > install_github(“mikelove/tximport”) Error: unexpected input in "install_github(“" I am using RStudio1.1.442 with R3.4.4 on a windows 7 PC, could windows be the problem here? ADD REPLYlink written 19 months ago by stephan.emmrich0 It may be curly quotes? Try typing it out instead of copy paste ADD REPLYlink written 19 months ago by Michael Love26k Answer: Using tximport on rsem files produces error while trying to read in files 0 19 months ago by Michael Love26k United States Michael Love26k wrote: What version of tximport do you have? ADD COMMENTlink written 19 months ago by Michael Love26k Can you try this and see if it fixes the error: library(devtools) install_github(“mikelove/tximport”) Then restart R ADD REPLYlink written 19 months ago by Michael Love26k Great, that works! Now I have tximport 1.7.13. But sadly that gives me same error as from the original post with that commands: Error in if (is.null(importer) & !kallisto.h5) { : missing value where TRUE/FALSE needed  Weirdly, when I run a sample to try out tximport v.1.7.13: sTPM_hsa_TEST <- tximport(files = "./sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg.GRCh38.20150622.RESULTS", type = "rsem", countsFromAbundance = "scaledTPM") it gives this error: reading in files with read_tsv 1 Warning: 120966 parsing failures. row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 length no trailing characters .10 './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ file 2 1 NA 8 columns 15 columns './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ row 3 2 length no trailing characters .50 './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ col 4 2 NA 8 columns 15 columns './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ expected  But when I run that same sample code to try out tximport v.1.6.00, then the output has a "no" for every row in the countsFromAbundance column!? So the simple one-sample tximport run with a local path gets a parsing error or does not make scaledTPMs, depending on version. The complex tximport run with multiple files gets stuck before anything is even read in. Maybe a problem with my variables or pathnames? Here again everything I typed for the multiple files run from your paper: basedir <- "E:/work/Dropbox/Bioware/Rochester/HSC/RNA Seq/R" sTPM_basedir <- paste0(basedir, "/sTPM/Blueprint") rsem_files <- list.files(sTPM_basedir, pattern = "genes", full.names = TRUE) rsem_files <- rsem_files[file.info(rsem_files)$isdir]
rsem_files <- paste0(rsem_files, "/sTPM.sf")
rsem_files <- rsem_files[file.exists(rsem_files)]
names(rsem_files) <- basename(gsub("/sTPM.sf", "", rsem_files))

sTPM_hsa <- tximport(files = rsem_files, type = "rsem", txIn = TRUE, txOut = FALSE, countsFromAbundance = "scaledTPM", readLength = 100)

Error in if (is.null(importer) & !kallisto.h5) { :
missing value where TRUE/FALSE needed

I’m a bit lost on how this error is coming up. Tell me more about these files. They are isoform level results from RSEM (because you have txIn=TRUE)?

ADD REPLYlink modified 19 months ago • written 19 months ago by Michael Love26k

The files are from Blueprint DCC portal, primary analysis done by Grape2 pipeline (https://github.com/guigolab/grape-nf).

This is the header and first 5 rows of such a file (total 30).

 gene_id transcript_id(s) length effective_length expected_count TPM FPKM posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound ENSG00000000003.13 ENST00000373020.7,ENST00000494424.1,ENST00000496771.4,ENST00000612152.3,ENST00000614008.3 2074.1 1837.69 885 3.65 4.35 885 0 3.64 4.38 3.33516 3.96303 4.02217 4.77881 ENSG00000000005.5 ENST00000373031.4,ENST00000485971.1 940.5 704.26 0 0 0 0 0 0.03 0.04 0.000306 0.08152 0.000426 0.098299 ENSG00000000419.11 ENST00000371582.7,ENST00000371584.7,ENST00000371588.8,ENST00000413082.1,ENST00000466152.4,ENST00000494752.1 1079.47 843.07 5697 51.15 61.01 5697 0 50.66 61.02 49.2778 51.9333 59.392 62.5898 ENSG00000000457.12 ENST00000367770.4,ENST00000367771.9,ENST00000367772.7,ENST00000423670.1,ENST00000470238.1 1538 1301.6 16 0.09 0.11 16.01 0.11 0.11 0.13 0.06088 0.154207 0.074289 0.186764 ENSG00000000460.15 ENST00000286031.9,ENST00000359326.7,ENST00000413811.3,ENST00000459772.4,ENST00000466580.5,ENST00000472795.4,ENST00000481744.4,ENST00000496973.4,ENST00000498289.4 2066.69 1830.32 659 2.73 3.25 658.99 0.11 2.78 3.35 2.40965 3.16032 2.90657 3.81134 ENSG00000000938.11 ENST00000374003.6,ENST00000374004.4,ENST00000374005.6,ENST00000399173.4,ENST00000457296.4,ENST00000468038.1,ENST00000475472.4 2729 2492.6 4 0.01 0.01 4 0 0.07 0.08 0.023157 0.115701 0.027981 0.139542

Actually I am not interested in isoform quantitation so far, as I want to compare to two other species, with gene symbol as common identifier, Due to multimapped reads fro the non-model organism isoform comparisions would be difficult to interprete. I just took the code from the paper, when I tried to run it as simple minimal version i just called for "type" and "countsFromAbundance"

Hmm, I think the problem may be that these are not in RSEM format. Due to other code problems, I've had to hard code the parsing of each column, and so I can only support files output directly by RSEM.

I don't think tximport() is useful for you here, and I'd recommend just reading in the individual files and building a gene count matrix manually. You can round() this and provide it to DESeqDataSetFromMatrix (for DESeq2) or other packages like edgeR and limma can take fractional counts directly.

Ok, so here is an output i directly created by using our own sequenced data:

 gene_id transcript_id(s) length effective_length expected_count TPM FPKM gene:ENSG00000000003_TSPAN6 transcript:ENST00000373020_TSPAN6-201,transcript:ENST00000612152_TSPAN6-204,transcript:ENST00000614008_TSPAN6-205 2583.71 2355.23 154 0.92 1 gene:ENSG00000000005_TNMD transcript:ENST00000373031_TNMD-201 1339 1110.52 0 0 0 gene:ENSG00000000419_DPM1 transcript:ENST00000371582_DPM1-201,transcript:ENST00000371584_DPM1-202,transcript:ENST00000371588_DPM1-203,transcript:ENST00000413082_DPM1-204 1084.35 855.87 3106 51.33 55.27 gene:ENSG00000000457_SCYL3 transcript:ENST00000367770_SCYL3-201,transcript:ENST00000367771_SCYL3-202,transcript:ENST00000367772_SCYL3-203,transcript:ENST00000423670_SCYL3-204 4358.65 4130.17 1455.18 4.98 5.37 gene:ENSG00000000460_C1orf112 transcript:ENST00000286031_C1orf112-201,transcript:ENST00000359326_C1orf112-202,transcript:ENST00000413811_C1orf112-203,transcript:ENST00000459772_C1orf112-204,transcript:ENST00000466580_C1orf112-205,transcript:ENST00000472795_C1orf112-206,transcript:ENST00000481744_C1orf112-207,transcript:ENST00000496973_C1orf112-208 3127.66 2899.18 1603.82 7.82 8.43

only difference here is that it didn't do the posterior mean and calc-ci option for the additional columns compared to the Blueprint sample above.

This is the same data format?

I wanted to use tximport as to have a library-size normalized TPM metrics for downstream analysis. If I understand you correctly, for my purpose of comparing genes not isoforms this is not necessary because edgeR/Glimma can provide depth-normalization based on counts and this I could use on the TPMs from RSEM?

I think we're confusing some things here. TPM are already library size corrected. But TPM are not ideal for computing differential expression, because they have thrown out information about the precision of the abundance estimate. You can provide estimated counts to DESeq2, edgeR or limma and they will correct for library size. RSEM computes gene-level estimated counts the same way that tximport does, so I think you'll perhaps avoid more problems by assembling these matrices yourself.

Ok, I will do that then. Thanks a lot for your help, very educating!!

One thing about TPMs though: If I would want to compare between gene abundances in one sample-set vs another (e.g. one species vs the other), then I would choose TPMs, right? And log2 transform them for downstream DGE analysis.

Now, when RSEM provides effective counts by abundance estimation (expectation maximization they call it in their paper Haas BJ et al, 2013 Nat Protocols), then the RSEM TPMs should include that already, or am I confusing sth again? If so, then RSEM TPMs would be scaledTPMs?

Here's my recommendation, because I don't know if I can answer all of your questions, but I do want to help guide you to the right analysis: import gene-level estimated counts from your files (avoiding tximport only because you have some difficulty reading in these files). Pass this matrix of counts to your choice of DGE method in Bioconductor. You can work with the TPMs if you want, but then I don't have any software or analysis recommendations for you, because the methods I am familiar with make use of counts.

I just updated tximport to v1.7.14 on GitHub with new error testing code. Can you try this version?

Interestingly, now I get

Error: length(files) > 0 is not TRUE

I read that this is a common issue with RSEM and shouldn't this in your script already take care of it?

if (type == "rsem") {
# protect against 0 bp length transcripts
txi$length[txi$length < 1] <- 1
ADD REPLYlink modified 19 months ago by Michael Love26k • written 19 months ago by stephan.emmrich0

Good! I suspected this might be the issue so added a bit of code to protect you and future users from confusing error messages. The problem is that the first argument of tximport() needs to have at least 1 file. You are providing an empty vector.

Hi ,I have this length(files) > 0 is not TRUE. I don't understand this last response. I paste my code here, could you let me know if you have any idea how to fix this:

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("tximport")
library(tximport)
samples <- list.files(path = "home/data/Continuum_ChIPSeq/R_txtimport", full.names = T, pattern="\\.salmon\$")
files <- file.path(samples, "quant.sf")
txi <- tximport(files, type="salmon", tx2gene=tx2gene[,c("tx_id", "ensgene")], countsFromAbundance="lengthScaledTPM")

1

The error means that your files vector is empty. Try things out in your R session and you may be able to debug the issue yourself:

E.g. if R is telling you that the files vector is empty, take a look at the files vector by printing it:

> files


What do you see? And if this is indeed an empty vector, take a look at the line where you construct files. Maybe there is something about the samples object, so take a look at that:

> samples


etc.