Question: Using tximport on rsem files produces error while trying to read in files
0
gravatar for stephan.emmrich
19 months ago by
stephan.emmrich0 wrote:

Hi,

I am trying to import RSEM output files into tximport. Using this code

 

sTPM_hsa <- tximport(files = rsem_files, type = "rsem", txIn = TRUE, txOut = FALSE, countsFromAbundance = "scaledTPM")

 

with

 

rsem_files <- list.files(sTPM_basedir, pattern = "genes", full.names = TRUE)
rsem_files <- rsem_files[file.info(rsem_files)$isdir]
rsem_files <- paste0(rsem_files, "/sTPM.sf")
rsem_files <- rsem_files[file.exists(rsem_files)]
names(rsem_files) <- basename(gsub("/sTPM.sf", "", rsem_files))

 

I get this:

 

Error in if (is.null(importer) & !kallisto.h5) { : 
  missing value where TRUE/FALSE needed

 

That points to line 168 in tximport.R, but I don"t get the problem?! I have the readr package installed, also even if not it should use another function to read in the files...

I could not find this issue in any related post, hopefully its not a too stupid mistake...

Appreciate any help!

 

Thank You,

Stephan 

rnaseq R tpm tximport rsem • 1.0k views
ADD COMMENTlink modified 19 months ago • written 19 months ago by stephan.emmrich0

I have 1.6. just downloaded today from Bioconductor

ADD REPLYlink written 19 months ago by stephan.emmrich0

Thanks!

 

When I do this, then

 

> install_github(“mikelove/tximport”)
Error: unexpected input in "install_github(“"

I am using RStudio1.1.442 with R3.4.4 on a windows 7 PC, could windows be the problem here?

ADD REPLYlink written 19 months ago by stephan.emmrich0

It may be curly quotes? Try typing it out instead of copy paste 

ADD REPLYlink written 19 months ago by Michael Love26k
Answer: Using tximport on rsem files produces error while trying to read in files
0
gravatar for Michael Love
19 months ago by
Michael Love26k
United States
Michael Love26k wrote:

What version of tximport do you have?

ADD COMMENTlink written 19 months ago by Michael Love26k

Can you try this and see if it fixes the error:

library(devtools)

install_github(“mikelove/tximport”)

Then restart R

ADD REPLYlink written 19 months ago by Michael Love26k

Great, that works! Now I have tximport 1.7.13.

But sadly that gives me same error as from the original post with that commands:

Error in if (is.null(importer) & !kallisto.h5) { : 
  missing value where TRUE/FALSE needed

Weirdly, when I run a sample to try out tximport v.1.7.13:

sTPM_hsa_TEST <- tximport(files = "./sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg.GRCh38.20150622.RESULTS", type = "rsem", countsFromAbundance = "scaledTPM")

it gives this error:

reading in files with read_tsv
1 Warning: 120966 parsing failures.
row # A tibble: 5 x 5 col     row col    expected               actual     file                                                            expected   <int> <chr>  <chr>                  <chr>      <chr>                                                           actual 1     1 length no trailing characters .10        './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ file 2     1 NA     8 columns              15 columns './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ row 3     2 length no trailing characters .50        './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ col 4     2 NA     8 columns              15 columns './sTPM/Blueprint/C000WYB3.gene_quantification.rsem_grape2_crg~ expected 

 

But when I run that same sample code to try out tximport v.1.6.00, then the output has a "no" for every row in the countsFromAbundance column!?

So the simple one-sample tximport run with a local path gets a parsing error or does not make scaledTPMs, depending on version. The complex tximport run with multiple files gets stuck before anything is even read in. Maybe a problem with my variables or pathnames? Here again everything I typed for the multiple files run from your paper:

basedir <- "E:/work/Dropbox/Bioware/Rochester/HSC/RNA Seq/R"
sTPM_basedir <- paste0(basedir, "/sTPM/Blueprint")

rsem_files <- list.files(sTPM_basedir, pattern = "genes", full.names = TRUE)
rsem_files <- rsem_files[file.info(rsem_files)$isdir]
rsem_files <- paste0(rsem_files, "/sTPM.sf")
rsem_files <- rsem_files[file.exists(rsem_files)]
names(rsem_files) <- basename(gsub("/sTPM.sf", "", rsem_files))

sTPM_hsa <- tximport(files = rsem_files, type = "rsem", txIn = TRUE, txOut = FALSE, countsFromAbundance = "scaledTPM", readLength = 100)

Error in if (is.null(importer) & !kallisto.h5) { : 
  missing value where TRUE/FALSE needed
ADD REPLYlink written 19 months ago by stephan.emmrich0

I’m a bit lost on how this error is coming up. Tell me more about these files. They are isoform level results from RSEM (because you have txIn=TRUE)?

ADD REPLYlink modified 19 months ago • written 19 months ago by Michael Love26k

The files are from Blueprint DCC portal, primary analysis done by Grape2 pipeline (https://github.com/guigolab/grape-nf).

This is the header and first 5 rows of such a file (total 30).

gene_id transcript_id(s) length effective_length expected_count TPM FPKM posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound
ENSG00000000003.13 ENST00000373020.7,ENST00000494424.1,ENST00000496771.4,ENST00000612152.3,ENST00000614008.3 2074.1 1837.69 885 3.65 4.35 885 0 3.64 4.38 3.33516 3.96303 4.02217 4.77881
ENSG00000000005.5 ENST00000373031.4,ENST00000485971.1 940.5 704.26 0 0 0 0 0 0.03 0.04 0.000306 0.08152 0.000426 0.098299
ENSG00000000419.11 ENST00000371582.7,ENST00000371584.7,ENST00000371588.8,ENST00000413082.1,ENST00000466152.4,ENST00000494752.1 1079.47 843.07 5697 51.15 61.01 5697 0 50.66 61.02 49.2778 51.9333 59.392 62.5898
ENSG00000000457.12 ENST00000367770.4,ENST00000367771.9,ENST00000367772.7,ENST00000423670.1,ENST00000470238.1 1538 1301.6 16 0.09 0.11 16.01 0.11 0.11 0.13 0.06088 0.154207 0.074289 0.186764
ENSG00000000460.15 ENST00000286031.9,ENST00000359326.7,ENST00000413811.3,ENST00000459772.4,ENST00000466580.5,ENST00000472795.4,ENST00000481744.4,ENST00000496973.4,ENST00000498289.4 2066.69 1830.32 659 2.73 3.25 658.99 0.11 2.78 3.35 2.40965 3.16032 2.90657 3.81134
ENSG00000000938.11 ENST00000374003.6,ENST00000374004.4,ENST00000374005.6,ENST00000399173.4,ENST00000457296.4,ENST00000468038.1,ENST00000475472.4 2729 2492.6 4 0.01 0.01 4 0 0.07 0.08 0.023157 0.115701 0.027981 0.139542

 Actually I am not interested in isoform quantitation so far, as I want to compare to two other species, with gene symbol as common identifier, Due to multimapped reads fro the non-model organism isoform comparisions would be difficult to interprete. I just took the code from the paper, when I tried to run it as simple minimal version i just called for "type" and "countsFromAbundance"

ADD REPLYlink written 19 months ago by stephan.emmrich0

Hmm, I think the problem may be that these are not in RSEM format. Due to other code problems, I've had to hard code the parsing of each column, and so I can only support files output directly by RSEM.

I don't think tximport() is useful for you here, and I'd recommend just reading in the individual files and building a gene count matrix manually. You can round() this and provide it to DESeqDataSetFromMatrix (for DESeq2) or other packages like edgeR and limma can take fractional counts directly.

ADD REPLYlink written 19 months ago by Michael Love26k

Ok, so here is an output i directly created by using our own sequenced data:

gene_id transcript_id(s) length effective_length expected_count TPM FPKM
gene:ENSG00000000003_TSPAN6 transcript:ENST00000373020_TSPAN6-201,transcript:ENST00000612152_TSPAN6-204,transcript:ENST00000614008_TSPAN6-205 2583.71 2355.23 154 0.92 1
gene:ENSG00000000005_TNMD transcript:ENST00000373031_TNMD-201 1339 1110.52 0 0 0
gene:ENSG00000000419_DPM1 transcript:ENST00000371582_DPM1-201,transcript:ENST00000371584_DPM1-202,transcript:ENST00000371588_DPM1-203,transcript:ENST00000413082_DPM1-204 1084.35 855.87 3106 51.33 55.27
gene:ENSG00000000457_SCYL3 transcript:ENST00000367770_SCYL3-201,transcript:ENST00000367771_SCYL3-202,transcript:ENST00000367772_SCYL3-203,transcript:ENST00000423670_SCYL3-204 4358.65 4130.17 1455.18 4.98 5.37
gene:ENSG00000000460_C1orf112 transcript:ENST00000286031_C1orf112-201,transcript:ENST00000359326_C1orf112-202,transcript:ENST00000413811_C1orf112-203,transcript:ENST00000459772_C1orf112-204,transcript:ENST00000466580_C1orf112-205,transcript:ENST00000472795_C1orf112-206,transcript:ENST00000481744_C1orf112-207,transcript:ENST00000496973_C1orf112-208 3127.66 2899.18 1603.82 7.82 8.43

only difference here is that it didn't do the posterior mean and calc-ci option for the additional columns compared to the Blueprint sample above.

This is the same data format?

I wanted to use tximport as to have a library-size normalized TPM metrics for downstream analysis. If I understand you correctly, for my purpose of comparing genes not isoforms this is not necessary because edgeR/Glimma can provide depth-normalization based on counts and this I could use on the TPMs from RSEM?

ADD REPLYlink written 19 months ago by stephan.emmrich0

I think we're confusing some things here. TPM are already library size corrected. But TPM are not ideal for computing differential expression, because they have thrown out information about the precision of the abundance estimate. You can provide estimated counts to DESeq2, edgeR or limma and they will correct for library size. RSEM computes gene-level estimated counts the same way that tximport does, so I think you'll perhaps avoid more problems by assembling these matrices yourself.

ADD REPLYlink written 19 months ago by Michael Love26k

Ok, I will do that then. Thanks a lot for your help, very educating!!

One thing about TPMs though: If I would want to compare between gene abundances in one sample-set vs another (e.g. one species vs the other), then I would choose TPMs, right? And log2 transform them for downstream DGE analysis.

Now, when RSEM provides effective counts by abundance estimation (expectation maximization they call it in their paper Haas BJ et al, 2013 Nat Protocols), then the RSEM TPMs should include that already, or am I confusing sth again? If so, then RSEM TPMs would be scaledTPMs?

ADD REPLYlink written 19 months ago by stephan.emmrich0

Here's my recommendation, because I don't know if I can answer all of your questions, but I do want to help guide you to the right analysis: import gene-level estimated counts from your files (avoiding tximport only because you have some difficulty reading in these files). Pass this matrix of counts to your choice of DGE method in Bioconductor. You can work with the TPMs if you want, but then I don't have any software or analysis recommendations for you, because the methods I am familiar with make use of counts.

ADD REPLYlink written 19 months ago by Michael Love26k

I just updated tximport to v1.7.14 on GitHub with new error testing code. Can you try this version?

ADD REPLYlink written 19 months ago by Michael Love26k

Interestingly, now I get

Error: length(files) > 0 is not TRUE

I read that this is a common issue with RSEM and shouldn't this in your script already take care of it?

if (type == "rsem") {
# protect against 0 bp length transcripts
txi$length[txi$length < 1] <- 1
ADD REPLYlink modified 19 months ago by Michael Love26k • written 19 months ago by stephan.emmrich0

Good! I suspected this might be the issue so added a bit of code to protect you and future users from confusing error messages. The problem is that the first argument of tximport() needs to have at least 1 file. You are providing an empty vector.

ADD REPLYlink written 19 months ago by Michael Love26k

Hi ,I have this length(files) > 0 is not TRUE. I don't understand this last response. I paste my code here, could you let me know if you have any idea how to fix this:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("tximport")
library(tximport)
samples <- list.files(path = "home/data/Continuum_ChIPSeq/R_txtimport", full.names = T, pattern="\\.salmon$")
files <- file.path(samples, "quant.sf")
tx2gene <- read.delim("tx2gene_grch38_ens94.txt")
head(tx2gene)
txi <- tximport(files, type="salmon", tx2gene=tx2gene[,c("tx_id", "ensgene")], countsFromAbundance="lengthScaledTPM")
ADD REPLYlink written 10 weeks ago by jason.williams0
1

The error means that your files vector is empty. Try things out in your R session and you may be able to debug the issue yourself:

E.g. if R is telling you that the files vector is empty, take a look at the files vector by printing it:

> files

What do you see? And if this is indeed an empty vector, take a look at the line where you construct files. Maybe there is something about the samples object, so take a look at that:

> samples

etc.

ADD REPLYlink written 10 weeks ago by Michael Love26k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 417 users visited in the last hour