Question

Error while reading rsem files in tximport

0

Entering edit mode

mohammedtoufiq91 ▴ 10

@mohammedtoufiq91-17679

Last seen 5 weeks ago

United States

Hi,

I have RSEM output genes result file containing gene_id, effective_length, transcript_id, expected_count, TPM , and FPKM data values. My understanding is that DESeq2 can work with expected counts as output by RSEM, then normalize, and perform differential gene expression analysis. Does creating a combined expected counts only csv file (all 18 samples together) > rounding expected counts and importing in DESeqDataSetFromMatrix' should work? (OR)

    library("DESeq2")
    library("tximport") 
    library("readr")
    library("tximportData")
    Data <- "/Users/Documents/Projects/" 
    txi <- tximport(files = Data, type = "rsem", txIn = FALSE, txOut = FALSE) 
    ddsTxi <- DESeqDataSetFromTximport(txi,
                                       colData = samples,
                                       design = ~ condition)

    reading in files with read_tsv
    1                                             ### Fails to run; there are 18 samples

Now, through import from `tximport` package using individual sample (18 files): I tried using `tximport`, but unsuccessful. But I could import via below option with only one sample file:

txi_TEST <- tximport(files = "/Users/Documents/Projects/Sample_1.genes.results.txt", 
                         type = "rsem", 
                         txIn = FALSE, 
                         txOut = FALSE, 
                         countsFromAbundance = "scaledTPM")

reading in files with read_tsv
1 
Warning message:
In computeRsemGeneLevel(files, importer, geneIdCol, abundanceCol,  :
  countsFromAbundance other than 'no' requires transcript-level estimates


Sample_1.genes.results.txt
Sample_2.genes.results.txt
Sample_3.genes.results.txt
Sample_4.genes.results.txt
Sample_5.genes.results.txt
etc.,

Here is the example of the one sample:

structure(list(gene_id = c("ENSG00000000003", "ENSG00000000005", 
                           "ENSG00000000419", "ENSG00000000457", "ENSG00000000460", "ENSG00000000938"
), transcript_id.s. = c("ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008", 
                        "ENST00000373031,ENST00000485971", "ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752", 
                        "ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238", 
                        "ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289", 
                        "ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472"
), length = c(2211.19, 940.5, 1071.41, 4571.09, 3321.38, 2238.22
), effective_length = c(2045.08, 774.39, 905.3, 4404.98, 3155.27, 
                        2072.11), expected_count = c(615.84, 0, 1712, 455.05, 224.03, 
                                                     1446), TPM = c(9.81, 0, 61.63, 3.37, 2.31, 22.74), FPKM = c(9.62, 
                                                                                                                 0, 60.42, 3.3, 2.27, 22.29)), row.names = c(NA, 6L), class = "data.frame")                
#>           gene_id
#> 1 ENSG00000000003
#> 2 ENSG00000000005
#> 3 ENSG00000000419
#> 4 ENSG00000000457
#> 5 ENSG00000000460
#> 6 ENSG00000000938
#>                                                                                                                                  transcript_id.s.
#> 1                                                                 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008
#> 2                                                                                                                 ENST00000373031,ENST00000485971
#> 3                                                 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752
#> 4                                                                 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238
#> 5 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289
#> 6                                 ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472
#>    length effective_length expected_count   TPM  FPKM
#> 1 2211.19          2045.08         615.84  9.81  9.62
#> 2  940.50           774.39           0.00  0.00  0.00
#> 3 1071.41           905.30        1712.00 61.63 60.42
#> 4 4571.09          4404.98         455.05  3.37  3.30
#> 5 3321.38          3155.27         224.03  2.31  2.27
#> 6 2238.22          2072.11        1446.00 22.74 22.29

^{Created on 2022-11-30 with [reprex v2.0.2](https://reprex.tidyverse.org)}

Thank you in advance for your help.

Thank you,

Toufiq

DESeq2 tximport rsem RNASeq tximportData • 1.2k views

ADD COMMENT • link 16 months ago mohammedtoufiq91 ▴ 10

score 1 · Answer 1 · 2022-12-01

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

The files argument for tximport has to be a set of files rather than a directory that contains the files. Instead try

Data <- "/Users/Documents/Projects/" 
## here I assume that the pattern 'genes.results.txt' will unambiguously select for your RSEM files
fls <- dir(Data, "genes.results.txt$", full.names = TRUE)
txi <- tximport(files = Data, type = "rsem")

As ATpoint noted, using txIn = FALSE is probably not what you want to be doing here.

ADD COMMENT • link 16 months ago James W. MacDonald 65k

0

Entering edit mode

Dear ATpoint and @james-w-macdonald-5106

Thank you. I used the setting because I have the RSEM sample.genes.results files can be imported by setting type = "rsem", txIn = FALSE, txOut = FALSE. As pointed in RSEM via tximportenter code here

I have RSEM output genes result file containing gene_id, effective_length, transcript_id, expected_count, TPM , and FPKM data values. My understanding was that DESeq2 can work with output by RSEM, then normalize, and perform differential gene expression analysis. Initially, I thought of creating a combined expected counts file (all samples together) > rounding them and importing in DESeqDataSetFromMatrix', however, I learnt that tximport supports data output files from RSEM, I thought I will go according to this library.

I could execute using tximport (OR) tximeta overcoming a couple of challenges using the below steps using 2 approaches. I think I would go with the first approach.

library("tximport")
library("readr")
library("tximportData")
library("tximeta")

#########   ############   ############  1. Import via tximport  ############   ############   ############

dir <- file.path("/Users/Documents/Projects")
list.files(dir)
library("readxl")
Sample_metadata <- read_excel("/Users/Documents/Projects/Sample_anno.xlsx", sheet = 1)
rownames(Sample_metadata) <- Sample_metadata$File_Name
files <- file.path(dir, "rsem", Sample_metadata$File_Name)
files
names(files) <- Sample_metadata$File_Name
all(file.exists(files))
txi = tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
head(txi$counts)
all(rownames(Sample_metadata)==colnames(txi[["counts"]]))

dds = DESeqDataSetFromTximport(txi, Sample_metadata, ~ Cohorts) ## Does not work because The issue is that RSEM here has estimated a gene length of zero, which is incompatible with our use of log average transcript length as an offset.
using counts and average transcript lengths from tximport
Error in DESeqDataSetFromTximport(txi, Sample_metadata, ~Cohorts) : 
  all(lengths > 0) is not TRUE

## https://bioinformatics.stackexchange.com/questions/13521/deseqdatasetfromtximport-alllengths-0-is-not-true
## For abudance
txi$abundance <-
  txi$abundance[apply(txi$length,
                      1,
                      function(row) all(row !=0 )),]
## For counts
txi$counts <-
  txi$counts[apply(txi$length,
                   1,
                   function(row) all(row !=0 )),]
txi$length <-
  txi$length[apply(txi$length,
                   1,
                   function(row) all(row !=0 )),]

ddsTxi = DESeqDataSetFromTximport(txi, Sample_metadata, ~ Cohorts)
dds <- DESeq(ddsTxi)
res <- results(dds)
res

#########   ############   ############   #########   ############   ############   #########   ############   ############

#########   ############   ############  2. Import via tximeta  ############   ############   ############

## Some demo code for reading in RSEM gene-level counts with tximeta and dealing with 0-length values.
## https://support.bioconductor.org/p/92763/

library(tximeta)
library(SummarizedExperiment)
dir <- file.path("/Users/Documents/Projects")
list.files(dir)
library("readxl")
Sample_metadata <- read_excel("/Users/Documents/Projects/Sample_anno.xlsx", sheet = 1)
class(Sample_metadata)
rownames(Sample_metadata) <- Sample_metadata$File_Name
files <- file.path(dir, "rsem", Sample_metadata$File_Name)
files
all(file.exists(files))

## http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#tximeta-for-import-with-automatic-metadata
coldata <- Sample_metadata
coldata$files <- files
coldata$names <- coldata$File_Name

?tximeta
se <- tximeta(coldata, type="rsem", txIn=FALSE, txOut=FALSE, skipMeta=TRUE)
# # set these as missing
assays(se)$length[ assays(se)$length == 0] <- NA 

## Examine how many genes have X missing values (consider X = half the samples):
idx <- rowSums(is.na(assays(se)$length)) >= 12    ## Out of 24 samples, I took 12 samples. Maybe I am wrong?
table(idx)
se <- se[!idx,]

##Impute lengths for the 0-length values:
library(impute)
length_imp <- impute.knn(assays(se)$length)
assays(se)$length <- length_imp$data
ddsTxi <- DESeqDataSet(se, design = ~ Cohorts)
dds <- DESeq(ddsTxi)
res <- results(dds)
res

ADD REPLY • link 16 months ago mohammedtoufiq91 ▴ 10

Now, through import from tximport package using individual sample (18 files): I tried using tximport, but unsuccessful. But I could import via below option with only one sample file:

Now, through import from `tximport` package using individual sample (18 files): I tried using `tximport`, but unsuccessful. But I could import via below option with only one sample file: