Error while reading rsem files in tximport
1
0
Entering edit mode
@mohammedtoufiq91-17679
Last seen 5 weeks ago
United States

Hi,

I have RSEM output genes result file containing gene_id, effective_length, transcript_id, expected_count, TPM , and FPKM data values. My understanding is that DESeq2 can work with expected counts as output by RSEM, then normalize, and perform differential gene expression analysis. Does creating a combined expected counts only csv file (all 18 samples together) > rounding expected counts and importing in DESeqDataSetFromMatrix' should work? (OR)

    library("DESeq2")
    library("tximport") 
    library("readr")
    library("tximportData")
    Data <- "/Users/Documents/Projects/" 
    txi <- tximport(files = Data, type = "rsem", txIn = FALSE, txOut = FALSE) 
    ddsTxi <- DESeqDataSetFromTximport(txi,
                                       colData = samples,
                                       design = ~ condition)

    reading in files with read_tsv
    1                                             ### Fails to run; there are 18 samples

Now, through import from tximport package using individual sample (18 files): I tried using tximport, but unsuccessful. But I could import via below option with only one sample file:

txi_TEST <- tximport(files = "/Users/Documents/Projects/Sample_1.genes.results.txt", 
                         type = "rsem", 
                         txIn = FALSE, 
                         txOut = FALSE, 
                         countsFromAbundance = "scaledTPM")

reading in files with read_tsv
1 
Warning message:
In computeRsemGeneLevel(files, importer, geneIdCol, abundanceCol,  :
  countsFromAbundance other than 'no' requires transcript-level estimates


Sample_1.genes.results.txt
Sample_2.genes.results.txt
Sample_3.genes.results.txt
Sample_4.genes.results.txt
Sample_5.genes.results.txt
etc.,

Here is the example of the one sample:

structure(list(gene_id = c("ENSG00000000003", "ENSG00000000005", 
                           "ENSG00000000419", "ENSG00000000457", "ENSG00000000460", "ENSG00000000938"
), transcript_id.s. = c("ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008", 
                        "ENST00000373031,ENST00000485971", "ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752", 
                        "ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238", 
                        "ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289", 
                        "ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472"
), length = c(2211.19, 940.5, 1071.41, 4571.09, 3321.38, 2238.22
), effective_length = c(2045.08, 774.39, 905.3, 4404.98, 3155.27, 
                        2072.11), expected_count = c(615.84, 0, 1712, 455.05, 224.03, 
                                                     1446), TPM = c(9.81, 0, 61.63, 3.37, 2.31, 22.74), FPKM = c(9.62, 
                                                                                                                 0, 60.42, 3.3, 2.27, 22.29)), row.names = c(NA, 6L), class = "data.frame")                
#>           gene_id
#> 1 ENSG00000000003
#> 2 ENSG00000000005
#> 3 ENSG00000000419
#> 4 ENSG00000000457
#> 5 ENSG00000000460
#> 6 ENSG00000000938
#>                                                                                                                                  transcript_id.s.
#> 1                                                                 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008
#> 2                                                                                                                 ENST00000373031,ENST00000485971
#> 3                                                 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752
#> 4                                                                 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238
#> 5 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289
#> 6                                 ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472
#>    length effective_length expected_count   TPM  FPKM
#> 1 2211.19          2045.08         615.84  9.81  9.62
#> 2  940.50           774.39           0.00  0.00  0.00
#> 3 1071.41           905.30        1712.00 61.63 60.42
#> 4 4571.09          4404.98         455.05  3.37  3.30
#> 5 3321.38          3155.27         224.03  2.31  2.27
#> 6 2238.22          2072.11        1446.00 22.74 22.29

Created on 2022-11-30 with [reprex v2.0.2](https://reprex.tidyverse.org)

Thank you in advance for your help.

Thank you,

Toufiq

DESeq2 tximport rsem RNASeq tximportData • 1.2k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

The files argument for tximport has to be a set of files rather than a directory that contains the files. Instead try

Data <- "/Users/Documents/Projects/" 
## here I assume that the pattern 'genes.results.txt' will unambiguously select for your RSEM files
fls <- dir(Data, "genes.results.txt$", full.names = TRUE)
txi <- tximport(files = Data, type = "rsem")

As ATpoint noted, using txIn = FALSE is probably not what you want to be doing here.

ADD COMMENT
0
Entering edit mode

Dear ATpoint and @james-w-macdonald-5106

Thank you. I used the setting because I have the RSEM sample.genes.results files can be imported by setting type = "rsem", txIn = FALSE, txOut = FALSE. As pointed in RSEM via tximportenter code here

I have RSEM output genes result file containing gene_id, effective_length, transcript_id, expected_count, TPM , and FPKM data values. My understanding was that DESeq2 can work with output by RSEM, then normalize, and perform differential gene expression analysis. Initially, I thought of creating a combined expected counts file (all samples together) > rounding them and importing in DESeqDataSetFromMatrix', however, I learnt that tximport supports data output files from RSEM, I thought I will go according to this library.

I could execute using tximport (OR) tximeta overcoming a couple of challenges using the below steps using 2 approaches. I think I would go with the first approach.

library("tximport")
library("readr")
library("tximportData")
library("tximeta")

#########   ############   ############  1. Import via tximport  ############   ############   ############

dir <- file.path("/Users/Documents/Projects")
list.files(dir)
library("readxl")
Sample_metadata <- read_excel("/Users/Documents/Projects/Sample_anno.xlsx", sheet = 1)
rownames(Sample_metadata) <- Sample_metadata$File_Name
files <- file.path(dir, "rsem", Sample_metadata$File_Name)
files
names(files) <- Sample_metadata$File_Name
all(file.exists(files))
txi = tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
head(txi$counts)
all(rownames(Sample_metadata)==colnames(txi[["counts"]]))

dds = DESeqDataSetFromTximport(txi, Sample_metadata, ~ Cohorts) ## Does not work because The issue is that RSEM here has estimated a gene length of zero, which is incompatible with our use of log average transcript length as an offset.
using counts and average transcript lengths from tximport
Error in DESeqDataSetFromTximport(txi, Sample_metadata, ~Cohorts) : 
  all(lengths > 0) is not TRUE

## https://bioinformatics.stackexchange.com/questions/13521/deseqdatasetfromtximport-alllengths-0-is-not-true
## For abudance
txi$abundance <-
  txi$abundance[apply(txi$length,
                      1,
                      function(row) all(row !=0 )),]
## For counts
txi$counts <-
  txi$counts[apply(txi$length,
                   1,
                   function(row) all(row !=0 )),]
txi$length <-
  txi$length[apply(txi$length,
                   1,
                   function(row) all(row !=0 )),]

ddsTxi = DESeqDataSetFromTximport(txi, Sample_metadata, ~ Cohorts)
dds <- DESeq(ddsTxi)
res <- results(dds)
res

#########   ############   ############   #########   ############   ############   #########   ############   ############

#########   ############   ############  2. Import via tximeta  ############   ############   ############

## Some demo code for reading in RSEM gene-level counts with tximeta and dealing with 0-length values.
## https://support.bioconductor.org/p/92763/

library(tximeta)
library(SummarizedExperiment)
dir <- file.path("/Users/Documents/Projects")
list.files(dir)
library("readxl")
Sample_metadata <- read_excel("/Users/Documents/Projects/Sample_anno.xlsx", sheet = 1)
class(Sample_metadata)
rownames(Sample_metadata) <- Sample_metadata$File_Name
files <- file.path(dir, "rsem", Sample_metadata$File_Name)
files
all(file.exists(files))

## http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#tximeta-for-import-with-automatic-metadata
coldata <- Sample_metadata
coldata$files <- files
coldata$names <- coldata$File_Name

?tximeta
se <- tximeta(coldata, type="rsem", txIn=FALSE, txOut=FALSE, skipMeta=TRUE)
# # set these as missing
assays(se)$length[ assays(se)$length == 0] <- NA 

## Examine how many genes have X missing values (consider X = half the samples):
idx <- rowSums(is.na(assays(se)$length)) >= 12    ## Out of 24 samples, I took 12 samples. Maybe I am wrong?
table(idx)
se <- se[!idx,]

##Impute lengths for the 0-length values:
library(impute)
length_imp <- impute.knn(assays(se)$length)
assays(se)$length <- length_imp$data
ddsTxi <- DESeqDataSet(se, design = ~ Cohorts)
dds <- DESeq(ddsTxi)
res <- results(dds)
res
ADD REPLY

Login before adding your answer.

Traffic: 819 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6