Question: should I use rsem's summary to genes or tximport's?
1
gravatar for ty.thomson
3.4 years ago by
ty.thomson10
ty.thomson10 wrote:

The tximport vignette suggests to just use RSEM’s summary from transcripts to genes (ie, rsem.genes.results.txt) instead of using tximport to do the summary from the transcripts file (i.e., use rsem.isoforms.results.txt, set type="none", and supplying the appropriate column names as arguments).  Playing around I noticed that the main difference between the two approaches is the effective length, which I need to use for differential expression analysis in edgeR.  Is there a problem with how tximport gets the effective length from RSEM output?  I want to use a different mapping from transcripts to genes and I want to be sure that there is no problem with doing so using tximport. 

Thanks,

Ty

 

 

tximport • 1.3k views
ADD COMMENTlink modified 3.4 years ago by Michael Love25k • written 3.4 years ago by ty.thomson10
Answer: should I use rsem's summary to genes or tximport's?
2
gravatar for Michael Love
3.4 years ago by
Michael Love25k
United States
Michael Love25k wrote:

hi Ty,

Can you give me an example of the differences in length? I'm sure they could exist given minor details in implementation but theoretically i think we're doing the same thing.

Can you calculate the RMSE for one sample's lengths across all genes (this will be on the bp scale):

sqrt(mean((x-y)^2))

And give an example of the largest difference:

which.max(abs(x-y))

Anything particular about this gene?

ADD COMMENTlink written 3.4 years ago by Michael Love25k

Hi Michael,

In general the lengths are similar, but some are very different.  In case it helps, I've also included the code that I'm using to generate the results.   

> tx2ensembl <- read.delim(rsem.files[1],
+                          stringsAsFactors=F)[,c(1,2)]
> 
> txi_1 <- tximport(rsem.files[1], 
+                   type="none",
+                   txIn = TRUE,
+                   txIdCol = "transcript_id",
+                   abundanceCol = "FPKM",
+                   countsCol = "expected_count",
+                   lengthCol = "effective_length",
+                   reader=read_tsv,
+                   tx2gene=tx2ensembl)
reading in files
1 |================================================================================| 100%   12 MB

summarizing abundance
summarizing counts
summarizing length
> txi_2 <- tximport(sub(".isoforms", ".genes", rsem.files[1], fixed=T), 
+                   type="rsem",
+                   reader=read_tsv)
reading in files
1 
> 
> summary(txi_1$length[,1]-txi_2$length[,1])
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-21830.000      0.000      0.000     -4.188      0.000   5760.000 
> sqrt(mean((txi_1$length[,1]-txi_2$length[,1])^2))
[1] 158.1861
> which.max(abs(txi_1$length[,1]-txi_2$length[,1]))
ENSG00000205592 
          20295 

 

Also, in order to make type="none" work I had to insert the following into the tximport function:

if (type == "none") {
  importer <- reader
}

 

Thanks,

Ty

PS I tried to get a minimal reproducible example, but reducing the file to a small number of genes/transcripts eliminated the differences, and I couldn't find any RSEM output files online to use instead.  I know this type of question/big report is harder to address without actual data.  I can email you the RSEM files directly if it would help.

ADD REPLYlink written 3.4 years ago by ty.thomson10

If you look at the gene with the biggest discrepancy, anything particular about it? What length does tximport give and what length does RSEM give? What is the distribution of expression across isoforms for this gene? What are the different lengths of the isoforms?

ADD REPLYlink written 3.4 years ago by Michael Love25k

This gene has 12 transcripts, most of which have lengths <700bp.  Two transcripts have lengths in the low thousands of bps, and 1 transcript has length ~25,000bp.  All transcripts have 0 expected counts except for the long transcript which has 1 expected count.   RSEM gives an effective length of ~25,000bp for that gene, and tximport gives an expected length of ~2,800bp.

I looked at the next 5 worst genes (biggest length difference between RSEM and tximport) and they all showed the same pattern: a very long transcript with a small number of counts, one or more significantly shorter transcripts with no counts, and RSEM's effective length was much greater than tximport's.

There was a similar but less stark pattern when I searched out the genes where tximport's effective length was significantly greater than RSEM's: there was at least one large and one small transcript for these genes, and the transcripts all had a 0 or a small number of counts, although in some cases the long and short transcripts all had non-zero counts. 

 

ADD REPLYlink written 3.4 years ago by ty.thomson10

So for situations like this the standard error of the gene length is going to be in the kilobase range, no matter what. 

tximport should be doing the same thing as RSEM unless the abundances are equal to zero, so I'm not sure.

But the good news is that this difference in length doesn't matter because these genes won't go into your edgeR analysis. They have no power for differential expression and you'll filter them out before analysis, right?

ADD REPLYlink written 3.4 years ago by Michael Love25k

Many of these genes end up having a reasonable number of expected counts in other samples, so they don't get filtered out before the edgeR analysis.  In the data set that I'm working with I end up with less than 5% of the differentially expressed genes differing between the two approaches (RSEM and tximport) across 3 different contrasts.  So in the end it doesn't have a huge impact.  

When I looked at the fold change and p-value of these genes, they all had nearly identical fold changes and the p-values were right around the magical 0.05 using both approaches, so really there was no substantial difference.  I'm happy to know that there is a small difference that probably isn't going to mess with my results, and I can chalk this up as slightly different models that is pretty much at the level of rounding error.  Thanks for your help!

ADD REPLYlink written 3.4 years ago by ty.thomson10

Thanks for following up.

One more question: you said one of the transcripts has an expected count of 1. Is this transcript present in tx2gene?

ADD REPLYlink written 3.4 years ago by Michael Love25k

Yup, that transcript is present in tx2gene.  I get tx2gene from the isoforms.results.txt file, so nothing should be missing or differ between RSEM and tximport.  Below are the rows for isoforms.results.txt for that gene in case it helps.  I'll also email you the genes and isoforms rsem files in case it helps to see it all together.  Thanks!

transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
ENST00000398702 ENSG00000205592 524 337.89 0 0 0 0
ENST00000423284 ENSG00000205592 546 359.83 0 0 0 0
ENST00000424466 ENSG00000205592 1057 870.66 0 0 0 0
ENST00000427572 ENSG00000205592 548 361.82 0 0 0 0
ENST00000454784 ENSG00000205592 24775 24588.66 1 0 0 100
ENST00000474954 ENSG00000205592 4606 4419.66 0 0 0 0
ENST00000484665 ENSG00000205592 470 284.14 0 0 0 0
ENST00000492952 ENSG00000205592 644 457.7 0 0 0 0
ENST00000541039 ENSG00000205592 563 376.79 0 0 0 0
ENST00000542482 ENSG00000205592 347 163.42 0 0 0 0
ENST00000543564 ENSG00000205592 683 496.69 0 0 0 0
ENST00000546043 ENSG00000205592 550 363.82 0 0 0 0
ADD REPLYlink written 3.4 years ago by ty.thomson10

Ok got it! I figured it out. In tximport we use the abundances to come up with the average transcript length. It is the weighted average, using the formula: 1/sum_iso {abundance_iso} sum_iso {length_iso * abundance_iso}. Here the abundances are all 0. So in this case, as far as tximport is concerned, the average transcript length is undefined for this sample and so we look to the other samples to derive an empirical estimate for the average transcript length.

https://github.com/Bioconductor-mirror/tximport/blob/master/R/tximport.R#L386

Thanks again for following up.

ADD REPLYlink written 3.4 years ago by Michael Love25k

OK, here's a follow-up question.   It sounds like rsem is already computing the gene-level TPM value using something similar to the method tximports uses.   (Except, perhaps, for genes with a large variance in transcript lengths.)    Is this the case?   If so, what's the difference between rsem's "expected_count" and the output that tximport will produce from a rsem *.genes.results file.   Can one expect some improvement by using tximport in this case?

Thanks,

- Bill

 

ADD REPLYlink written 3.3 years ago by William Kath0

For gene-level RSEM matrices, it is recommended to just read in the data from the "sample.genes.results" file using tximport. RSEM does the same thing that tximport would do if you read in the transcript level data and summarized, so there is no need to read in the transcript level files. tximport in this case is just a utility for binding the sample-wise vectors into three matrices.

ADD REPLYlink written 3.3 years ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 324 users visited in the last hour