Question: should I use rsem's summary to genes or tximport's?
1
3.0 years ago by
ty.thomson10
ty.thomson10 wrote:

The tximport vignette suggests to just use RSEM’s summary from transcripts to genes (ie, rsem.genes.results.txt) instead of using tximport to do the summary from the transcripts file (i.e., use rsem.isoforms.results.txt, set type="none", and supplying the appropriate column names as arguments).  Playing around I noticed that the main difference between the two approaches is the effective length, which I need to use for differential expression analysis in edgeR.  Is there a problem with how tximport gets the effective length from RSEM output?  I want to use a different mapping from transcripts to genes and I want to be sure that there is no problem with doing so using tximport.

Thanks,

Ty

tximport • 1.2k views
modified 3.0 years ago by Michael Love23k • written 3.0 years ago by ty.thomson10
Answer: should I use rsem's summary to genes or tximport's?
2
3.0 years ago by
Michael Love23k
United States
Michael Love23k wrote:

hi Ty,

Can you give me an example of the differences in length? I'm sure they could exist given minor details in implementation but theoretically i think we're doing the same thing.

Can you calculate the RMSE for one sample's lengths across all genes (this will be on the bp scale):

sqrt(mean((x-y)^2))

And give an example of the largest difference:

which.max(abs(x-y))

Hi Michael,

In general the lengths are similar, but some are very different.  In case it helps, I've also included the code that I'm using to generate the results.

> tx2ensembl <- read.delim(rsem.files[1],
+                          stringsAsFactors=F)[,c(1,2)]
>
> txi_1 <- tximport(rsem.files[1],
+                   type="none",
+                   txIn = TRUE,
+                   txIdCol = "transcript_id",
+                   abundanceCol = "FPKM",
+                   countsCol = "expected_count",
+                   lengthCol = "effective_length",
+                   tx2gene=tx2ensembl)
1 |================================================================================| 100%   12 MB

summarizing abundance
summarizing counts
summarizing length
> txi_2 <- tximport(sub(".isoforms", ".genes", rsem.files[1], fixed=T),
+                   type="rsem",
1
>
> summary(txi_1$length[,1]-txi_2$length[,1])
Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-21830.000      0.000      0.000     -4.188      0.000   5760.000
> sqrt(mean((txi_1$length[,1]-txi_2$length[,1])^2))
[1] 158.1861
> which.max(abs(txi_1$length[,1]-txi_2$length[,1]))
ENSG00000205592
20295 

Also, in order to make type="none" work I had to insert the following into the tximport function:

if (type == "none") {
}

Thanks,

Ty

PS I tried to get a minimal reproducible example, but reducing the file to a small number of genes/transcripts eliminated the differences, and I couldn't find any RSEM output files online to use instead.  I know this type of question/big report is harder to address without actual data.  I can email you the RSEM files directly if it would help.

If you look at the gene with the biggest discrepancy, anything particular about it? What length does tximport give and what length does RSEM give? What is the distribution of expression across isoforms for this gene? What are the different lengths of the isoforms?

This gene has 12 transcripts, most of which have lengths <700bp.  Two transcripts have lengths in the low thousands of bps, and 1 transcript has length ~25,000bp.  All transcripts have 0 expected counts except for the long transcript which has 1 expected count.   RSEM gives an effective length of ~25,000bp for that gene, and tximport gives an expected length of ~2,800bp.

I looked at the next 5 worst genes (biggest length difference between RSEM and tximport) and they all showed the same pattern: a very long transcript with a small number of counts, one or more significantly shorter transcripts with no counts, and RSEM's effective length was much greater than tximport's.

There was a similar but less stark pattern when I searched out the genes where tximport's effective length was significantly greater than RSEM's: there was at least one large and one small transcript for these genes, and the transcripts all had a 0 or a small number of counts, although in some cases the long and short transcripts all had non-zero counts.

So for situations like this the standard error of the gene length is going to be in the kilobase range, no matter what.

tximport should be doing the same thing as RSEM unless the abundances are equal to zero, so I'm not sure.

But the good news is that this difference in length doesn't matter because these genes won't go into your edgeR analysis. They have no power for differential expression and you'll filter them out before analysis, right?

Many of these genes end up having a reasonable number of expected counts in other samples, so they don't get filtered out before the edgeR analysis.  In the data set that I'm working with I end up with less than 5% of the differentially expressed genes differing between the two approaches (RSEM and tximport) across 3 different contrasts.  So in the end it doesn't have a huge impact.

When I looked at the fold change and p-value of these genes, they all had nearly identical fold changes and the p-values were right around the magical 0.05 using both approaches, so really there was no substantial difference.  I'm happy to know that there is a small difference that probably isn't going to mess with my results, and I can chalk this up as slightly different models that is pretty much at the level of rounding error.  Thanks for your help!

Thanks for following up.

One more question: you said one of the transcripts has an expected count of 1. Is this transcript present in tx2gene?

Yup, that transcript is present in tx2gene.  I get tx2gene from the isoforms.results.txt file, so nothing should be missing or differ between RSEM and tximport.  Below are the rows for isoforms.results.txt for that gene in case it helps.  I'll also email you the genes and isoforms rsem files in case it helps to see it all together.  Thanks!

 transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct ENST00000398702 ENSG00000205592 524 337.89 0 0 0 0 ENST00000423284 ENSG00000205592 546 359.83 0 0 0 0 ENST00000424466 ENSG00000205592 1057 870.66 0 0 0 0 ENST00000427572 ENSG00000205592 548 361.82 0 0 0 0 ENST00000454784 ENSG00000205592 24775 24588.66 1 0 0 100 ENST00000474954 ENSG00000205592 4606 4419.66 0 0 0 0 ENST00000484665 ENSG00000205592 470 284.14 0 0 0 0 ENST00000492952 ENSG00000205592 644 457.7 0 0 0 0 ENST00000541039 ENSG00000205592 563 376.79 0 0 0 0 ENST00000542482 ENSG00000205592 347 163.42 0 0 0 0 ENST00000543564 ENSG00000205592 683 496.69 0 0 0 0 ENST00000546043 ENSG00000205592 550 363.82 0 0 0 0

Ok got it! I figured it out. In tximport we use the abundances to come up with the average transcript length. It is the weighted average, using the formula: 1/sum_iso {abundance_iso} sum_iso {length_iso * abundance_iso}. Here the abundances are all 0. So in this case, as far as tximport is concerned, the average transcript length is undefined for this sample and so we look to the other samples to derive an empirical estimate for the average transcript length.

https://github.com/Bioconductor-mirror/tximport/blob/master/R/tximport.R#L386

Thanks again for following up.

OK, here's a follow-up question.   It sounds like rsem is already computing the gene-level TPM value using something similar to the method tximports uses.   (Except, perhaps, for genes with a large variance in transcript lengths.)    Is this the case?   If so, what's the difference between rsem's "expected_count" and the output that tximport will produce from a rsem *.genes.results file.   Can one expect some improvement by using tximport in this case?

Thanks,

- Bill

For gene-level RSEM matrices, it is recommended to just read in the data from the "sample.genes.results" file using tximport. RSEM does the same thing that tximport would do if you read in the transcript level data and summarized, so there is no need to read in the transcript level files. tximport in this case is just a utility for binding the sample-wise vectors into three matrices.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.