The tximport vignette suggests to just use RSEM’s summary from transcripts to genes (ie, rsem.genes.results.txt) instead of using tximport to do the summary from the transcripts file (i.e., use rsem.isoforms.results.txt, set type="none"
, and supplying the appropriate column names as arguments). Playing around I noticed that the main difference between the two approaches is the effective length, which I need to use for differential expression analysis in edgeR. Is there a problem with how tximport gets the effective length from RSEM output? I want to use a different mapping from transcripts to genes and I want to be sure that there is no problem with doing so using tximport.
Thanks,
Ty
Hi Michael,
In general the lengths are similar, but some are very different. In case it helps, I've also included the code that I'm using to generate the results.
Also, in order to make
type="none"
work I had to insert the following into the tximport function:Thanks,
Ty
PS I tried to get a minimal reproducible example, but reducing the file to a small number of genes/transcripts eliminated the differences, and I couldn't find any RSEM output files online to use instead. I know this type of question/big report is harder to address without actual data. I can email you the RSEM files directly if it would help.
If you look at the gene with the biggest discrepancy, anything particular about it? What length does tximport give and what length does RSEM give? What is the distribution of expression across isoforms for this gene? What are the different lengths of the isoforms?
This gene has 12 transcripts, most of which have lengths <700bp. Two transcripts have lengths in the low thousands of bps, and 1 transcript has length ~25,000bp. All transcripts have 0 expected counts except for the long transcript which has 1 expected count. RSEM gives an effective length of ~25,000bp for that gene, and tximport gives an expected length of ~2,800bp.
I looked at the next 5 worst genes (biggest length difference between RSEM and tximport) and they all showed the same pattern: a very long transcript with a small number of counts, one or more significantly shorter transcripts with no counts, and RSEM's effective length was much greater than tximport's.
There was a similar but less stark pattern when I searched out the genes where tximport's effective length was significantly greater than RSEM's: there was at least one large and one small transcript for these genes, and the transcripts all had a 0 or a small number of counts, although in some cases the long and short transcripts all had non-zero counts.
So for situations like this the standard error of the gene length is going to be in the kilobase range, no matter what.
tximport should be doing the same thing as RSEM unless the abundances are equal to zero, so I'm not sure.
But the good news is that this difference in length doesn't matter because these genes won't go into your edgeR analysis. They have no power for differential expression and you'll filter them out before analysis, right?
Many of these genes end up having a reasonable number of expected counts in other samples, so they don't get filtered out before the edgeR analysis. In the data set that I'm working with I end up with less than 5% of the differentially expressed genes differing between the two approaches (RSEM and tximport) across 3 different contrasts. So in the end it doesn't have a huge impact.
When I looked at the fold change and p-value of these genes, they all had nearly identical fold changes and the p-values were right around the magical 0.05 using both approaches, so really there was no substantial difference. I'm happy to know that there is a small difference that probably isn't going to mess with my results, and I can chalk this up as slightly different models that is pretty much at the level of rounding error. Thanks for your help!
Thanks for following up.
One more question: you said one of the transcripts has an expected count of 1. Is this transcript present in tx2gene?
Yup, that transcript is present in tx2gene. I get tx2gene from the isoforms.results.txt file, so nothing should be missing or differ between RSEM and tximport. Below are the rows for isoforms.results.txt for that gene in case it helps. I'll also email you the genes and isoforms rsem files in case it helps to see it all together. Thanks!
Ok got it! I figured it out. In tximport we use the abundances to come up with the average transcript length. It is the weighted average, using the formula: 1/sum_iso {abundance_iso} sum_iso {length_iso * abundance_iso}. Here the abundances are all 0. So in this case, as far as tximport is concerned, the average transcript length is undefined for this sample and so we look to the other samples to derive an empirical estimate for the average transcript length.
https://github.com/Bioconductor-mirror/tximport/blob/master/R/tximport.R#L386
Thanks again for following up.
OK, here's a follow-up question. It sounds like rsem is already computing the gene-level TPM value using something similar to the method tximports uses. (Except, perhaps, for genes with a large variance in transcript lengths.) Is this the case? If so, what's the difference between rsem's "expected_count" and the output that tximport will produce from a rsem *.genes.results file. Can one expect some improvement by using tximport in this case?
Thanks,
- Bill
For gene-level RSEM matrices, it is recommended to just read in the data from the "sample.genes.results" file using tximport. RSEM does the same thing that tximport would do if you read in the transcript level data and summarized, so there is no need to read in the transcript level files. tximport in this case is just a utility for binding the sample-wise vectors into three matrices.