Question

should I use rsem's summary to genes or tximport's?

1

Entering edit mode

ty.thomson ▴ 10

@tythomson-10775

Last seen 8.5 years ago

The tximport vignette suggests to just use RSEM’s summary from transcripts to genes (ie, rsem.genes.results.txt) instead of using tximport to do the summary from the transcripts file (i.e., use rsem.isoforms.results.txt, set type="none", and supplying the appropriate column names as arguments). Playing around I noticed that the main difference between the two approaches is the effective length, which I need to use for differential expression analysis in edgeR. Is there a problem with how tximport gets the effective length from RSEM output? I want to use a different mapping from transcripts to genes and I want to be sure that there is no problem with doing so using tximport.

Thanks,

Ty

tximport • 2.8k views

ADD COMMENT • link updated 8.5 years ago by Michael Love 42k • written 8.5 years ago by ty.thomson ▴ 10

score 2 · Accepted Answer · 2016-05-26

2

Entering edit mode

Michael Love 42k

@mikelove

Last seen 12 hours ago

United States

hi Ty,

Can you give me an example of the differences in length? I'm sure they could exist given minor details in implementation but theoretically i think we're doing the same thing.

Can you calculate the RMSE for one sample's lengths across all genes (this will be on the bp scale):

sqrt(mean((x-y)^2))

And give an example of the largest difference:

which.max(abs(x-y))

Anything particular about this gene?

ADD COMMENT • link 8.5 years ago Michael Love 42k

0

Entering edit mode

Hi Michael,

In general the lengths are similar, but some are very different. In case it helps, I've also included the code that I'm using to generate the results.

> tx2ensembl <- read.delim(rsem.files[1],
+                          stringsAsFactors=F)[,c(1,2)]
> 
> txi_1 <- tximport(rsem.files[1], 
+                   type="none",
+                   txIn = TRUE,
+                   txIdCol = "transcript_id",
+                   abundanceCol = "FPKM",
+                   countsCol = "expected_count",
+                   lengthCol = "effective_length",
+                   reader=read_tsv,
+                   tx2gene=tx2ensembl)
reading in files
1 |================================================================================| 100%   12 MB

summarizing abundance
summarizing counts
summarizing length
> txi_2 <- tximport(sub(".isoforms", ".genes", rsem.files[1], fixed=T), 
+                   type="rsem",
+                   reader=read_tsv)
reading in files
1 
> 
> summary(txi_1$length[,1]-txi_2$length[,1])
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-21830.000      0.000      0.000     -4.188      0.000   5760.000 
> sqrt(mean((txi_1$length[,1]-txi_2$length[,1])^2))
[1] 158.1861
> which.max(abs(txi_1$length[,1]-txi_2$length[,1]))
ENSG00000205592 
          20295

Also, in order to make type="none" work I had to insert the following into the tximport function:

if (type == "none") {
  importer <- reader
}

Thanks,

Ty

PS I tried to get a minimal reproducible example, but reducing the file to a small number of genes/transcripts eliminated the differences, and I couldn't find any RSEM output files online to use instead. I know this type of question/big report is harder to address without actual data. I can email you the RSEM files directly if it would help.

ADD REPLY • link 8.5 years ago ty.thomson ▴ 10

0

Entering edit mode

If you look at the gene with the biggest discrepancy, anything particular about it? What length does tximport give and what length does RSEM give? What is the distribution of expression across isoforms for this gene? What are the different lengths of the isoforms?

ADD REPLY • link 8.5 years ago Michael Love 42k

0

Entering edit mode

This gene has 12 transcripts, most of which have lengths <700bp. Two transcripts have lengths in the low thousands of bps, and 1 transcript has length ~25,000bp. All transcripts have 0 expected counts except for the long transcript which has 1 expected count. RSEM gives an effective length of ~25,000bp for that gene, and tximport gives an expected length of ~2,800bp.

I looked at the next 5 worst genes (biggest length difference between RSEM and tximport) and they all showed the same pattern: a very long transcript with a small number of counts, one or more significantly shorter transcripts with no counts, and RSEM's effective length was much greater than tximport's.

There was a similar but less stark pattern when I searched out the genes where tximport's effective length was significantly greater than RSEM's: there was at least one large and one small transcript for these genes, and the transcripts all had a 0 or a small number of counts, although in some cases the long and short transcripts all had non-zero counts.

ADD REPLY • link 8.5 years ago ty.thomson ▴ 10

0

Entering edit mode

So for situations like this the standard error of the gene length is going to be in the kilobase range, no matter what.

tximport should be doing the same thing as RSEM unless the abundances are equal to zero, so I'm not sure.

But the good news is that this difference in length doesn't matter because these genes won't go into your edgeR analysis. They have no power for differential expression and you'll filter them out before analysis, right?

ADD REPLY • link 8.5 years ago Michael Love 42k

0

Entering edit mode

Many of these genes end up having a reasonable number of expected counts in other samples, so they don't get filtered out before the edgeR analysis. In the data set that I'm working with I end up with less than 5% of the differentially expressed genes differing between the two approaches (RSEM and tximport) across 3 different contrasts. So in the end it doesn't have a huge impact.

When I looked at the fold change and p-value of these genes, they all had nearly identical fold changes and the p-values were right around the magical 0.05 using both approaches, so really there was no substantial difference. I'm happy to know that there is a small difference that probably isn't going to mess with my results, and I can chalk this up as slightly different models that is pretty much at the level of rounding error. Thanks for your help!

ADD REPLY • link 8.5 years ago ty.thomson ▴ 10

0

Entering edit mode

Thanks for following up.

One more question: you said one of the transcripts has an expected count of 1. Is this transcript present in tx2gene?

ADD REPLY • link 8.5 years ago Michael Love 42k

0

Entering edit mode

Yup, that transcript is present in tx2gene. I get tx2gene from the isoforms.results.txt file, so nothing should be missing or differ between RSEM and tximport. Below are the rows for isoforms.results.txt for that gene in case it helps. I'll also email you the genes and isoforms rsem files in case it helps to see it all together. Thanks!

transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct
ENST00000398702	ENSG00000205592	524	337.89	0	0	0	0
ENST00000423284	ENSG00000205592	546	359.83	0	0	0	0
ENST00000424466	ENSG00000205592	1057	870.66	0	0	0	0
ENST00000427572	ENSG00000205592	548	361.82	0	0	0	0
ENST00000454784	ENSG00000205592	24775	24588.66	1	0	0	100
ENST00000474954	ENSG00000205592	4606	4419.66	0	0	0	0
ENST00000484665	ENSG00000205592	470	284.14	0	0	0	0
ENST00000492952	ENSG00000205592	644	457.7	0	0	0	0
ENST00000541039	ENSG00000205592	563	376.79	0	0	0	0
ENST00000542482	ENSG00000205592	347	163.42	0	0	0	0
ENST00000543564	ENSG00000205592	683	496.69	0	0	0	0
ENST00000546043	ENSG00000205592	550	363.82	0	0	0	0

ADD REPLY • link 8.5 years ago ty.thomson ▴ 10

0

Entering edit mode

Ok got it! I figured it out. In tximport we use the abundances to come up with the average transcript length. It is the weighted average, using the formula: 1/sum_iso {abundance_iso} sum_iso {length_iso * abundance_iso}. Here the abundances are all 0. So in this case, as far as tximport is concerned, the average transcript length is undefined for this sample and so we look to the other samples to derive an empirical estimate for the average transcript length.

https://github.com/Bioconductor-mirror/tximport/blob/master/R/tximport.R#L386

Thanks again for following up.

ADD REPLY • link 8.5 years ago Michael Love 42k

0

Entering edit mode

OK, here's a follow-up question. It sounds like rsem is already computing the gene-level TPM value using something similar to the method tximports uses. (Except, perhaps, for genes with a large variance in transcript lengths.) Is this the case? If so, what's the difference between rsem's "expected_count" and the output that tximport will produce from a rsem *.genes.results file. Can one expect some improvement by using tximport in this case?

Thanks,

- Bill

ADD REPLY • link 8.4 years ago William Kath • 0

0

Entering edit mode

For gene-level RSEM matrices, it is recommended to just read in the data from the "sample.genes.results" file using tximport. RSEM does the same thing that tximport would do if you read in the transcript level data and summarized, so there is no need to read in the transcript level files. tximport in this case is just a utility for binding the sample-wise vectors into three matrices.

ADD REPLY • link 8.4 years ago Michael Love 42k