Differences in gene length passed to DESeq2 by tximport reulted in different gene rankings
1
0
Entering edit mode
LD • 0
@ld-15871
Last seen 4.3 years ago
USA/Dallas

I have used DESeq2 for differential gene expression analysis and found discrepancies of gene ranking from two pipelines upstream of DESeq2: STAR-rsem-tximport; STAR-salmon-tximport.. Some of the discrepancies are quite large and I picked one gene here that ranked No98 from rsem-tximport but No.3 from salmon-tximport. It turned out that both rsem and salmon gave the same count numbers but they are significantly different in lengths. Any advice would be appreciated.

Here is the counts from rsem-tximport:

> txi$counts["ENSMUSG00000040170.13",] 28 7 21 34 6 26 45 12 15 50 38 27 48 35 33 2 23 49 14 4 139 108 119 561 100 76 242 206 208 116 405 201 54 78 98 84 285 52 91 88 5 9 3 98 137 111 Here is the counts from salmon-tximport > txi$counts["ENSMUSG00000040170.13",]
28         7        21        34         6        26        45        12
139.00000 108.00000 119.00000 561.00004 100.00000  76.00000 242.00001 206.00000
15        50        38        27        48        35        33         2
208.00000 116.00000 405.00000 201.00007  54.00000  78.00002  98.00000  84.00002
23        49        14         4         5         9         3
285.00001  52.00000  91.00000  88.00000  98.00000 137.00000 111.00001 

gene length from rsem-tximport

> txi$length["ENSMUSG00000040170.13",] 28 7 21 34 6 26 45 12 15 50 3549.18 3529.81 3202.68 2884.12 3425.77 3419.01 3424.29 3581.70 3648.28 3573.82 38 27 48 35 33 2 23 49 14 4 3636.29 2926.10 2974.95 3017.98 3649.82 2255.17 3631.78 3567.58 3545.08 3017.01 5 9 3 3746.35 3772.59 3405.43 gene length from salmon-tximport > txi$length["ENSMUSG00000040170.13", ]
28         7        21        34         6        26        45        12
319.5377 2249.3429 1809.2934  325.9756  947.2694 2014.2411  296.0048 1795.9675
15        50        38        27        48        35        33         2
402.6961  566.5026  462.6234  366.7090 2158.0163 1717.6544  950.9149  595.1135
23        49        14         4         5         9         3
879.1492 1908.8248  862.0149  291.5122 3475.7957 2344.3731  552.1021

deseq2 tximport rsem saimon • 1.6k views
0
Entering edit mode
@mikelove
Last seen 4 days ago
United States

The effective length of the gene is estimated by both software and is a critical normalization factor. So I'm not totally surprised that different estimated lengths lead to different LFC and therefore gene ranking. If there is a gene that doesn't have multi-mapping reads, then the gene count (uncorrected for bias or isoform usage) is straightforward to estimate, whereas the trick is to estimate the biases and isoform abundances.

I don't know how you ran the software, but Salmon can take into account for example GC bias and we've shown on numerous real datasets that this can improve quantification if there happens to be any GC dependence.

0
Entering edit mode

Thanks for the comments. It seems that tximport takes effective length directly from rsem gene.results file. Look at the effective length from rsem output for sample 9, it has the exact number from tximport length above.

> genes.results.9["ENSMUSG00000040170.13","effective_length"]
[1] 3772.59


But for salmon, tximport has to calculates effective length from salmon transcripts quantification file. So it is tximport vs rsem, not sure which gene length is more appropriate.

I did not do --gcBias with salmon. I will run again with the flag turned on and see there is any differences.

0
Entering edit mode

tximport uses the estimated effective length of the transcripts, and computes a weighted average of these to produce a gene length.

So either the effective length of the transcripts could be different between Salmon and RSEM, or the isoform usage, or both, to produce a difference in the estimated gene length.

If GC bias was not used, then I would guess it's more differences in the isoform usage that are changing gene length.

0
Entering edit mode

Hi Michael, I just want to update to your comment above.

What I found is that both tximport-rsem and tximport-salmon generate highly correlated gene counts but very different gene length estimations. As shown below I have 53456 genes, and 40617 have the same counts across all samples.

> j = 0
> for ( i in 1:53456){
+ if( all(round(txi_rsem$counts[i,]) == round(txi_salmon$counts[i,])) == "TRUE" )(j <- j+1)
+ }
> j
[1] 40617​

Looking at gene length, not only the numbers are different, but the orders of sample (sorted by gene length) are different as well. As shown below there are only 1295 genes that have the same order.

> j = 0
> for ( i in 1:53456){
+ if( all(names(sort(txi_rsem$length[i,])) == names(sort(txi_salmon$length[i,]))) == "TRUE" )(j <- j+1)
+ }
> j
[1] 1295
​


I also compared the transcript length and they are very similar. Do you have comments on which gene length estimates is better?

PS. I did --gcBias with salmon, both gene counts and lengths are changed but the gene ranking are similar to the original salmon, relatively to rsem.

0
Entering edit mode

I don't really have any more informative comments on this. I wouldn't use code like the above to compare rank, but instead simply plotting rank(x) vs rank(y).

0
Entering edit mode

I wasn't trying to compare the ranks but was trying to show that it is the gene length estimation that drives differences of DESeq2 results. I wish someone could work it out that the top two rated quantification software have discrepancies that is large enough to puzzle biologist. I couldn't find any paper that compare these two software head to head. Thanks again for your comments.

0
Entering edit mode

I just noticed that in Salmon version 0.11.0 a fix has been applied to correct a bug that for multi-isoform genes resulted in the wrong calculation of lengths and effective lengths of genes. Although this was, according to the release notes, (apparently) *only* an issue when these numbers were calculated within Salmon, and not when using tximport, it may be worth to check this in more detail.

See Salmon v0.11.0 release notes here.

0
Entering edit mode

Thanks. We only use the effective transcript lengths and the transcript abundances in tximport to calculate the average transcript length (of a gene), so any miscalculation in gene lengths upstream won’t affect tximport results.