mean-variance trend plots with isoform and gene-level data from tximport
Ina Hoeschele
@ina-hoeschele-2992

Hi, I have isoform-level deep RNA-seq data from StringTie on about 800 people. I have created gene-level and isoform-level data using tximport. I have just looked at the voom mean-variance trend plots. At the gene level, the plot looks as usual (the trend decreases at low expression levels and then levels off). For the isoform data, the plot looks different: the fitted curve continues to decrease all along the x-axis without ever leveling off (so no variance stabilization). I have done this both with loose and with quite stringent filtering (requiring at least 15 reads in 75% of individuals for the stringent filtering), and in both cases the plot looks the same. We do not need variance stabilization since we use the voom weights in subsequent analyses, but is this type of plot of any concern? Has this been seen for other isoform data? As expected, the isoform data points cluster strongly on the left side of the graph, while the gene-level data cluster more in the middle. Thank you for any comments.

tximport isoform voom
@mikelove

Ina,

Not an answer to your question, but a side note: it would be interesting to see whether uncertainty in the isoform-level quantification is playing a role. With Salmon this is possible by adding arguments such as --numGibbsSamples 30 --thinningFactor 100, or alternatively --numBootstraps 30. Then tximport will import the inferential replicates in addition to the counts (or, with varReduce=TRUE, it will compute inferential variance matrices for you, as a new matrix in the txi list). I know you have StringTie quantification here, but maybe it would be possible to experiment with running transcript quantification with Salmon on a few samples to investigate how it relates to the pattern you see.
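For concreteness, a minimal sketch of that workflow in R; the directory layout and sample names are hypothetical, and the Salmon invocation is shown only as a comment:

```r
## Hypothetical sketch: first quantify each sample with Salmon, e.g.
##   salmon quant -i txindex -l A -1 s1_1.fq.gz -2 s1_2.fq.gz \
##     --numGibbsSamples 30 --thinningFactor 100 -o quants/s1
## Then import transcript-level counts plus inferential variances:
library(tximport)

samples <- c("s1", "s2", "s3")                  # hypothetical sample IDs
files <- file.path("quants", samples, "quant.sf")
names(files) <- samples

txi <- tximport(files, type = "salmon", txOut = TRUE,
                varReduce = TRUE)               # collapse infReps to variances
str(txi$variance)  # transcripts x samples matrix of inferential variances
```

Without varReduce=TRUE, the replicates themselves are returned in txi$infReps instead.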

From my work in isoform quantification, I can provide some intuition about what may be happening: even if the count is high for the gene/isoform, there can still be substantial uncertainty on the assignment of read counts to isoforms. Then across samples you may see high variance due to the stochastic assignment of many reads across those isoforms with a lot of shared exonic sequence.

@gordon-smyth
WEHI, Melbourne, Australia

Yes, problems with the voom mean-variance trend are entirely to be expected with isoform-level data, and they are a concern. They will cause voom to be much less powerful than it usually would be. The problem is caused by variance inflation due to overlap (and hence ambiguity) between isoforms. This is why voom has only ever been recommended for gene-level data.

As mentioned by Michael Love, the variance inflation can be estimated by bootstrapping if you use Salmon (although I have different recommendations for how to feed it into voom). If you use StringTie then there's not much you can do about it.


Michael and Gordon, I very much appreciate your comments. I fully realize that voom (and other methods) were designed for gene-level analysis. StringTie was used because one of the goals is to discover novel isoforms. I am planning to use Salmon at least on a subset of the samples (n=855), but I have not yet figured out what to do with the results from Salmon.

First, Salmon can be run with SA (selective alignment) or using the BAM files from STAR alignment (which I have). Which of these two options do you recommend? I assume that bootstrapping can be performed with both.

Second, once I have the bootstrap results from Salmon/tximport, what should I do with them? You seem to have some ideas on how to incorporate the variance inflation into voom. While you mostly raise the issue of power, the 2017 NMETH sleuth paper seems to focus mostly on false positives, though I may not quite understand this paper yet: they estimate an inferential variance from bootstrapping and subtract it from the total variance to obtain a biological variance, which is then regularized. Given my large sample size, regularization should not matter, and in that case I do not see how their method would make any difference.

So in summary, yes, I plan to use Salmon, but it is not trivial to figure out what to do with the results from Salmon to improve differential expression analysis. Would be grateful for any suggestions …
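As a rough back-of-envelope illustration of the decomposition described above (this is not sleuth's actual implementation, which works on transformed counts with regularization), assuming a txi object imported with varReduce=TRUE:

```r
## Rough sketch of "total minus inferential" variance per transcript;
## NOT sleuth's method. Assumes txi from tximport(..., varReduce = TRUE).
total_var <- apply(txi$counts, 1, var)     # variance across samples
inf_var   <- rowMeans(txi$variance)        # mean within-sample inferential variance
bio_var   <- pmax(total_var - inf_var, 0)  # remaining "biological" variance
```

Even at n=855, the point of the subtraction is not the regularization but the removal of the inferential component, which does not shrink with sample size for ambiguous isoforms.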


Regarding SA (selective alignment) versus STAR, this is addressed in depth here:

https://www.biorxiv.org/content/10.1101/657874v2

We have published a paper and have a Bioconductor package with a method, Swish, for using inferential uncertainty in the analysis:
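Assuming the package referred to is fishpond (where Swish is distributed on Bioconductor), a minimal usage sketch following the package vignette might look like:

```r
## Sketch of a Swish analysis; assumes the tximeta and fishpond packages
## and a coldata data.frame with 'files', 'names' and a 'condition' column.
library(tximeta)
library(fishpond)

se <- tximeta(coldata)            # import counts + inferential replicates
y  <- scaleInfReps(se)            # scale the inferential replicates
y  <- labelKeep(y)                # flag transcripts passing a minimal filter
y  <- y[mcols(y)$keep, ]
set.seed(1)
y  <- swish(y, x = "condition")   # nonparametric test using the infReps
```

Swish requires the inferential replicates themselves, so the data would be imported without varReduce=TRUE.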