This is a good question. The short answer is that cqn run post-hoc probably captures most of the gene-level GC and length biases that can also be modeled and corrected within Salmon. By post-hoc I mean removing bias after quantification. But I prefer Salmon's bias modeling and correction.
Salmon offers a more direct and comprehensive approach to bias correction (during quantification, not post-hoc), and for this and other reasons Salmon => tximport is my preferred way to quantify before doing gene- or transcript-level differential expression. Any bias that Salmon models and corrects for is passed through tximport to downstream statistical packages when you follow the recommended steps in the tximport vignette.
What do I mean by "direct and comprehensive"? Well, to model biases accurately, you ideally need to know which transcript each fragment comes from. Doing bias correction and transcript quantification simultaneously (or in alternating steps) allows you to be more accurate at both.
A simple toy example is below: say we have four fragments aligning to these two isoforms of a gene, all within the shared first exon. The second exon, which is alternative, has high GC content. There are two possibilities: (i) there is no GC dependence and the first isoform is expressed, or (ii) the second isoform is expressed, but fragments are not generated from the second exon because it has high GC content and the experiment had difficulty amplifying and sequencing fragments with high GC content.
==== : isoform 1
==== ==== : isoform 2
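To make the ambiguity concrete, here is a rough Python sketch of the toy example. Everything in it is an assumption made for illustration: the two exons have equal length, `eff_exon2` is a made-up relative sampling efficiency for the high-GC exon, and the function names are hypothetical. It is not how Salmon or cqn model bias; it only shows that the observed fragments alone cannot separate the two explanations.

```python
# Toy model (hypothetical, for illustration only): two equal-length exons.
# Isoform 1 = exon 1 only; isoform 2 = exon 1 + exon 2.
# eff_exon2 is an assumed relative sampling efficiency of the high-GC exon 2
# (1.0 means no GC bias; values near 0 mean fragments are rarely recovered).

def prob_fragment_in_exon2(t1, t2, eff_exon2):
    """Probability that a sampled fragment falls in exon 2."""
    weight_exon1 = t1 + t2          # both isoforms contain exon 1
    weight_exon2 = t2 * eff_exon2   # only isoform 2 contains exon 2
    return weight_exon2 / (weight_exon1 + weight_exon2)

def likelihood_all_in_exon1(t1, t2, eff_exon2, n_fragments=4):
    """Likelihood of seeing every one of n_fragments in exon 1."""
    p2 = prob_fragment_in_exon2(t1, t2, eff_exon2)
    return (1.0 - p2) ** n_fragments

# (t1, t2, eff_exon2) for each explanation of the data
scenarios = {
    "(i) isoform 1, no GC bias": (1.0, 0.0, 1.0),
    "(ii) isoform 2, strong GC bias": (0.0, 1.0, 0.01),
    "isoform 2, no GC bias": (0.0, 1.0, 1.0),
}

likelihoods = {
    name: likelihood_all_in_exon1(*params)
    for name, params in scenarios.items()
}

for name, lik in likelihoods.items():
    print(f"{name}: {lik:.4f}")
```

Under these assumed numbers, (i) and (ii) explain the four fragments almost equally well (likelihoods of 1.0 and roughly 0.96), while isoform 2 without GC bias is clearly disfavored (0.0625). Looking at this one gene by itself, you can't tell (i) from (ii).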
The way to solve this is, of course, to use information across all transcripts. You can imagine how cqn on gene-level counts and gene-level average GC content could catch most cases of GC dependence. But it's important to remember that fragments actually arise from transcripts, so modeling bias and expression at the transcript level can improve even gene-level inference. This is the argument in Soneson (2015), although that paper didn't go into depth on the GC bias issue.
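As a rough illustration of "using information across all transcripts", here is a small Python sketch. It assumes a deliberately simplified setup that is not Salmon's actual model: the GC effect is estimated from hypothetical single-isoform transcripts that each contain one low-GC and one high-GC exon of equal length, and that estimate is then plugged into a simple method-of-moments calculation for a two-isoform gene (isoform 1 = exon 1 only, isoform 2 = exon 1 + high-GC exon 2). All counts and function names are made up.

```python
# Hypothetical reference data: (low-GC exon count, high-GC exon count) from
# single-isoform transcripts, where both exons share the same abundance.
reference_counts = [(100, 24), (60, 16), (200, 50)]

def estimate_gc_efficiency(pairs):
    """Relative sampling efficiency of high-GC vs low-GC sequence.

    Pooling counts across transcripts is an assumption of this sketch.
    """
    low_total = sum(low for low, _ in pairs)
    high_total = sum(high for _, high in pairs)
    return high_total / low_total

def isoform2_fraction(count_exon1, count_exon2, eff_exon2):
    """Method-of-moments estimate of isoform 2's share of expression.

    Assumes at least one fragment was observed at the gene.
    """
    t2 = count_exon2 / eff_exon2        # undo the GC under-sampling of exon 2
    t1 = max(count_exon1 - t2, 0.0)     # exon 1 coverage not explained by isoform 2
    return t2 / (t1 + t2)

eff = estimate_gc_efficiency(reference_counts)

# Hypothetical counts at the two-isoform gene: 40 in exon 1, 5 in exon 2.
naive_fraction = isoform2_fraction(40, 5, eff_exon2=1.0)      # ignore GC bias
corrected_fraction = isoform2_fraction(40, 5, eff_exon2=eff)  # use estimated bias

print(f"estimated efficiency: {eff}")
print(f"isoform 2 fraction, ignoring bias: {naive_fraction}")
print(f"isoform 2 fraction, with bias: {corrected_fraction}")
```

With these made-up numbers, ignoring the bias suggests isoform 2 is a minor isoform, while accounting for the bias learned from other transcripts puts the two isoforms at equal abundance. The point of the sketch is only that the bias estimate has to enter at the step where fragments are assigned to transcripts.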
In the alpine paper published last year, we show that GC bias is especially problematic for transcript-level abundance, and here I don't think cqn can help much, because RNA-seq fragments need to be "moved" from one transcript to another during quantification, using all the information at hand. It really can't be done post-hoc in this case. Having all the information at hand is exactly how Salmon can correct for GC and other biases. This point is also not new: it's the argument presented by Roberts (2010), which proposed the random hexamer bias correction used in Cufflinks:
"Because expression levels also affect fragment abundances, it is necessary to jointly estimate transcript abundances and bias parameters in order to properly learn the bias directly from RNA-Seq data."