Question: GC/length bias correction in gene expression quantification - Is CQN necessary if using Salmon --GCBias option?
dmr21020 wrote, 19 months ago:

Hi,

Considering multiple RNA-seq samples for which we are interested in gene expression quantification, with the aim of using it for differential expression with DESeq2 and for eQTL analysis, what would be the best way to correct for GC content bias and transcript length bias?

I have previously used CQN after quantification with HTSeq to normalise for these biases. I am now quantifying gene expression with Salmon, and I was wondering whether that step is still necessary if I use Salmon's --gcBias option and then import the data into DESeq2 with DESeqDataSetFromTximport, which applies a gene length correction. The DESeq2 object would then be used either for differential expression or to variance-stabilise the data (which can then be used for eQTL analysis).

In summary, is:

• HTSeq + CQN (GC and length correction) + DESeq2 (VST or differential expression - importing covariates from CQN)

Equivalent to:

• Salmon --gcBias + tximport + DESeqDataSetFromTximport in DESeq2 (VST or differential expression)

Modulo the differences in quantification between HTSeq and Salmon.

Or do I need to do:

• Salmon (without --gcBias) + Tximport (to simply summarise to gene level without length scaling) + CQN + DESeq2 (VST or differential expression - importing covariates from CQN)

Thanks very much,

Delphine

modified 19 months ago by Michael Love • written 19 months ago by dmr21020

Just saw this, I'll answer tomorrow.

I added tags DESeq2 and tximport. Without these the post is not sent to the maintainer (me).

Michael Love (United States) wrote, 19 months ago:

hi Delphine,

This is a good question. The short answer is that cqn, run post hoc (that is, removing bias after quantification), probably captures most of the gene-level GC and length biases that Salmon can model and correct for. But I prefer Salmon's bias modeling and correction.

Salmon offers a more direct and comprehensive approach to bias correction (during quantification, not post hoc), and for this and other reasons Salmon => tximport is my preferred way to quantify before doing gene-level or transcript-level differential expression. Any bias that Salmon models and corrects for is passed through tximport to downstream statistical packages, provided you follow the recommended steps in the tximport vignette.

What do I mean by "direct and comprehensive"? Well, to accurately model biases, you ideally would need to know which transcript the fragments come from. Doing bias correction and transcript quantification simultaneously (or in alternating steps) allows you to be more accurate at both.

A simple toy example below: say we have four fragments aligning to these two isoforms of a gene. The second exon, which is alternative, has high GC content. There are two possibilities: (i) there is no GC dependence and the first isoform is expressed, or (ii) the second isoform is expressed, but no fragments are observed from the second exon because it has high GC content and the experiment had difficulty amplifying and sequencing fragments with high GC content.

 - -
- -
====      : isoform 1
==== ==== : isoform 2

The way to resolve this ambiguity is to use information across all transcripts. You can imagine how cqn, working on gene-level counts and gene-level average GC content, could catch most cases of GC dependence. But it's important to remember that the fragments actually arise from transcripts, so modeling bias and expression on the scale of transcripts can improve even gene-level inference. This is the argument in Soneson (2015), although that paper didn't go into depth on the GC bias issue.
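To put rough numbers on the toy example, here is a minimal Python sketch. It is only an illustration: the exon lengths, the `gc_efficiency` parameter and the function name are all assumptions made up for this post, and none of it reflects how Salmon or cqn actually compute anything. It compares how well each hypothesis explains "all four fragments fall on exon 1".

```python
# Toy likelihood comparison for the two-isoform example.
# Exon lengths, fragment count and efficiency values are made-up assumptions.

def prob_all_on_exon1(n_fragments, len_exon1, len_exon2, gc_efficiency):
    """P(all fragments land on exon 1 | isoform 2 is the expressed isoform).

    gc_efficiency is the relative chance of recovering a fragment from the
    high-GC exon 2 compared with exon 1 (1.0 means no GC bias).
    """
    weight_exon1 = len_exon1 * 1.0
    weight_exon2 = len_exon2 * gc_efficiency
    p_exon1 = weight_exon1 / (weight_exon1 + weight_exon2)
    return p_exon1 ** n_fragments


N_FRAGMENTS = 4
LEN_EXON1 = 100  # hypothetical exon lengths
LEN_EXON2 = 100

# Hypothesis (i): isoform 1 contains only exon 1, so the outcome is certain.
lik_isoform1 = 1.0

# Hypothesis (ii) with strong GC bias: exon 2 is almost never sequenced.
lik_isoform2_strong_bias = prob_all_on_exon1(
    N_FRAGMENTS, LEN_EXON1, LEN_EXON2, gc_efficiency=0.01
)

# Hypothesis (ii) if other transcripts tell us there is no GC bias.
lik_isoform2_no_bias = prob_all_on_exon1(
    N_FRAGMENTS, LEN_EXON1, LEN_EXON2, gc_efficiency=1.0
)

print("isoform 1, no bias:       ", lik_isoform1)
print("isoform 2, strong GC bias:", lik_isoform2_strong_bias)
print("isoform 2, no GC bias:    ", lik_isoform2_no_bias)
```

With this gene alone, (i) and (ii) with strong bias explain the data almost equally well. Once the GC efficiency is pinned down from other transcripts (here, set to 1.0), isoform 2 becomes a much poorer explanation, which is the sense in which information across all transcripts resolves the ambiguity.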

In the alpine paper published last year, we showed that GC bias is especially problematic for transcript-level abundance, and here I don't think cqn can help much, because RNA-seq fragments need to be "moved" from one transcript to another during quantification, using all the information at hand. This really can't be done post hoc. Having all the information at hand is exactly what allows Salmon to correct for GC and other biases. This point is also not new: it's the argument presented by Roberts (2010), which proposed the random hexamer bias correction used in Cufflinks:

"Because expression levels also affect fragment abundances, it is necessary to jointly estimate transcript abundances and bias parameters in order to properly learn the bias directly from RNA-Seq data."
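As a rough illustration of what "jointly estimate" means, here is a toy alternating scheme in Python. To be clear, this is not Salmon's or Cufflinks' algorithm: the three transcripts, their noise-free bin counts, the two GC classes and all function names are assumptions invented for this sketch, and every fragment is already assigned to a transcript (so it only shows the coupling between bias and abundance, not the reassignment of fragments between isoforms). The data were generated with abundances 10, 20, 5 and a high-GC efficiency of 0.5.

```python
# Toy alternating estimation of transcript abundances and a GC bias factor.
# Not Salmon's or Cufflinks' algorithm; all names and numbers are illustrative.

# Each transcript is a list of (gc_class, fragment_count) bins.
transcripts = {
    "T1": [("low", 10.0), ("low", 10.0)],
    "T2": [("low", 20.0), ("high", 10.0)],
    "T3": [("high", 2.5), ("high", 2.5)],
}


def estimate_abundance(bins, bias):
    """Abundance = observed fragments / bias-adjusted effective length."""
    total = sum(count for _, count in bins)
    effective_length = sum(bias[gc] for gc, _ in bins)
    return total / effective_length


def estimate_bias(transcripts, theta):
    """Bias per GC class = observed / expected-under-no-bias, scaled so low = 1."""
    observed = {"low": 0.0, "high": 0.0}
    expected = {"low": 0.0, "high": 0.0}
    for name, bins in transcripts.items():
        for gc, count in bins:
            observed[gc] += count
            expected[gc] += theta[name]
    raw = {gc: observed[gc] / expected[gc] for gc in observed}
    return {gc: raw[gc] / raw["low"] for gc in raw}


def fit(transcripts, n_iter=200):
    """Alternate between abundance and bias updates, starting from no bias."""
    bias = {"low": 1.0, "high": 1.0}
    for _ in range(n_iter):
        theta = {n: estimate_abundance(b, bias) for n, b in transcripts.items()}
        bias = estimate_bias(transcripts, theta)
    theta = {n: estimate_abundance(b, bias) for n, b in transcripts.items()}
    return theta, bias


no_bias = {"low": 1.0, "high": 1.0}
naive = {n: estimate_abundance(b, no_bias) for n, b in transcripts.items()}
theta, bias = fit(transcripts)

print("ignoring bias:", naive)
print("joint estimate:", theta)
print("estimated GC bias:", bias)
```

Ignoring bias underestimates T2 and T3, and you cannot learn the bias from raw counts alone without knowing the abundances. Alternating the two updates recovers both in this noise-free toy setting.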

Thanks very much for your very clear answer. I had decided to go for that option, but was really unsure as to whether I was making the right choice, and it is very reassuring to have an expert's opinion!

Would you change your answer if I specified that I am using single end data or does that not have much impact on the accuracy of the bias correction?

Another good question. I believe Salmon does have the option to run GC bias correction with single-end reads. This was implemented last year:

https://github.com/COMBINE-lab/salmon/issues/83