Question

GOseq: count bias correction variations?

2

Entering edit mode

Christopher Conley ▴ 40

@christopher-conley-6142

Last seen 9.6 years ago

The vignette of GOseq details in section 6.8 how to correct for total read count bias (per gene). The code used to calculate total read count bias incorporates raw coverage counts instead of normalized counts.

When there are dramatic changes in library size, is it more appropriate to use the normalized counts instead? Would taking the log2 transformation of the total read counts also help improve the spline fitting of the probability weighting function (PWF)?

Thank you,
Chris Conley
Research Assistant
UC Davis Genome Center

Coverage goseq • 1.9k views

ADD COMMENT • link updated 7.0 years ago by Gordon Smyth 50k • written 9.7 years ago by Christopher Conley ▴ 40

0

Entering edit mode

Hey all,

I would also be really interested in an answer to this problem. Especially since in the voom paper it has been suggested that unequal library sizes lead to significantly different results. Additionally, new light-weight algorithms provide an "effective length" instead of the actual gene length, which already accounts for a number of biases.

ADD REPLY • link 7.0 years ago S.Bamopoulos ▴ 10

0

Entering edit mode

If you have a measure of effective gene length from software like Rsubread or tximport, then you can easily pass it to GOseq or goana(). You've always been able to do that. There is no need to use the gene length database that comes with GOseq in that case.

If you want to discuss more about gene length you should post your own question to Bioconductor, rather than adding comments to Christopher Conley's old post. Christopher's question was about read count bias rather than gene length bias.

ADD REPLY • link 7.0 years ago Gordon Smyth 50k

0

Entering edit mode

Hello Gordon,

I believe I understood Cristopher's question correctly, but I may be mistaken. I am using Salmon and from my understanding of the documentation the "effective length" given as an output accounts for at least some of the read count bias present in the data:

"EffectiveLength — This is the computed effective length of the target transcript. It takes into account all factors being modeled that will effect the probability of sampling fragments from this transcript, including the fragment length distribution and sequence-specific and gc-fragment bias (if they are being modeled)." - salmon documentation 0.8.1

So, again if I am not mistaken, by providing goseq with the "effective length" rather than simply the "length" of the salmon output files we account for both "gene length bias" and at least some "read length bias" (e.g. unequal library sizes). Since in most DE-workflows "read count bias" is accounted for anyway (e.g. by TMM normalization and/or voom-transformation), I was wondering if it is meaningful to doubly account for it by providing the "effective length".

ADD REPLY • link 7.0 years ago S.Bamopoulos ▴ 10

0

Entering edit mode

No, it isn't meaningful. Salmon is good at what it does, but the biases it adjusts for don't help at all with a goseq analysis.

The count bias that goseq corrects for doesn't have anything to do with unequal library sizes or voom or normalization methods. goseq is concerned with the fact that the data will provide you with more statistical power for some genes than others, and this fundamental fact is inescapable regardless of how you have processed your data.

goseq has two options -- you can adjust for either gene length (gene length bias) or for average read abundance (total count bias). If you choose the latter, then you are in effect adjusting for length bias as well, but an estimate of gene length is not required (nor is it helpful).

ADD REPLY • link 7.0 years ago Gordon Smyth 50k

score 0 · Answer 1 · 2017-04-16

First, let me say that I always find the term "normalized counts" quite unhelpful, because people frequently have different things in mind when they use this expression. It is not a well-defined term. Normalized for library size? Normalized by TMM factors? RPKM? TPM? Who knows! I always ask people to say what they actually mean.

Anyway, the GOseq method continues to work quite well even when the library sizes are unequal. The total read count still does a reasonably good job of sorting genes by statistical power, which is what is required.

The goana() function in the limma package implements a variation of the GOseq method. In the goana() implementation, the genes are ordered by average logCPM using effective library sizes instead of total raw count, and the probability weight function is fitted to the ranks of the genes rather than actual logCPM. I think the goana() implementation will work better than GOseq when the number of DE genes is small, but GOseq and goana() should both work well when there are lots of DE genes.