7 months ago by
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
First, let me say that I always find the term "normalized counts" quite unhelpful, because people frequently have different things in mind when they use this expression. It is not a well-defined term. Normalized for library size? Normalized by TMM factors? RPKM? TPM? Who knows! I always ask people to say what they actually mean.
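To make the ambiguity concrete, here is a minimal Python sketch of three quantities that are all commonly called "normalized counts". It is not code from edgeR, limma or GOseq. The function names, the toy counts and the gene lengths are invented for illustration, and the normalization factor is assumed to be supplied (for example from TMM) rather than estimated here.

```python
def cpm(counts, norm_factor=1.0):
    """Counts per million for one sample.

    norm_factor is an assumed, externally supplied scaling factor
    (e.g. a TMM factor); library size * norm_factor is the effective
    library size.
    """
    eff_lib_size = sum(counts) * norm_factor
    return [c / eff_lib_size * 1e6 for c in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene length per million mapped reads."""
    per_million = sum(counts) / 1e6
    return [c / (length / 1e3) / per_million
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-scaled rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Toy sample: three genes (hypothetical values).
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]

by_library_size = cpm(counts)
by_tmm_style_factor = cpm(counts, norm_factor=0.8)
by_rpkm = rpkm(counts, lengths_bp)
by_tpm = tpm(counts, lengths_bp)
```

The same raw counts give four different sets of numbers, which is why the phrase on its own does not say enough.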
Anyway, the GOseq method continues to work quite well even when the library sizes are unequal. The total read count still does a reasonably good job of sorting genes by statistical power, which is what is required.
The goana() function in the limma package implements a variation of the GOseq method. In the goana() implementation, the genes are ordered by average logCPM, computed using effective library sizes, instead of by total raw count, and the probability weight function is fitted to the ranks of the genes rather than to the actual logCPM values. I think the goana() implementation will work better than GOseq when the number of DE genes is small, but GOseq and goana() should both work well when there are lots of DE genes.
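As a rough illustration of the ordering step only, here is a minimal Python sketch that averages logCPM over samples using effective library sizes and then converts the averages to ranks. It is a simplification under stated assumptions, not the limma or edgeR code: the prior count of 0.5, the simple per-sample log2 formula, the helper names and the toy data are all hypothetical, and the fitting of the probability weight function to the ranks is not shown.

```python
import math


def ave_log_cpm(counts_by_gene, lib_sizes, norm_factors, prior=0.5):
    """Average log2-CPM per gene across samples.

    Uses effective library size = library size * normalization factor.
    The prior count (an assumed value) avoids taking log2 of zero.
    """
    eff_lib_sizes = [l * f for l, f in zip(lib_sizes, norm_factors)]
    result = {}
    for gene, counts in counts_by_gene.items():
        log_cpms = [
            math.log2((c + prior) / (eff + 2 * prior) * 1e6)
            for c, eff in zip(counts, eff_lib_sizes)
        ]
        result[gene] = sum(log_cpms) / len(log_cpms)
    return result


def to_ranks(values):
    """Rank genes from lowest (1) to highest abundance.

    Ties are broken by input order, which is a simplification.
    """
    ordered = sorted(values, key=values.get)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


# Toy data: three genes, two samples (hypothetical values).
counts_by_gene = {"g1": [10, 20], "g2": [1000, 900], "g3": [0, 1]}
lib_sizes = [1_000_000, 2_000_000]
norm_factors = [1.0, 0.9]

abundance = ave_log_cpm(counts_by_gene, lib_sizes, norm_factors)
abundance_rank = to_ranks(abundance)
```

Because only the ranks are carried forward, any monotone rescaling of the abundance measure would leave the ordering of genes unchanged.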