How to use GOseq without length bias
1
0
Entering edit mode
b.nota ▴ 360
@bnota-7379
Last seen 2.2 years ago
Netherlands

Hello,

I use GOseq quite often for RNAseq analyses, including the length bias correction. My question is how to use it properly for data without length bias, e.g., for microarrays.

Normally, with RNAseq data, I would use something like:

pwf <- nullp(gene.vector, "hg19", "geneSymbol", bias.data = lengthhg19)

GO.wall <- goseq(pwf, "hg19", "geneSymbol", use_genes_without_cat = TRUE)


But what to change for microarray data? Would it be enough to add method = "Hypergeometric" to the goseq function? Or do I also have to feed the nullp function with dummy values (let's say 100 for each gene)?

Ben

goseq • 1.6k views
1
Entering edit mode
@steve-lianoglou-2771
Last seen 3 months ago
United States

One option would be to still use the real gene lengths and pass those into the nullp function. This would presumably create a "flat line" plot (i.e. no bias) and you can proceed from there as normal.

0
Entering edit mode

Do you mean proceed with:

GO.wall <- goseq(pwf, "hg19", "geneSymbol", use_genes_without_cat = TRUE, method = "Hypergeometric")
2
Entering edit mode

Take a minute to read through the documentation from ?goseq. Specifically I'm referring to this section:

CAUTION:  "Hypergeometric" should NEVER be used for producing results for biological interpretation.  If there is genuinly no bias in power to detect DE in your experiment, the PWF will reflect this and the other methods will produce accuracte results.

"Hypergeometric" assumes there is no bias in power to detect differential expression at all and calculates the p-values using a standard hypergeometric distribution.  Useful if you wish to test the effect of selection bias on your results.

So, the lack of bias in DE detection should be reflected in the plot of your pwf and you can use the default method (Wallenius) when calling goseq. That having been said, they HyperGeometric option will ensure that no information from your pwf is used at all.

Lastly, it might be that there is no length bias from the microarray data, but there could very well be bias coming from another factor. You could imagine calculating your pwf from a bias.data vector that consists of GC content of the probes used for each expression readout. Not sure what that would look like, but might be interesting to see.

0
Entering edit mode

Thanks Steve for the nice explanation. I guess it would be safest to just use the Hypergeometric method for not microarray data.

0
Entering edit mode

Did you mean to say "I guess it would be safest to just use the Hypergeometric method for microarray data"?

You had a "not" in there, so wasn't sure. In any event, I wasn't suggesting that it's "safer" to do so. If I were doing the analysis and given that I have a tool at hand (goseq) that lets me ask questions about bias in DE quite easily, I'd just try a few different bias.data vectors (eg. length, gc content) just to see if there is bias I didn't expect.

Either way, people have been using HyperGeoemtric tests for GO enrichment with microarray data for a long time now, so you can just go that route if it suits your purpose.

0
Entering edit mode

Yes, the "not" slipped in there by accident. Sorry about that.

I am actually writing some additional functions to ease up (or automate) my GO analysis with goseq, and at the moment I don't have actual microarray studies, but just in case ;-)

Thanks again!

0
Entering edit mode

Wouldn't it be easiest to simply run goseq with a vector of dummy gene length, say rep(1000, number.of.genes)?

0
Entering edit mode

Random dummy lengths would work, but using the same value (1000) for all genes would trow an error. I tried that.

0
Entering edit mode

True. Something like bias <- jitter(rep(1000, length(genes)) should do the trick I guess, giving a completely flat fit.