I've attempted to use GOseq, and while I know that, due to the gene length correction, it is more appropriate for RNAseq data than topGO, I like that topGO takes the topology of the GO graph in to account. I'm trying to find a middle ground.
In his great great tutorial (http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html#gene-ontology-enrichment-analysis), Bernd Klaus, uses the genefinder function to match 10 genes of similar expression levels (mean normalized count) to his DE genes of interest. He then uses this set as a background set to proceed with enrichment using topGO. I liked this approach, though I have an RNAseq data set with 4 different treatments with varied gene expression among the four and so I thought that matching based on average gene count might yield some inappropriate matches. I then thought couldn't I just use this matching method to pick genes with similar length I by that control for that bias similarly to GOseq.
Does this make sense?
Another question is, in the tutorial we search for 10 gene matches. Is this enough? I have a specific set of 70+ genes I'm interested in GO terms for and figured the background matching would only be 700 genes and that wouldn't be so representative of the whole set. Thoughts?
Any comments or suggestions are greatly appreciated.
My code based on the tutorial:
geneLengthMatrix<-as.matrix(geneLengthData) backG <- genefinder(geneLengthMatrix, rownames(DEgeneSet),numResults = 100, method = "manhattan") backG <- rownames(geneLengthMatrix)[as.vector(sapply(backG, function(x)x$indices))] backG <- setdiff(backG, rownames(DEgeneSet)) library(geneplotter) multidensity( list( all= log2(geneLengthMatrix[,1]), foreground =log2(geneLengthMatrix[rownames(DEgeneSet),]), background = log2(geneLengthMatrix[backG,])), xlab="log2 gene length", main = "Matching for enrichment analysis") geneIDs = rownames(dds) inUniverse = geneIDs %in% c(rownames(DEgeneSet), backG) inSelection = geneIDs %in% rownames(DEgeneSet) alg <- factor( as.integer( inSelection[inUniverse] ) ) names(alg) <- geneIDs[inUniverse] tgd<-new("topGOdata", ontology="BP", allGenes=alg, nodeSize=ns, annot= annFUN.gene2GO, gene2GO= CukeGO)