Hello all,
I've attempted to use GOseq, and while I know that, due to the gene length correction, it is more appropriate for RNAseq data than topGO, I like that topGO takes the topology of the GO graph in to account. I'm trying to find a middle ground.
In his great great tutorial (http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html#gene-ontology-enrichment-analysis), Bernd Klaus, uses the genefinder function to match 10 genes of similar expression levels (mean normalized count) to his DE genes of interest. He then uses this set as a background set to proceed with enrichment using topGO. I liked this approach, though I have an RNAseq data set with 4 different treatments with varied gene expression among the four and so I thought that matching based on average gene count might yield some inappropriate matches. I then thought couldn't I just use this matching method to pick genes with similar length I by that control for that bias similarly to GOseq.
Does this make sense?
Another question is, in the tutorial we search for 10 gene matches. Is this enough? I have a specific set of 70+ genes I'm interested in GO terms for and figured the background matching would only be 700 genes and that wouldn't be so representative of the whole set. Thoughts?
Any comments or suggestions are greatly appreciated.
-Ben
My code based on the tutorial:
geneLengthMatrix<-as.matrix(geneLengthData) backG <- genefinder(geneLengthMatrix, rownames(DEgeneSet),numResults = 100, method = "manhattan") backG <- rownames(geneLengthMatrix)[as.vector(sapply(backG, function(x)x$indices))] backG <- setdiff(backG, rownames(DEgeneSet)) library(geneplotter) multidensity( list( all= log2(geneLengthMatrix[,1]), foreground =log2(geneLengthMatrix[rownames(DEgeneSet),]), background = log2(geneLengthMatrix[backG,])), xlab="log2 gene length", main = "Matching for enrichment analysis") geneIDs = rownames(dds) inUniverse = geneIDs %in% c(rownames(DEgeneSet), backG) inSelection = geneIDs %in% rownames(DEgeneSet) alg <- factor( as.integer( inSelection[inUniverse] ) ) names(alg) <- geneIDs[inUniverse] tgd<-new("topGOdata", ontology="BP", allGenes=alg, nodeSize=ns, annot= annFUN.gene2GO, gene2GO= CukeGO)
Bernd thanks for your great detailed answer.
I had looked into using the variance stabilized count matrix or even the
counts(dds, normalized =T)
matrix to use in genefinder and this is where I started to think that I would be getting inappropriate matches for the background. For example, I am interested in genes DE with age and genotype - so they are upregulated in a specific sample group out of 4. I used a padj threshold of 0.05 and fold change threshold of 2 to call genes DE. Using the counts matrices would pull out background sets that are all relatively up in one tissue (even if they didn't pass my thresholds). I'm not sure if this is a great group to act as a background. Am I wrong?Regarding the gene length. Do you think that matching in this way is an appropriate method to address the gene length bias? Is there any precedence to do it this way? Though it makes sense to me, I couldn't find any papers addressing this issue in this way.
What are your thoughts?
What are the benefits of using matchit vs genefinder? I am slightly familiar with mahalanobis matching (my wife uses it in her political science research :-) ) But what am I gaining from this specific method vs mine above?
Thank you again for all your help, I do appreciate it immensely.
Ben
Hi Ben,
your first observation is very pertinent: Yes, using genefinder like this can lead to a background that contains all your "false negatives". Running an enrichment analysis with this set might make you loose a lot of power. However, you can always exclude these potential false negative from your background set. You don't have to accept the genefinder results as it is.
To me, the rationale behind matching by gene length is that longer genes have more counts, and as genes instead of samples become your sampling units in an enrichment analysis, it makes sense to select a background set, that has a comparable gene length distribution and thus (most probably) comparable expression. I don't have a direct ref here, but one could potentially cite the goseq paper
http://genomebiology.com/2010/11/2/R14/abstract
which uses a more sophisticated method to correct for gene length / expression strength)
I think there are no other direct benefits to using matchIT compared to genefinder other than matchIT's flexibility in the program options. Note that the mahalanobis distance in the one-dimensional case (i.e. gene length only) is just a scaled eucledian distance). For example, you could use the mahalnobis distance using both gene length and log-average expression. So I just wanted to point out another nice package that is not obvious to find for Bioconductors as it was written by political science people.
So basically I don't have very strong recommendations overall, but I can just provide you with some hints. I think as long as you carefully describe what you did, avoid sweeping doubts under the carpet and discuss potential shortcomings, whatever strategy you decide on in the end is fine.
Best wishes,
Bernd
Bernd,
Thanks for your advice and explanations.
After talking to my PI and discussing the matter I think we will opt to go with matching based on average gene expression as you originally described (not the counts matrix). We also thought of using the two variables (expression and length), as you mentioned, but we figure that expression levels would account for length as well. So basically as you describe in your code above. Maybe I'll run it once that way and see if I get any different results.
Thanks again for all your help.
Ben