GOstats - are significantly enriched GO terms with a small "size" meaningful?
2
0
Entering edit mode
joasmile84 ▴ 10
@joasmile84-20206
Last seen 1 day ago
Switzerland

Hi all,

I am trying to perform a GO enrichment analysis on a list of differentially expressed genes and I am having doubts on the robustness of the results. I am using the GOstats package and the organism is Apis mellifera, so I had to make a custom gene universe set, which consists of 7903 genes for which GO terms were available. My codes are as follows:

library(GOstats)
library(GSEABase)
library(GOstatsPlus)

require(biomaRt)

mart <- useMart('metazoa_mart', host = 'metazoa.ensembl.org')
mart <- useDataset('amellifera_eg_gene', mart)

annot <- getBM(
mart = mart,
attributes = c(
'go_id',
'ensembl_gene_id'),
uniqueRows = TRUE)

library(dplyr)
annot<-subset(annot, trimws(annot$go_id) !="") goFrame=GOFrame(annot,organism="Apis mellifera") goAllFrame=GOAllFrame(goFrame) library("GSEABase") gsc <- GeneSetCollection(goAllFrame, setType = GOCollection()) gene_universe = head(unique(unlist(geneIds(gsc))),100000) library(stringi) gene_IDs_of_interest<-stri_join(AllDEGs, sep="") params <- GSEAGOHyperGParams(name="My Custom GSEA based annot Params", geneSetCollection=GO_universe, geneIds = gene_IDs_of_interest, universeGeneIds = gene_universe, ontology = "BP", pvalueCutoff = 0.05, conditional = FALSE, testDirection = "over") BP_Over <- hyperGTest(params) summary(BP_Over)  My DEG list consisted of 96 genes, so rather short, but I get 52 biological process terms to be significant at a 0.05 p value cutoff. However, when I look more closely I see that many of these significant terms have a size of 1. For example: GOBPID Pvalue OddsRatio ExpCount Count Size GO:0046098 0.007278609 Inf 0.007278609 1 1 So, my question is whether this is an OK result? Are terms with size 1 ok to keep in the test or should have I applied some initial filtering? I could not find a discussion of this in the GOstats vignettes so I hope you can help me. Another doubt is, should I correct these GO term p values for multiple testing? My approach has so far been to not apply an FDR correction at this stage but then "condense" the results with a tool like Revigo to bin terms by semantic similarity to better summarize the results. Would you have any other suggestion to improve the accuracy of these analyses? Thank you very much in advance for your kind help! GOstats GO RNASeq • 129 views ADD COMMENT 0 Entering edit mode @james-w-macdonald-5106 Last seen 1 day ago United States I normally do summary(BP_over, categorySize = 10)  Because significant results for GO terms with fewer than 10 genes may reflect mere chance rather than evidence for enrichment. 0 Entering edit mode Robert Castelo ★ 2.7k @rcastelo Last seen 3 days ago Barcelona/Universitat Pompeu Fabra hi, one could say that is easier to enrich a gene set that has just one gene, than a gene set that has 100, and from that perspective you may want to filter out gene sets below certain size. In the end, what you're looking for is replicability, i.e., the biological function that you see altered in an experiment, by seeing some enriched gene set, you expect to see it altered in another dataset obtained by conducting the same experiment. A way to attempt selecting altered biological functions that replicate is to keep gene sets with a robust enrichment and that robustness comes usually from having a minimum amount of genes enriching that gene set, something like 3, 4 or 5 genes. you can do that by filtering on the column Count from the data.frame returned by summary(). another useful way to look at those results is ordering them by odds ratio (column OddsRatio) in decreasing order, since that gives you a magnitude of the enrichment, and maybe discarding those enrichments with OR < 1.5. with respect to multiple testing, it's more tricky because here the one-tailed Fisher's exact test p-values returned by GOstats are not independent, in the sense that overlapping GO terms may have correlated p-values. In general this means that you can potentially be over-correcting those p-values, which in the end means the correction may be conservative. Others in this forum may have more informed or specific opinions or advice about this. If you decide anyway to apply some p-value correction, one problem with your code above is that the function summary() is only giving you the GO terms with a p-value below the threshold you put on the GSEAGOHyperGParams object (pvalueCutoff = 0.05). So, you would need to build another data.frame with all the GO terms tested by GOstats and apply the correction there. You can build such a data.frame as follows: BP_Over_All <- data.frame(GOBPID=names(geneIdUniverse(BP_Over)), Pvalue=pvalues(BP_Over), OddsRatio=oddsRatios(BP_Over), ExpCount=expectedCounts(BP_Over), Count=geneCounts(BP_Over), Size=universeCounts(BP_Over))  and then do the correction on BP_Over_All$Pvalue.

1
Entering edit mode

I believe that BH remains accurate under positive dependence, and BY remains accurate under arbitrary dependence structures. That is, if I understand the second reference in ?p.adjust.

Gordon Smyth is probably the one to ask though, since he's the one who contributed the code for BH and BY in p.adjust, and IIRC, the interpretation of BH in R and the original are slightly different (the original estimating the proportion of false positives and Gordon's version estimating the maximum proportion of false positives).