There have been a number of questions asked on this topic (I've linked the relevant threads below), but I'm not sure if anyone has found a working solution. I have run GOSeq on my gene set and am looking to identify the genes in each category (numDEInCat). Here is what I've done so far. After obtaining go_bp as the result of running GOSeq:
category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology 1667 GO:0006955 7.257450e-14 1 135 668 immune response BP 6542 GO:0045087 1.235722e-13 1 83 336 innate immune response BP 578 GO:0002376 3.926914e-13 1 229 1369 immune system process BP 2271 GO:0009605 6.242604e-12 1 186 1079 response to external stimulus BP 2272 GO:0009607 5.035819e-11 1 98 478 response to biotic stimulus BP 10415 GO:0098542 7.923474e-11 1 68 286 defense response to other organism BP
So for example, I want to know what those 135 genes are in the first category.
To get the right gene mappings, I did this, where diff_exp_ensembl is just a named vector containing genes and 1 or 0 depending on whether it is differentially expressed or not:
go2gene = getBM(attributes = c("ensembl_gene_id", "mgi_symbol", "go_id"), mart = mouse) ## map from gene name to ensembl go2gene = go2gene[which(go2gene$ensembl_gene_id %in% names(diff_exp_ensembl)),]
Seems reasonable to me. But then when I run this line:
table(diff_exp_ensembl[which(names(diff_exp_ensembl) %in% go2gene[which(go2gene$go_id == "GO:0006955"),]$ensembl_gene_id)])
0 1 122 31
But there should be 135 differentially expressed genes (i.e. 135 should be 1, in the line above) according to GOSeq.
I've been struggling with this for quite a while- all of this seems quite clunky to me and I'm wondering if there's something I'm missing which makes it much easier. Here are the relevant threads I've found. Any tips are much appreciated!
R version 3.2.4 (2016-03-10) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.9.5 (Mavericks) locale:  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages:  parallel stats4 stats graphics grDevices utils datasets methods base other attached packages:  annotate_1.48.0 XML_3.98-1.4 BiocInstaller_1.20.1 org.Hs.eg.db_3.2.3  org.Mm.eg.db_3.2.3 biomaRt_2.26.1 TxDb.Mmusculus.UCSC.mm10.ensGene_3.2.2 GenomicFeatures_1.22.13  goseq_1.22.0 geneLenDataBase_1.6.0 BiasedUrn_1.07 DESeq2_1.10.1  RcppArmadillo_0.6.600.4.0 Rcpp_0.12.4 SummarizedExperiment_1.0.2 GenomicRanges_1.22.4  GenomeInfoDb_1.6.3 reactome.db_1.54.1 PANTHER.db_1.0.3 RSQLite_1.0.0  DBI_0.6-1 AnnotationDbi_1.32.3 IRanges_2.4.8 S4Vectors_0.8.11  Biobase_2.30.0 BiocGenerics_0.16.1 KEGGREST_1.10.1 loaded via a namespace (and not attached):  locfit_1.5-9.1 lattice_0.20-33 GO.db_3.2.2 png_0.1-7 Rsamtools_1.22.0 Biostrings_2.38.4  R6_2.1.2 plyr_1.8.3 futile.options_1.0.0 acepack_1.4.1 httr_1.1.0 ggplot2_2.1.0  zlibbioc_1.16.0 curl_0.9.6 rpart_4.1-10 Matrix_1.2-4 splines_3.2.4 BiocParallel_1.4.3  geneplotter_1.48.0 foreign_0.8-66 RCurl_1.95-4.8 munsell_0.4.3 rtracklayer_1.30.4 mgcv_1.8-12  nnet_7.3-12 gridExtra_2.2.1 Hmisc_3.17-3 GenomicAlignments_1.6.3 bitops_1.0-6 grid_3.2.4  nlme_3.1-126 xtable_1.8-2 gtable_0.2.0 scales_0.4.0 XVector_0.10.0 genefilter_1.52.1  latticeExtra_0.6-28 futile.logger_1.4.1 Formula_1.2-1 lambda.r_1.1.7 RColorBrewer_1.1-2 tools_3.2.4