Hello!
I am using TopGO to analyze different lists of DEGs from maize obtained via analysis of RNA-seq data.
With the script below I get the list of overrepresented terms for the CC annotations. So far so good.
Now I would like to print out the significant genes associated with each of the overrepresented terms.
I am trying to use printGenes, but run into trouble defining what I should put in “chip” for my maize data.
In the script you will see the error message I get.
The first lines of the input data I am using looks like this:
Data to define the gene Universe maize_v3.agg.nr_CC.txt
Feature ID GO cellular component GRMZM2G104390 GO:0000228, GO:0005737, GO:0005667, GO:0000118 GRMZM2G104394 GO:0009534, GO:0048046, GO:0005783, GO:0009506, GO:0005739, GO:0009505, GO:0005774 GRMZM2G104396 GO:0016602, GO:0044451 GRMZM2G104397 GO:0012505, GO:0071944, GO:0005654, GO:0009507, GO:0005815, GO:0016020 List of DEGs UPREGULATED.txt GRMZM2G167669 GRMZM2G078279 GRMZM2G588623 GRMZM2G179092 GRMZM2G481440 GRMZM2G460594 GRMZM2G010356
Below you will also find my session Info.
I would appreciate your help figuring this one out.
Thanks in advance and looking forwards to your response
Kaperuza
# Analysis of cellular component GO Term enrichment library("topGO") geneID2GO <-readMappings(file="/Users/Kaperuza/maize_v3.agg.nr_CC.txt") geneUniverse <- names(geneID2GO) # Analysis list of DEGs genesOfInterest <-read.table("/Users/Kaperuza/UPREGULATED.txt", header=FALSE) genesOfInterest <- as.character(genesOfInterest$V1) geneList <- factor(as.integer(geneUniverse %in% genesOfInterest)) names(geneList) <- geneUniverse # Only use those ones which are annotated with at least 10 terms myGOdata <- new("topGOdata", description="UPREGULATED DESeq2", ontology="CC", allGenes=geneList, annot=annFUN.gene2GO, gene2GO=geneID2GO, nodeSize=10) sg <- sigGenes(myGOdata) str(sg) numSigGenes(myGOdata) test.stat <- new("weightCount", testStatistic = GOFisherTest, name = "Fisher test", sigRatio = "ratio") resultWeight <- getSigGroups(myGOdata, test.stat) allRes <- GenTable(myGOdata, weight = resultWeight, orderBy = "weight", ranksOf= "weight", topNodes =300) write.csv(as.data.frame(allRes), file="/Users/Kaperuza/UPREGULATED_DESeq2_topGO300_CC_Weight.csv”) #Trying to get the genes corresponding to each goid in the results gt <- printGenes(myGOdata, whichTerms = allRes$GO.ID, chip = "geneUniverse", numChar = 40) #Error in get(paste(chip, "ENTREZID", sep = "")) : # object 'geneUniverseENTREZID' not found
#traceback ()
5: get(paste(chip, "ENTREZID", sep = ""))
4: get(paste(chip, "ENTREZID", sep = ""))
3: .local(object, whichTerms, ...)
2: printGenes(myGOdata, whichTerms = allRes$GO.ID, chip = "geneUniverse",
numChar = 40)
1: printGenes(myGOdata, whichTerms = allRes$GO.ID, chip = "geneUniverse",
numChar = 40)
#The same happens if I use “GeneID2GO”
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] topGO_2.32.0 SparseM_1.77 GO.db_3.6.0 AnnotationDbi_1.42.1
[5] IRanges_2.14.10 S4Vectors_0.18.3 Biobase_2.40.0 graph_1.58.0
[9] BiocGenerics_0.26.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 locfit_1.5-9.1 lattice_0.20-35
[4] digest_0.6.15 GenomeInfoDb_1.16.0 plyr_1.8.4
[7] backports_1.1.2 acepack_1.4.1 RSQLite_2.1.1
[10] ggplot2_2.2.1 pillar_1.2.3 zlibbioc_1.26.0
[13] rlang_0.2.1 lazyeval_0.2.1 rstudioapi_0.7
[16] data.table_1.11.4 annotate_1.58.0 blob_1.1.1
[19] rpart_4.1-13 Matrix_1.2-14 checkmate_1.8.5
[22] splines_3.5.0 BiocParallel_1.14.1 geneplotter_1.58.0
[25] stringr_1.3.1 foreign_0.8-70 htmlwidgets_1.2
[28] RCurl_1.95-4.10 bit_1.1-14 munsell_0.5.0
[31] DelayedArray_0.6.1 compiler_3.5.0 pkgconfig_2.0.1
[34] base64enc_0.1-3 htmltools_0.3.6 nnet_7.3-12
[37] SummarizedExperiment_1.10.1 tibble_1.4.2 gridExtra_2.3
[40] htmlTable_1.12 GenomeInfoDbData_1.1.0 Hmisc_4.1-1
[43] matrixStats_0.53.1 XML_3.98-1.11 bitops_1.0-6
[46] grid_3.5.0 xtable_1.8-2 gtable_0.2.0
[49] DBI_1.0.0 magrittr_1.5 scales_0.5.0
[52] stringi_1.2.3 XVector_0.20.0 genefilter_1.62.0
[55] latticeExtra_0.6-28 Formula_1.2-3 RColorBrewer_1.1-2
[58] tools_3.5.0 bit64_0.9-7 DESeq2_1.20.0
[61] yaml_2.1.19 survival_2.42-4 colorspace_1.3-2
[64] cluster_2.0.7-1 GenomicRanges_1.32.3 memoise_1.1.0
[67] knitr_1.20
Another alternative would be to map your Maize DB IDs to NCBI GI numbers, and then you could use the AnnotationHub OrgDb directly with GOstats, which does include facilities to get the genes that are significant for each GO term. That would probably be much less work.
Thank you James for your feedback.
The original file of the new GO annotation lists all the GO terms associated to each gene, as shown below. To assign the gene names to my significant results I have to first parse out from this annotation the genes (and their GO terms) for each of my lists of DEGs and then use this subset to assign gene names to the go terms in my TopGO results.
I am still intrigued why this is not possible from inside TopGO...or is it and I just missed a turn?
It's not possible because that's not how the function was written. When people write code they envision a particular use case, and then write the function to fulfill that use case.
printGenes
was written specifically to extract the gene annotation data from aChipDb
package, and if you give it something different from that, it will fail, because the code isn't written to take any arbitrary input and 'do the right thing' with it. Your question is sort of like asking why you can't open an Excel document with Word. It's because that's not how it works.