goseq: not able to limit by GO categories when using wallenius approximation for over-representation test
Entering edit mode
Last seen 5.4 years ago
University of Illinois, Urbana-Champaign

Hi Guys

Here I am again with another question about goseq.

I am following the goseq tutorial. I am manually inserting my length and go data, since the combination o fmy species genome and gene ID is not supported by goseq. 

I ran the following code, and it ran fine as you can see in my output:

GO.wall <- goseq(pwf, gene2cat = Biomart.GOdata)

#        category over_represented_pvalue under_represented_pvalue numDEInCat numInCat                                     term ontology
# 962  GO:0003735            6.729949e-07                0.9999998         29      145       structural constituent of ribosome       MF
# 1820 GO:0005840            6.614389e-06                0.9999978         28      157                                 ribosome       CC
# 8793 GO:0055114            7.380059e-06                0.9999970         38      381              oxidation-reduction process       BP
# 2690 GO:0008137            1.940115e-04                0.9999737          8       25 NADH dehydrogenase (ubiquinone) activity       MF
# 6763 GO:0042773            2.154811e-04                0.9999943          4        6 ATP synthesis coupled electron transport       BP
# 3671 GO:0016020            3.352178e-04                0.9997780        131     2317                                 membrane       CC 

However, when I try to limit by GO category ("GO:MF" for example), it keeps accounting for all GO categories in the calculation, giving the same result as above instead of limiting it.

GO.MF = goseq(pwf, gene2cat = Biomart.GOdata, test.cats = c("GO:MF"))

## same output as above, including BP, CC and MF instead of MF only.

Has anybody had the same problem before?

Any help is appreciated!


R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8  
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C           

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.22.13 biomaRt_2.26.1          goseq_1.22.0            geneLenDataBase_1.6.0   BiasedUrn_1.07          GOstats_2.36.0        
[7] graph_1.48.0            Category_2.36.0         GO.db_3.2.2             Matrix_1.2-4            org.Bt.eg.db_3.2.3      AnnotationDbi_1.32.3  
[13] edgeR_3.12.0            limma_3.26.9            affycoretools_1.42.0    Biobase_2.30.0          WGCNA_1.51              RSQLite_1.0.0         
[19] DBI_0.3.1               fastcluster_1.1.16      dynamicTreeCut_1.63-1   rtracklayer_1.30.3      GenomicRanges_1.22.4    GenomeInfoDb_1.6.3    
[25] IRanges_2.4.8           S4Vectors_0.8.11        BiocGenerics_0.16.1    

loaded via a namespace (and not attached):
[1] nlme_3.1-126               bitops_1.0-6               matrixStats_0.50.1         doParallel_1.0.10          RColorBrewer_1.1-2       
[6] gcrma_2.42.0               tools_3.2.4                affyio_1.40.0              KernSmooth_2.23-15         rpart_4.1-10             
[11] mgcv_1.8-12                Hmisc_3.17-2               colorspace_1.2-6           nnet_7.3-12                gridExtra_2.2.1          
[16] GGally_1.0.1               DESeq2_1.10.1              bit_1.1-12                 preprocessCore_1.32.0      ggbio_1.18.5             
[21] caTools_1.17.1             scales_0.4.0               genefilter_1.52.1          affy_1.48.0                RBGL_1.46.0              
[26] stringr_1.0.0              Rsamtools_1.22.0           foreign_0.8-66             R.utils_2.2.0              AnnotationForge_1.12.2   
[31] XVector_0.10.0             dichromat_2.0-0            BSgenome_1.38.0            PFAM.db_3.2.2              impute_1.44.0            
[36] BiocInstaller_1.20.1       hwriter_1.3.2              gtools_3.5.0               BiocParallel_1.4.3         acepack_1.3-3.3          
[41] R.oo_1.20.0                VariantAnnotation_1.16.4   RCurl_1.95-4.8             magrittr_1.5               Formula_1.2-1            
[46] oligoClasses_1.32.0        futile.logger_1.4.1        Rcpp_0.12.3                munsell_0.4.3              R.methodsS3_1.7.1        
[51] stringi_1.0-1              SummarizedExperiment_1.0.2 zlibbioc_1.16.0            gplots_2.17.0              plyr_1.8.3               
[56] grid_3.2.4                 gdata_2.17.0               ReportingTools_2.10.0      lattice_0.20-33            Biostrings_2.38.4        
[61] splines_3.2.4              annotate_1.48.0            locfit_1.5-9.1             knitr_1.12.3               geneplotter_1.48.0       
[66] reshape2_1.4.1             codetools_0.2-14           futile.options_1.0.0       XML_3.98-1.4               RcppArmadillo_0.6.600.4.0
[71] biovizBase_1.18.0          latticeExtra_0.6-28        lambda.r_1.1.7             foreach_1.4.3              gtable_0.2.0             
[76] reshape_0.8.5              ggplot2_2.1.0              xtable_1.8-2               ff_2.2-13                  survival_2.38-3          
[81] OrganismDbi_1.12.1         iterators_1.0.8            GenomicAlignments_1.6.3    cluster_2.0.3              GSEABase_1.32.0     


goseq • 1.4k views
Entering edit mode
Last seen 3.4 years ago

Hello again,

The test.cats option only works when you use the gene2cat mapping that is native to goseq. If you supply your own categories through gene2cat this option won't work (mostly because it's often not GO groupings that are passed to this option). If you want to limit the analysis to MF for example, you could filter Biomart.GOdata for just MF terms and then pass that to gene2cat.

Cheers, Nadia.

Entering edit mode

Thank you for the clarification!!

Entering edit mode

Would it make sense to run the analysis with all terms (including all groupings) then filter them afterwards, just before the multiple testing? Or will this affect anything statistically?


GO_Age_dwn<-goseq(pwfAge_dwn, gene2cat = CukeGO)
GO_Age_dwn_BP<-subset(GO_Age_dwn, ontology == "BP")


Entering edit mode

Hi Ben,

The only thing it will affect (with regards to filtering before or after) is the probability weighting function, if you filter before creating the pwf, the pwf will be made based only on genes for that ontology (BP) if you filter after it will construct the pwf based on all genes. The difference should be be quite small though, and i doubt would make a large difference either way around.



Entering edit mode

Thanks for the quick reply. 

I wonder then why while using a supported genome, filtering by ontology is done in the goseq function which comes after the pwf function. Does nullp even take the GO in to account? I wasn't suggesting filtering the genes by ontology, rather subsetting the results of goseq, as in my code.

I guess what I see as a downstream issue is multiple testing. If one were to run FDR correction on all ontologies vs only one that would change the number of pvals corrected thus affecting the adjusted p-values.

Entering edit mode

You are fine to do as you suggested (run the pwf function first then subset, as your code implies). That is the default because in that way you use all genes in your data to build the pwf (rather than just the ones from your specific GO term).

As long as you make your decision about which GO terms to test *before* looking at the resulting p-values from GOseq, you are fine to subset on just one GO term before running your multiple testing correction (which will of course give you smaller adjusted p-values than if you had still kept in all the GO terms).


So bottom line, what you suggest will work fine


Login before adding your answer.

Traffic: 446 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6