Hi Guys
Here I am again with another question about goseq.
I am following the goseq tutorial. I am manually inserting my length and go data, since the combination o fmy species genome and gene ID is not supported by goseq.
I ran the following code, and it ran fine as you can see in my output:
GO.wall <- goseq(pwf, gene2cat = Biomart.GOdata)
head(GO.wall)
# category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology
# 962 GO:0003735 6.729949e-07 0.9999998 29 145 structural constituent of ribosome MF
# 1820 GO:0005840 6.614389e-06 0.9999978 28 157 ribosome CC
# 8793 GO:0055114 7.380059e-06 0.9999970 38 381 oxidation-reduction process BP
# 2690 GO:0008137 1.940115e-04 0.9999737 8 25 NADH dehydrogenase (ubiquinone) activity MF
# 6763 GO:0042773 2.154811e-04 0.9999943 4 6 ATP synthesis coupled electron transport BP
# 3671 GO:0016020 3.352178e-04 0.9997780 131 2317 membrane CC
However, when I try to limit by GO category ("GO:MF" for example), it keeps accounting for all GO categories in the calculation, giving the same result as above instead of limiting it.
GO.MF = goseq(pwf, gene2cat = Biomart.GOdata, test.cats = c("GO:MF")) head(GO.MF) ## same output as above, including BP, CC and MF instead of MF only.
Has anybody had the same problem before?
Any help is appreciated!
sessionInfo() R version 3.2.4 Revised (2016-03-16 r70336) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.4 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.22.13 biomaRt_2.26.1 goseq_1.22.0 geneLenDataBase_1.6.0 BiasedUrn_1.07 GOstats_2.36.0 [7] graph_1.48.0 Category_2.36.0 GO.db_3.2.2 Matrix_1.2-4 org.Bt.eg.db_3.2.3 AnnotationDbi_1.32.3 [13] edgeR_3.12.0 limma_3.26.9 affycoretools_1.42.0 Biobase_2.30.0 WGCNA_1.51 RSQLite_1.0.0 [19] DBI_0.3.1 fastcluster_1.1.16 dynamicTreeCut_1.63-1 rtracklayer_1.30.3 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 [25] IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] nlme_3.1-126 bitops_1.0-6 matrixStats_0.50.1 doParallel_1.0.10 RColorBrewer_1.1-2 [6] gcrma_2.42.0 tools_3.2.4 affyio_1.40.0 KernSmooth_2.23-15 rpart_4.1-10 [11] mgcv_1.8-12 Hmisc_3.17-2 colorspace_1.2-6 nnet_7.3-12 gridExtra_2.2.1 [16] GGally_1.0.1 DESeq2_1.10.1 bit_1.1-12 preprocessCore_1.32.0 ggbio_1.18.5 [21] caTools_1.17.1 scales_0.4.0 genefilter_1.52.1 affy_1.48.0 RBGL_1.46.0 [26] stringr_1.0.0 Rsamtools_1.22.0 foreign_0.8-66 R.utils_2.2.0 AnnotationForge_1.12.2 [31] XVector_0.10.0 dichromat_2.0-0 BSgenome_1.38.0 PFAM.db_3.2.2 impute_1.44.0 [36] BiocInstaller_1.20.1 hwriter_1.3.2 gtools_3.5.0 BiocParallel_1.4.3 acepack_1.3-3.3 [41] R.oo_1.20.0 VariantAnnotation_1.16.4 RCurl_1.95-4.8 magrittr_1.5 Formula_1.2-1 [46] oligoClasses_1.32.0 futile.logger_1.4.1 Rcpp_0.12.3 munsell_0.4.3 R.methodsS3_1.7.1 [51] stringi_1.0-1 SummarizedExperiment_1.0.2 zlibbioc_1.16.0 gplots_2.17.0 plyr_1.8.3 [56] grid_3.2.4 gdata_2.17.0 ReportingTools_2.10.0 lattice_0.20-33 Biostrings_2.38.4 [61] splines_3.2.4 annotate_1.48.0 locfit_1.5-9.1 knitr_1.12.3 geneplotter_1.48.0 [66] reshape2_1.4.1 codetools_0.2-14 futile.options_1.0.0 XML_3.98-1.4 RcppArmadillo_0.6.600.4.0 [71] biovizBase_1.18.0 latticeExtra_0.6-28 lambda.r_1.1.7 foreach_1.4.3 gtable_0.2.0 [76] reshape_0.8.5 ggplot2_2.1.0 xtable_1.8-2 ff_2.2-13 survival_2.38-3 [81] OrganismDbi_1.12.1 iterators_1.0.8 GenomicAlignments_1.6.3 cluster_2.0.3 GSEABase_1.32.0
Thank you for the clarification!!
Would it make sense to run the analysis with all terms (including all groupings) then filter them afterwards, just before the multiple testing? Or will this affect anything statistically?
ie:
Hi Ben,
The only thing it will affect (with regards to filtering before or after) is the probability weighting function, if you filter before creating the pwf, the pwf will be made based only on genes for that ontology (BP) if you filter after it will construct the pwf based on all genes. The difference should be be quite small though, and i doubt would make a large difference either way around.
Cheers,
Anthony
Thanks for the quick reply.
I wonder then why while using a supported genome, filtering by ontology is done in the goseq function which comes after the pwf function. Does nullp even take the GO in to account? I wasn't suggesting filtering the genes by ontology, rather subsetting the results of goseq, as in my code.
I guess what I see as a downstream issue is multiple testing. If one were to run FDR correction on all ontologies vs only one that would change the number of pvals corrected thus affecting the adjusted p-values.
You are fine to do as you suggested (run the pwf function first then subset, as your code implies). That is the default because in that way you use all genes in your data to build the pwf (rather than just the ones from your specific GO term).
As long as you make your decision about which GO terms to test *before* looking at the resulting p-values from GOseq, you are fine to subset on just one GO term before running your multiple testing correction (which will of course give you smaller adjusted p-values than if you had still kept in all the GO terms).
So bottom line, what you suggest will work fine