Question: goseq: not able to limit by GO categories when using wallenius approximation for over-representation test
0
gravatar for mrodrigues.fernanda
3.6 years ago by
University of Illinois, Urbana-Champaign
mrodrigues.fernanda10 wrote:

Hi Guys

Here I am again with another question about goseq.

I am following the goseq tutorial. I am manually inserting my length and go data, since the combination o fmy species genome and gene ID is not supported by goseq. 

I ran the following code, and it ran fine as you can see in my output:

GO.wall <- goseq(pwf, gene2cat = Biomart.GOdata)
head(GO.wall)

#        category over_represented_pvalue under_represented_pvalue numDEInCat numInCat                                     term ontology
# 962  GO:0003735            6.729949e-07                0.9999998         29      145       structural constituent of ribosome       MF
# 1820 GO:0005840            6.614389e-06                0.9999978         28      157                                 ribosome       CC
# 8793 GO:0055114            7.380059e-06                0.9999970         38      381              oxidation-reduction process       BP
# 2690 GO:0008137            1.940115e-04                0.9999737          8       25 NADH dehydrogenase (ubiquinone) activity       MF
# 6763 GO:0042773            2.154811e-04                0.9999943          4        6 ATP synthesis coupled electron transport       BP
# 3671 GO:0016020            3.352178e-04                0.9997780        131     2317                                 membrane       CC 

However, when I try to limit by GO category ("GO:MF" for example), it keeps accounting for all GO categories in the calculation, giving the same result as above instead of limiting it.

GO.MF = goseq(pwf, gene2cat = Biomart.GOdata, test.cats = c("GO:MF"))
head(GO.MF)

## same output as above, including BP, CC and MF instead of MF only.

Has anybody had the same problem before?

Any help is appreciated!

sessionInfo()

R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8  
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.22.13 biomaRt_2.26.1          goseq_1.22.0            geneLenDataBase_1.6.0   BiasedUrn_1.07          GOstats_2.36.0        
[7] graph_1.48.0            Category_2.36.0         GO.db_3.2.2             Matrix_1.2-4            org.Bt.eg.db_3.2.3      AnnotationDbi_1.32.3  
[13] edgeR_3.12.0            limma_3.26.9            affycoretools_1.42.0    Biobase_2.30.0          WGCNA_1.51              RSQLite_1.0.0         
[19] DBI_0.3.1               fastcluster_1.1.16      dynamicTreeCut_1.63-1   rtracklayer_1.30.3      GenomicRanges_1.22.4    GenomeInfoDb_1.6.3    
[25] IRanges_2.4.8           S4Vectors_0.8.11        BiocGenerics_0.16.1    

loaded via a namespace (and not attached):
[1] nlme_3.1-126               bitops_1.0-6               matrixStats_0.50.1         doParallel_1.0.10          RColorBrewer_1.1-2       
[6] gcrma_2.42.0               tools_3.2.4                affyio_1.40.0              KernSmooth_2.23-15         rpart_4.1-10             
[11] mgcv_1.8-12                Hmisc_3.17-2               colorspace_1.2-6           nnet_7.3-12                gridExtra_2.2.1          
[16] GGally_1.0.1               DESeq2_1.10.1              bit_1.1-12                 preprocessCore_1.32.0      ggbio_1.18.5             
[21] caTools_1.17.1             scales_0.4.0               genefilter_1.52.1          affy_1.48.0                RBGL_1.46.0              
[26] stringr_1.0.0              Rsamtools_1.22.0           foreign_0.8-66             R.utils_2.2.0              AnnotationForge_1.12.2   
[31] XVector_0.10.0             dichromat_2.0-0            BSgenome_1.38.0            PFAM.db_3.2.2              impute_1.44.0            
[36] BiocInstaller_1.20.1       hwriter_1.3.2              gtools_3.5.0               BiocParallel_1.4.3         acepack_1.3-3.3          
[41] R.oo_1.20.0                VariantAnnotation_1.16.4   RCurl_1.95-4.8             magrittr_1.5               Formula_1.2-1            
[46] oligoClasses_1.32.0        futile.logger_1.4.1        Rcpp_0.12.3                munsell_0.4.3              R.methodsS3_1.7.1        
[51] stringi_1.0-1              SummarizedExperiment_1.0.2 zlibbioc_1.16.0            gplots_2.17.0              plyr_1.8.3               
[56] grid_3.2.4                 gdata_2.17.0               ReportingTools_2.10.0      lattice_0.20-33            Biostrings_2.38.4        
[61] splines_3.2.4              annotate_1.48.0            locfit_1.5-9.1             knitr_1.12.3               geneplotter_1.48.0       
[66] reshape2_1.4.1             codetools_0.2-14           futile.options_1.0.0       XML_3.98-1.4               RcppArmadillo_0.6.600.4.0
[71] biovizBase_1.18.0          latticeExtra_0.6-28        lambda.r_1.1.7             foreach_1.4.3              gtable_0.2.0             
[76] reshape_0.8.5              ggplot2_2.1.0              xtable_1.8-2               ff_2.2-13                  survival_2.38-3          
[81] OrganismDbi_1.12.1         iterators_1.0.8            GenomicAlignments_1.6.3    cluster_2.0.3              GSEABase_1.32.0     

 

goseq • 933 views
ADD COMMENTlink modified 3.6 years ago by Nadia Davidson300 • written 3.6 years ago by mrodrigues.fernanda10
Answer: goseq: not able to limit by GO categories when using wallenius approximation for
3
gravatar for Nadia Davidson
3.6 years ago by
Australia
Nadia Davidson300 wrote:

Hello again,

The test.cats option only works when you use the gene2cat mapping that is native to goseq. If you supply your own categories through gene2cat this option won't work (mostly because it's often not GO groupings that are passed to this option). If you want to limit the analysis to MF for example, you could filter Biomart.GOdata for just MF terms and then pass that to gene2cat.

Cheers, Nadia.

ADD COMMENTlink written 3.6 years ago by Nadia Davidson300

Thank you for the clarification!!

ADD REPLYlink written 3.6 years ago by mrodrigues.fernanda10

Would it make sense to run the analysis with all terms (including all groupings) then filter them afterwards, just before the multiple testing? Or will this affect anything statistically?

ie:

GO_Age_dwn<-goseq(pwfAge_dwn, gene2cat = CukeGO)
GO_Age_dwn_BP<-subset(GO_Age_dwn, ontology == "BP")

 

ADD REPLYlink written 3.4 years ago by Ben Mansfeld0

Hi Ben,

The only thing it will affect (with regards to filtering before or after) is the probability weighting function, if you filter before creating the pwf, the pwf will be made based only on genes for that ontology (BP) if you filter after it will construct the pwf based on all genes. The difference should be be quite small though, and i doubt would make a large difference either way around.

Cheers,

Anthony

ADD REPLYlink written 3.4 years ago by anthony.hawkins10

Thanks for the quick reply. 

I wonder then why while using a supported genome, filtering by ontology is done in the goseq function which comes after the pwf function. Does nullp even take the GO in to account? I wasn't suggesting filtering the genes by ontology, rather subsetting the results of goseq, as in my code.

I guess what I see as a downstream issue is multiple testing. If one were to run FDR correction on all ontologies vs only one that would change the number of pvals corrected thus affecting the adjusted p-values.

ADD REPLYlink written 3.4 years ago by Ben Mansfeld0
1

You are fine to do as you suggested (run the pwf function first then subset, as your code implies). That is the default because in that way you use all genes in your data to build the pwf (rather than just the ones from your specific GO term).

As long as you make your decision about which GO terms to test *before* looking at the resulting p-values from GOseq, you are fine to subset on just one GO term before running your multiple testing correction (which will of course give you smaller adjusted p-values than if you had still kept in all the GO terms).

 

So bottom line, what you suggest will work fine

ADD REPLYlink written 3.4 years ago by anthony.hawkins10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 285 users visited in the last hour