Question: GOseq No Underrepresented Categories
gravatar for mart1139
17 months ago by
Purdue University
mart11390 wrote:

Hi All,

I am using GOseq as part of the Trinity pipeline to perform gene ontology testing for genes in a de novo transcriptome assembly. I used edgeR to perform DGE analysis comparing groups that were treated with a chemical to those that went untreated, and my output looks like the following:

> dim(go.df)
[1] 50 11
> head(go.df)
                     Gene sampleA   sampleB    logFC     logCPM       LR   PValue          FDR Transcriptome tissue
1  TRINITY_DN165600_c9_g1 control treatment 9.785901  1.3670613 25.66781 4.06e-07 8.839530e-04      meta_ge1  brain
2 TRINITY_DN169366_c12_g1 control treatment 9.574774  3.7314640 23.72559 1.11e-06 2.034114e-03      meta_ge1  brain
3  TRINITY_DN236596_c2_g2 control treatment 8.619980  2.9329807 21.75304 3.10e-06 5.069438e-03      meta_ge1  brain
4  TRINITY_DN138776_c1_g1 control treatment 8.542091  3.4077197 58.47233 2.06e-14 4.720000e-10      meta_ge1  brain
5  TRINITY_DN217803_c2_g1 control treatment 8.530312  0.1282462 31.91309 1.61e-08 5.680000e-05      meta_ge1  brain
6   TRINITY_DN85324_c0_g1 control treatment 7.463319 -0.7732755 26.05991 3.31e-07 8.120070e-04      meta_ge1  brain
1         LM
2         LM
3         LM
4         LM
5         LM
6         LM

In total, I had a very small number of DEGs (50) for this comparison and I have both positive and negative LogFC values. My transcriptome is large (composed of roughly 1.3 million transcripts) with ~83,000 transcripts having associated GO terms. After reading in the appropriate files (gene lengths, matrix containing background gene IDs, GO annotations from the Trinotate pipeline), I ran GOseq and found a good number (67) of over-represented GO categories, but I had zero under-represented categories, which makes me think something went awry in my R script. My general question is this: is it reasonable to expect to have a large number of over-represented categories and few to no under-represented categories? 

Admittedly, I am new to using GOseq and GO term analysis and I would greatly appreciate any insight on this issue. Below is a section of my code where the over- and under-represented categories are decided. I am also happy to provide the entire block of code (154 lines) if the section below isn't sufficient.



# perform functional enrichment testing for each category.

filename = c("enriched","depleted")

for (feature_cat in factor_list) {

  message('Processing category: ', feature_cat)

  gene_ids_in_feature_cat = rownames(factor_labeling)[factor_labeling$type == feature_cat]

  cat_genes_vec = as.integer(sample_set_gene_ids %in% gene_ids_in_feature_cat)

  pwf$DEgenes = cat_genes_vec

  res = goseq(pwf,gene2cat=GO_info_listed, use_genes_without_cat=TRUE)

  ## over-represented categories:

  pvals = res$over_represented_pvalue

  pvals[pvals > 1 - 1e-10] = 1 - 1e-10

  q = qvalue(pvals)

  res$over_represented_FDR = q$qvalues

  go_enrich_filename = paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],'.GOseq.enriched.txt', sep='')

  result_table = res[res$over_represented_pvalue<=0.05,]

  descr = unlist(lapply(result_table$category, get_GO_term_descr))

  result_table$go_term = descr;

  result_table$gene_ids =, lapply(result_table$category, function(x) { 

    gene_list = GO_to_gene_list[[x]]

    gene_list = gene_list[gene_list %in% rownames(factor_labeling)]

    paste(gene_list, collapse=', ');

  }) )

  write.table(result_table[order(result_table$over_represented_pvalue),], file=go_enrich_filename, sep=' ', quote=F, row.names=F)

  #search GO database to get more information for GO terms in enriched or depleted database; send it to text file

  sink(paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],"_","GOterms_treatment_enriched_description.txt", sep = ""))

  for(go in result_table$category){





  ## under-represented categories:

  pvals = res$under_represented_pvalue

  pvals[pvals>1-1e-10] = 1 - 1e-10

  q = qvalue(pvals)

  res$under_represented_FDR = q$qvalues

  go_depleted_filename = paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2], '.GOseq.depleted.txt', sep='')

  result_table = res[res$under_represented_pvalue<=0.05,]

  descr = unlist(lapply(result_table$category, get_GO_term_descr))

  result_table$go_term = descr;

  write.table(result_table[order(result_table$under_represented_pvalue),], file=go_depleted_filename, sep=' ', quote=F, row.names=F)

  #search GO database to get more information for GO terms in enriched or depleted database; send it to text file

  sink(paste("/Users/alexandermartinez/Desktop/Job_files&Scripts/Lamprey_scripts/Round2_extractions/annotation/",loop.list[[i]][1],'_',loop.list[[i]][2],"_","GOterms_treatment_depleted_description.txt", sep = ""))

  for(go in result_table$category){





> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] qvalue_2.8.0               GO.db_3.4.1                AnnotationDbi_1.38.2       goseq_1.28.0              
 [5] geneLenDataBase_1.12.0     BiasedUrn_1.07             colorspace_1.3-2           heatmap3_1.1.1            
 [9] Glimma_1.4.0               ape_5.0                    gplots_3.0.1               ctc_1.50.0                
[13] amap_0.8-14                DESeq2_1.16.1              SummarizedExperiment_1.6.5 DelayedArray_0.2.7        
[17] matrixStats_0.53.1         Biobase_2.36.2             GenomicRanges_1.28.6       GenomeInfoDb_1.12.3       
[21] IRanges_2.10.5             S4Vectors_0.14.7           BiocGenerics_0.22.1        edgeR_3.18.1              
[25] limma_3.32.10
goseq go enrichment • 241 views
ADD COMMENTlink modified 17 months ago • written 17 months ago by mart11390
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 335 users visited in the last hour