Question

GOSEQ: could not find any categories for many genes

Vivek.b ▴ 100

@vivekb-7661

Germany

Hello everyone.
I recently started using goseq to analyse RNA-Seq results from DESeq2. I have around 37K genes in total, of which ~3K are differentially expressed. I am using mm9 with ensGene IDs to run the goseq command. It works fine, but it gives me this warning:

For 16675 genes, we could not find any categories. These genes will be excluded.
To force their use, please run with use_genes_without_cat=TRUE (see documentation).
This was the default behavior for version 1.15.1 and earlier.

I would like to know whether excluding these genes has any effect on the test, and whether I should use the use_genes_without_cat=TRUE option to avoid excluding them.

My R version is 3.1.3 (2015-03-09), and my goseq version is goseq_1.16.2.

Thanks

goseq mm9 • 2.6k views

score 3 · Answer 1 · 2015-04-26

Hi,

Ensembl can have a lot of non-coding and other genes for which there's not yet any functional annotation, and I think this is why so many genes are missing a GO category in the database. You still have about 20K genes with a GO annotation to do the analysis, which is fine.

I'd recommend using the default settings, as the use_genes_without_cat=TRUE option is provided mostly so that users can reproduce their results from goseq version 1.15.1 and lower. Before version 1.15.2, goseq treated the genes without a GO term as if they had a GO term different from any of the ones being tested. This is not really the correct behaviour, since they are probably associated with one of the terms being tested; we just don't know which one. What tends to happen in this case is that the root GO terms come up as significant, which is not really desirable. From 1.15.2 onwards, the default is instead to ignore the unknown genes altogether and use only those with a GO annotation.
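To see why the root terms tend to come up as significant when unannotated genes are kept in the background, here is a rough toy calculation in plain Python. It is only a sketch: it uses an ordinary hypergeometric test rather than goseq's length-bias-corrected test, the helper name hypergeom_upper_tail is made up for illustration, and the figure of 2400 annotated DE genes is a hypothetical assumption (the question only gives ~37K genes, ~3K DE genes and 16675 genes without categories).

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k): chance of seeing at least k category genes among n DE genes,
    when K of the N background genes belong to the category."""
    denom = comb(N, n)
    numer = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return numer / denom


# Rounded figures from the question; de_annotated is a HYPOTHETICAL value.
total_genes = 37000
unannotated = 16675
annotated = total_genes - unannotated   # 20325 genes with at least one GO term
de_total = 3000
de_annotated = 2400                     # assumption: DE genes are mostly annotated

# A root GO term contains every annotated gene.

# Old behaviour (use_genes_without_cat=TRUE): unannotated genes stay in the
# background and count as "not in the root term".
p_with_unannotated = hypergeom_upper_tail(total_genes, annotated, de_total, de_annotated)

# Default behaviour: unannotated genes are dropped from both the background
# and the DE list before testing.
p_annotated_only = hypergeom_upper_tail(annotated, annotated, de_annotated, de_annotated)

print("root term p-value, unannotated genes kept:   ", p_with_unannotated)
print("root term p-value, unannotated genes dropped:", p_annotated_only)
```

Under these assumed numbers the first p-value is extremely small, simply because DE genes are more likely to be annotated at all, while the second is exactly 1, since every gene in the reduced background belongs to the root term. The real goseq test also accounts for gene length, so the actual values would differ.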

If you're unsure, you can always try both ways and see what happens.

Cheers,

Nadia.

score 1 · Answer 2 · 2015-04-26

Gene Ontology annotation tends to be gene-oriented, and there are no more than about 25k genes in the human genome. The ensGene IDs represent transcripts rather than genes, so it is not surprising that specific GO annotation does not exist for many of them. Personally, I would consider mapping the Ensembl IDs to gene symbols and redoing the GO analysis with gene-wise results. However, it is likely that your ensGene list already includes a representative transcript for most genes, so goseq may already be giving you a result similar to what you would get that way.

score 0 · Answer 3 · 2015-04-27
