Question: topGO: GO terms in gene annotation do not match significant GO term output
0
6 months ago by
jgg0
jgg0 wrote:

Hi!

I'm having trouble establishing a comparison between GO terms in the original annotation of the gene subset chosen for enrichment and significant GO terms in the GenTable output of topGO

When I retrieve the gene IDs annotated with each GO term using genesInTerm(), their original annotation does not contain the specific GO term in the GenTable() output they are associated with.

For instance,

running genesInTerm(GOdata.2, "GO:0015175") shows me there are 28 genes with that GO term

However, in the original annotation file used for readMappings(), most of those 28 genes are nor annotated with "GO:0015175".

This is their MP annotation:

 1 GO:0004190;GO:0015194 2 GO:0004190;GO:0015194 3 GO:0015194 4 GO:0015194 5 GO:0015194 6 GO:0015194 7 GO:0015297;GO:0015180;GO:0015171;GO:0015189;GO:0005313;GO:0015179;GO:0015181;GO:0015185 8 GO:0015180;GO:0015297;GO:0015171;GO:0015181;GO:0015185;GO:0005313;GO:0015189;GO:0015179 9 GO:0015189;GO:0005313;GO:0015179;GO:0015181;GO:0015185;GO:0015297;GO:0015180;GO:0015171 10 GO:0005313;GO:0015189;GO:0015179;GO:0015181;GO:0015185;GO:0015180;GO:0015297;GO:0015171 11 GO:0015180;GO:0015297;GO:0015171;GO:0005313;GO:0015189;GO:0015179;GO:0015185;GO:0015181 12 GO:0015185;GO:0015181;GO:0005313;GO:0015189;GO:0015179;GO:0015180;GO:0015297;GO:0015171 13 GO:0015185;GO:0015181;GO:0015179;GO:0015189;GO:0005313;GO:0015171;GO:0015297;GO:0015180 14 GO:0015185;GO:0015181;GO:0015179;GO:0005313;GO:0015189;GO:0015171;GO:0015180;GO:0015297 15 GO:0015185;GO:0015181;GO:0015179;GO:0015189;GO:0005313;GO:0015171;GO:0015297;GO:0015180 16 GO:0015179;GO:0005313;GO:0015189;GO:0015181;GO:0015185;GO:0015171;GO:0015180;GO:0015297 17 GO:0015181;GO:0015185;GO:0015179;GO:0015189;GO:0005313;GO:0015171;GO:0015297;GO:0015180 18 GO:0015185;GO:0015181;GO:0005313;GO:0015189;GO:0015179;GO:0015180;GO:0015297;GO:0015171 19 GO:0015293;GO:0015399;GO:0015175;GO:0015194;GO:0015172;GO:0015171;GO:0016756;GO:0015193;GO:0015186;GO:0005313;GO:0015180 20 GO:0015171;GO:0016756;GO:0015180;GO:0005313;GO:0015193;GO:0015186;GO:0015194;GO:0015172;GO:0015175;GO:0015293;GO:0015399 21 GO:0015171;GO:0015172;GO:0015175;GO:0015293;GO:0015399 22 GO:0015175;GO:0015293;GO:0015180;GO:0015174;GO:0015193;GO:0015186;GO:0005313;GO:0015171;GO:0015172;GO:0015194 23 GO:0015194;GO:0015172;GO:0015171;GO:0016756;GO:0015180;GO:0005313;GO:0015193;GO:0015186;GO:0015293;GO:0015399;GO:0015175 24 GO:0015175;GO:0015293;GO:0015171;GO:0015172 25 GO:0015293;GO:0015194;GO:0016756;GO:0015171;GO:0015180;GO:0015193;GO:0005313;GO:0015186 26 GO:0015293;GO:0015175;GO:0015172;GO:0015194;GO:0015171;GO:0015180;GO:0015174;GO:0015186;GO:0015193;GO:0005313 27 GO:0015172;GO:0015171;GO:0015174;GO:0015293;GO:0015399;GO:0015175 28 GO:0015171;GO:0016756;GO:0015180;GO:0015193;GO:0005313;GO:0015186;GO:0015194;GO:0015293

Why are genes mapped to a particular GO term they are not annotated with? Is this a result of the hierarchy of terms?

I would appreciate any help in understanding what is happening.

Thanks a lot for your help,

Cheers!

Joana

rnaseq topgo go enrichment • 179 views
modified 6 months ago by James W. MacDonald50k • written 6 months ago by jgg0

Could it be that these genes are child terms of the GO term of your interest? I don't remember the  code nor the documentation...

I think that is probably the issue here, but I just wanted to make sure as I'm guessing other people might have had the same problem. Unfortunately, I cannot check that this is the case for all l GOterms and genes in my list.

Joana

Answer: topGO: GO terms in gene annotation do not match significant GO term output
0
6 months ago by
United States
James W. MacDonald50k wrote:

It's not clear what an MP mapping might be, nor how you got that set of GO IDs, so it's difficult to say if there is a problem here or not. You should follow the posting guide, which asks for, at minimum, a self-contained example that people can run.

I can make a wild guess however, that you are misunderstanding how the GO DAG works, and how topGO does the analysis. If you have a gene that has a GO term (here a fake one I just made up) of Map Kinase Phosphorylation, to which it is directly appended, by default that gene is also appended to all ancestor GO terms for that term as well. So continuing on with my example, say the parent term for Map Kinase Phosphorylation is Kinase Phosphorylation, and that term's parent is Protein Phosphorylation, and that term's parent is Phosphorylation.

The gene in this example is directly appended to Map Kinase Phosphorylation, but by inference, since all ancestor terms are supersets of the offspring terms, it is also appended to all ancestor terms as well. If you map a gene ID to its direct GO term using something like

select(OrgDb, ID, "GO", "ENTREZID")

you don't get all the terms that it is appended to, but instead just get the direct terms. You would need to do something like

select(OrgDb, ID, "GOALL", "ENTREZID")

in order to get all the direct GO terms as well as those that are inferred by being ancestors to those terms.

Here is the example code for the output I mentioned.

Does it make more sense now? Thanks a lot for your help, Joana

####################################################################

library(topGO)

# Input files and gene IDs and background annotation with Sma3s format ----

# Get gene IDs for the enrichment

genes.UP <- read.csv("sigUP.csv", header=F)$V1 # data frame with the annotations of all genes in "genes.UP" UP.annotated <- read.csv("pathTo_UP_AnnotationFile.csv", header = T) # annotation file for all genes in the transcriptome # which will be subsetted to keep only the gene IDs and relevant GOterm annotations # eg: ATR45|c0_g1_i1 GO:0016021, GO:0030176, GO:0031090, GO:0005789 annot.file <- read.csv("pathTo_ALLgenes_AnnotationFile.tsv", sep = "\t") Go.F <- annot.file[,c(1,7)] Go.F$GO.F.ID <- str_replace_all(Go.F\$GO.F.ID, ";", ", ")

write.table(Go.F, file="GoF", row.names = F, col.names = F, quote = F, sep="\t")

# Go enrichment by MF (Molecular Function) ----

bg_genes <- names(GOesByF)

# Compare genes vs bg_genes

compared_genes <- factor(as.integer(bg_genes %in% genes.UP))

bg_genes <- names(compared_genes)

# create topGO object ----

GOdata.2 <- new("topGOdata", ontology = "MF",

allGenes = compared_genes,

annot = annFUN.gene2GO, gene2GO = GOesByF)

# Run Fisher test ---

resultFisher <- runTest(GOdata.2, algorithm = "classic", statistic = "fisher")

# create  table with enrichment result ----

allRes.byF.UP <- GenTable(GOdata.2, classicFisher = resultFisher, orderBy = "classicFisher", topNodes = 50)

# get all genes with a specific GOterm

genes <- genesInTerm(GOdata.2, "GO:0015175")

# subset those genes to get only those in my "UP" list

genes[[1]][genes[[1]] %in% genes.UP]  # this will output the 28 genes I mentioned in my first post, whose annotation in the mapping file "GoF" does not necessarily match "GO:0015175"

2

That's not helpful at all. You have a file (that nobody else has) and you show that you read it in and do some stuff with it, and you don't get the results you expect. Perhaps you missed the part where I said 'at a minimum you need to provide a self-contained example that people can run'? Unless people can run the example for themselves, there is no way for anybody to help.

But again, taking a flyer on what you might need, do note that you can get the upstream IDs from GO.db, using a SQL query

> con <- GO_dbconn()
> dbGetQuery(con, "select _id, go_id, term from go_term where go_id='GO:0015175';")
_id      go_id                                                  term
1 9743 GO:0015175 neutral amino acid transmembrane transporter activity
> dbGetQuery(con, "select  go_id, term from go_term inner join go_mf_parents using(_id) where go_mf_parents._parent_id='9743';")
go_id                                                                     term
1  GO:0005304                              L-valine transmembrane transporter activity
2  GO:0015188                          L-isoleucine transmembrane transporter activity
3  GO:0001761                          beta-alanine transmembrane transporter activity
4  GO:0005294 neutral L-amino acid secondary active transmembrane transporter activity
5  GO:0005295                             neutral amino acid:sodium symporter activity
6  GO:0015182                          L-asparagine transmembrane transporter activity
7  GO:0015186                           L-glutamine transmembrane transporter activity
8  GO:0015187                               glycine transmembrane transporter activity
9  GO:0015190                             L-leucine transmembrane transporter activity
10 GO:0015193                             L-proline transmembrane transporter activity
11 GO:0015195                           L-threonine transmembrane transporter activity
12 GO:0022858                               alanine transmembrane transporter activity
13 GO:0022889                                serine transmembrane transporter activity
14 GO:0033229                              cysteine transmembrane transporter activity
15 GO:0042970                            homoserine transmembrane transporter activity

And if ANY of those GO terms are appended to the gene you care about, then by definition 'neutral amino acid transmembrane transporter activity' is appended as well.