Question

goana and "No annotated genes found in universe" even with Entrez gene ids

1

Entering edit mode

matrs ▴ 10

@matrs-15100

Last seen 4.3 years ago

I'm using edgeR to perform a a RNA-seq differential expression analysis with S. cerevisiae samples and then trying to use goana. My genes ids are from ensembl so I added a column with entrezgene ids to the DGELRT object, which is coming from the glmQLFTest function. I'm not sure where in the DGLRT object I have to add the column with entrezgene ids, in the manual says:

...or the name of the column of de$genes containing the Gene IDs.

But I cannot find which element in the DGELRT object is de$genes, so I added to de$table and I'm passing that to goana. I also tried to add the entregene ids to dgelist_qltest$geneid but goana fails in the same way

> class(dgelist_qltest)
[1] "DGELRT"
attr(,"package")
[1] "edgeR"

> goana.DGELRT(de=dgelist_qltest,  geneid=dgelist_qltest$table$entrezgene_id)
Error in goana.default(de = DEGenes, universe = universe, ...) : 
  No annotated genes found in universe 

> head(dgelist_qltest$table)
             logFC    logCPM          F       PValue entrezgene_id
YAL001C  0.2240038  6.943048 14.1162483 0.0022238429        851262
YAL002W  0.2994297  6.364025 18.9983099 0.0006987048        851261
YAL003W -0.3346512 10.272637 10.8792050 0.0054613104        851260
YAL004W  0.1465868  7.660092  0.8873887 0.3657633848            NA
YAL005C  0.2767830 11.574323  4.7249463 0.0479133341        851259
YAL007C -0.2657464  7.172642  8.8780937 0.0102093078        851226

#When I pass this to goana, fails with the same error: 
> head(dgelist_qltest$geneid)
[1] 851262 851261 851260     NA 851259 851226

Related to this, when I use goana with (or without) the geneid parameter and species=... I get this error:

> goana(de=dgelist_qltest,  geneid=dgelist_qltest$table$entrezgene_id, species='Sc')
Error in goana.default(de = DEGenes, universe = universe, ...) : 
  org.Sc.eg.db package required but not not installed (or can't be loaded)

Where the name of the db isn't found because in the case of S. cerevisiae is org.Sc.sgd.db , which does have "sgd" instead of the "eg" in the name.

goana edger • 6.0k views

ADD COMMENT • link updated 3.1 years ago by Gordon Smyth 52k • written 5.6 years ago by matrs ▴ 10

score 3 · Accepted Answer · 2019-07-25

3

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Your DGELRT object dgelist_qltest is a list, with a genes item that you can put things in. In general when you make the DGEList to begin with, you would put all the annotation in at that point. You could do something like

## here I am assuming that your counts were in an object called 'counts'
## with SGD IDs as the row names, which seems to be a reasonable assumption
annot <- select(org.Sc.sgd.db, row.names(counts), "ENTREZID")

dgelist <- DGEList(counts, genes = annot)

And then in your call to goana you would just use the argument geneid = "ENTREZID"

ADD COMMENT • link 5.6 years ago James W. MacDonald 68k

0

Entering edit mode

I should note that there appears to be a 1-1 correspondence between SGD and Gene IDs:

> z <- select(org.Sc.sgd.db, keys(org.Sc.sgd.db), "ENTREZID")
'select()' returned 1:1 mapping between keys and columns

But if you didn't get the message that there is a 1:1 mapping, you would need to account for that, because you would end up with more rows in your annotation data.frame than in your counts, and the annotation wouldn't match up. How one does that is perhaps beyond the scope of this thread,

ADD REPLY • link 5.6 years ago James W. MacDonald 68k

0

Entering edit mode

And never mind all that. You won't be able to use goana; as the help page says, you can look at ?alias2Symbol to get all the organisms that you can use:

    'species' can be any character string XX for which an organism
     package org.XX.eg.db exists and is installed. The only requirement
     of the organism package is that it contains objects
     'org.XX.egALIAS2EG' and 'org.XX.egSYMBOL' linking the aliases and
     symbols to Entrez Gene Ids. At the time of writing, the following
     organism packages are available from Bioconductor 3.6:

         Package            Species             
         org.Ag.eg.db       Anopheles           
         org.Bt.eg.db       Bovine              
         org.Ce.eg.db       Worm                
         org.Cf.eg.db       Canine              
         org.Dm.eg.db       Fly                 
         org.Dr.eg.db       Zebrafish           
         org.EcK12.eg.db    E coli strain K12   
         org.EcSakai.eg.db  E coli strain Sakai 
         org.Gg.eg.db       Chicken             
         org.Hs.eg.db       Human               
         org.Mm.eg.db       Mouse               
         org.Mmu.eg.db      Rhesus              
         org.Pt.eg.db       Chimp               
         org.Rn.eg.db       Rat                 
         org.Ss.eg.db       Pig                 
         org.Xl.eg.db       Xenopus

Which doesn't include org.Sc.sgd.db. You should switch to goseq, which does work with Saccharomyces.

ADD REPLY • link 5.6 years ago James W. MacDonald 68k

0

Entering edit mode

Thank you for your answers, I did use goseq.

ADD REPLY • link 5.6 years ago matrs ▴ 10

0

Entering edit mode

That's a point where some of my confusion comes, my DGLRT object doesn't have a genes item and from the manual I don't get that I have to create it. Is It supposed to have a genes item?

> names(dgelist_qltest)
 [1] "coefficients"          "fitted.values"         "deviance"              "method"               
 [5] "unshrunk.coefficients" "df.residual"           "design"                "offset"               
 [9] "dispersion"            "prior.count"           "AveLogCPM"             "df.residual.zeros"    
[13] "df.prior"              "var.post"              "var.prior"             "samples"              
[17] "table"                 "comparison"            "df.test"               "df.total"             
[21] "entrezgene_id"         "geneid"

The two last items were created by me.

ADD REPLY • link 5.6 years ago matrs ▴ 10

0

Entering edit mode

Yes. See my first answer. As an example, cribbed from ?DGEList

>  y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
> library(org.Hs.eg.db)
## pretend like these are human genes
> row.names(y) <- keys(org.Hs.eg.db)[1:nrow(y)]

> annot <- select(org.Hs.eg.db, row.names(y), c("SYMBOL","GENENAME"))
'select()' returned 1:1 mapping between keys and columns
> dglst <- DGEList(y, genes = annot)
> dglst <- calcNormFactors(dglst)
> design <- model.matrix(~gl(2,2))
> dglst <- estimateDisp(dglst, design)
> fit <- glmQLFit(dglst, design)
> tst <- glmQLFTest(fit, 2)
> topTags(tst)
Coefficient:  gl(2, 2)2 
     ENTREZID  SYMBOL                                         GENENAME
2254     2254    FGF9                       fibroblast growth factor 9
94         94  ACVRL1                   activin A receptor like type 1
2086     2086  ERV3-1 endogenous retrovirus group 3 member 1, envelope
2856     2856   GPR33  G protein-coupled receptor 33 (gene/pseudogene)
772       772    CACD              central areolar choroidal dystrophy
2159     2159     F10                             coagulation factor X
861       861   RUNX1              runt related transcription factor 1
1368     1368     CPM                               carboxypeptidase M
1318     1318 SLC31A2                solute carrier family 31 member 2
1497     1497    CTNS        cystinosin, lysosomal cystine transporter
         logFC   logCPM         F      PValue       FDR
2254  6.596329 9.324105 12.942278 0.004170561 0.9908227
94    6.478047 9.235095 12.012533 0.005257713 0.9908227
2086  6.109880 8.971395 11.097805 0.006671991 0.9908227
2856  6.024536 8.912995 10.454189 0.007942445 0.9908227
772  -5.732565 8.721117  9.549125 0.010252893 0.9908227
2159 -5.847485 8.794307  9.467513 0.010498265 0.9908227
861  -5.719758 8.713577  9.169355 0.011456287 0.9908227
1368 -5.617202 8.649418  9.123042 0.011614292 0.9908227
1318 -6.046482 8.926302  8.898173 0.012419446 0.9908227
1497  5.609782 8.645777  8.810166 0.012752585 0.9908227

> z <- goana(tst, "ENTREZID")
No DE genes

ADD REPLY • link 5.6 years ago James W. MacDonald 68k

0

Entering edit mode

I see, I never use genes= in this step:

dglst <- DGEList(y, genes = annot)

What happens is that I'm coming from salmon transcript counts -->tximport gene level -->edgeR. So when I go from transcripts level to gene level, I use the direct translation from SGD (same as ensembl) transcripts to SGD genes. Of course in that step I also can map those to entrez. Anyways, I'll modify my workflow to include the genes parameter. Thanks for your help.

ADD REPLY • link 5.6 years ago matrs ▴ 10

0

Entering edit mode

The genes component is documented as being optional, see ?"DGEList-class" or ?"DGELRT-class". You can include the genes data.frame when you create the DGEList object with DGEList() or you can add it later on at any time by:

dge$genes <- data.frame(GeneID=geneid)

or

dge_qltest$genes <- data.frame(GeneID=geneid)

or similar. The only requirement is that genes has the same number of rows as counts. We didn't specially document the ability to add the genes component on the fly because it's just standard base R programming.

What you can't do it is to add undocumentated components to an edgeR object and expect edgeR to take any notice of them.

ADD REPLY • link 5.6 years ago Gordon Smyth 52k

score 2 · Accepted Answer · 2019-07-25

What the error messages mean

It is good that you are reading the goana help page. You need to read the help pages ?goana.DGELRT and ?goana in conjunction. I will make some comments to try to help you interpret R help pages like this one.

goana.DGELRT(de=dgelist_qltest,

You never need to use goana.DGELRT() explicitly. Just use goana(dgelist_qltest, and the goana.DGELRT() function will be called automatically because of the class of dgelist_qltest.

No annotated genes found in universe

Well, this is because you tried to do a GO analysis for homo sapiens. Look at the usage line on the help page and you will see that the default setting for species is species = Hs. None of the row names in your object are human Entrez Gene Ids, hence the message.

I cannot find which element in the DGELRT object is de$genes

The help page tells you that de is a name for the first argument entered to goana and you have written in your own code that de=dgelist_qltest. So de$genes is dgelist_qltest$genes.

If dgelist_qltest$genes already exists, then you can add the Entrez Gene Ids by

dgelist_qltest$genes$GeneID <- geneid

or similar. If it doesn't exist, then you can add it by

dgelist_qltest$genes <- data.frame(GeneID=geneid)

This is just base R programming and follows from the fact that dge_qltest is a list.

so I added to de$table

Well, that won't work. You have to do what the help pages says.

org.Sc.eg.db package required but not not installed

As James as explained, goana only works for species for which a Bioconductor Entrez Id based organism package exists. goana can't use org.Sc.sgd.db.

Work around

As James has advised, a good solution is to use the goseq package instead of goana for this species.

Alternatively, you can always used kegga instead of goana for non-standard species. In this case you can use:

library("org.Sc.sgd.db")
ORF.GO <- toTable(org.Sc.sgdGO2ALLORFS)
library(GO.db)
GO.Name <- toTable(GOTERM)
GO.Name <- GO.Name[, c("go_id","Term")]
k <- kegga(dgelist_qltest, gene.pathway=ORF.GO, pathway.names=GO.Name)
topKEGG(k)

and the results will be exactly as if goana supported this species. Note that you don't need to convert geneids in this approach.