Using goseq on non-model organism - how to define genome?
1
0
Entering edit mode
Jon Bråte ▴ 230
@jon-brate-6263
Last seen 13 months ago
Norway

Hi,

I have a few questions about using goseq on a non-model organism:

I have a set of DE-genes identified by DESeq2 called genes,

a vector of gene lengths called lengthData created like this:

txdb = makeTxDbFromGFF("../merged_fixed.gtf", format = ("gtf"))
txsByGene=transcriptsBy(txdb,"gene")
lengthData=median(width(txsByGene))

And I have a reference set of genes called backM which is a list of gene names

I am at this point of the goseq manual:

pwf=nullp(genes,"hg19","ensGene")

And my question is how do I refer to my non-model genome and what should I provide as id?

Thanks, Jon

goseq go genesetenrichment • 3.6k views
0
Entering edit mode

Is your organism supported by Annotation Hub? Try:

library(AnnotationHub)
hub = AnnotationHub()
query(hub, c("OrgDb"))
0
Entering edit mode

No it's not, but thanks for the Annotation Hub tip! I haven't used if before.

0
Entering edit mode
@gordon-smyth
Last seen 13 minutes ago
WEHI, Melbourne, Australia

Initially I said "Obviously you cannot do a GO analysis for a species for which no GO database exists".

However later Steve Lianglou pointed out that you can do a goseq analysis of any species if you provide your own GO mapping. I was not familiar with this part of the goseq package, so you should follow Steve's suggestions below.

1
Entering edit mode

Ok, I see. So it is not possible to use goseq on organisms not in supporedGenomes()? From the manual I got the impression that if I supplied gene length information and the GO-mapping manually I could still run it?

1
Entering edit mode

You can. You will have to provide a mapping of your gene id's to go (or whatever) id's wired up in a data.frame, too. It's doable

1
Entering edit mode

Thanks! I have created such a mapping with Blast2GO. But I guess this is used in the goseq function?

For nullp() I just skipped providing "genome" and "id" and entered the lengthData as bias.data and it seems to work. I did actually read the nullp documentation before posting the question, but it was not so obvious to me that genome and id could be skipped.

2
Entering edit mode

Now that you have the output from nullp and the list of differentially expressed genes, you just need to create the gene2cat object for a call to goseq. The documentation in ?goseq describes the format reasonably well. It is of the format that is returned from the getgo function as well. You can look at the help in ?getgo (and run the example code there) to demystify that a bit further, if necessary.