Question

How to run Goseq for list of ensembl gene IDs

0

Entering edit mode

Lucky • 0

@lucky-24920

Last seen 7 days ago

India

Hello,

I have the list of gene IDs which are ensembl Ids I wanted do GO analysis for this list of genes. How can I do this from the GOseq I went through vignette which is running the GOseq from the DE genes output, my genes list are not from the DE this are obtained according to my objective.

can anyone suggest me?

Gene IDs list is looking as

ENSMUSG00000061111

ENSMUSG00000062761

ENSMUSG00000063445

ENSMUSG00000066705

ENSMUSG00000070372

ENSMUSG00000073411

ENSMUSG00000073684

ENSMUSG00000074182

ENSMUSG00000078453

code What I have run data containes the above listed gene name

        gene=read.table("data", header=T)
        gene<- gene$gene_name
        pwf=nullp(gene,"mm10","ensGene")

Error what I got

        Can't find mm10/ensGene length data in genLenDataBase...
        A TxDb annotation package exists for mm10. Consider installing it to get the gene lengths.
       Trying to download from UCSC. This might take a couple of minutes. 
       Error in value[[3L]](cond) : 
       Length information for genome mm10 and gene ID ensGene is not available. You will have to specify bias.data manually.
       In addition: Warning messages:
       1: In library() :
       libraries ‘/usr/local/lib/R/site-library’, ‘/usr/lib/R/site-library’ contain no packages
      2: In grep(txdbPattern, installedPackages) :
      argument 'pattern' has length > 1 and only the first element will be used

Thank you

goseq • 3.0k views

ADD COMMENT • link 3.7 years ago Lucky • 0

0

Entering edit mode

It's not clear what your question is. You don't necessarily need a set of DE genes to run goseq, just a set of genes that you think are interesting for some reason. Is there something in particular you need help with?

ADD REPLY • link 3.7 years ago James W. MacDonald 67k

0

Entering edit mode

I have the txt file which containes the above list of genes, how can I run goseq. please help me with some code it will be very helpfull

Thank you

ADD REPLY • link 3.7 years ago Lucky • 0

0

Entering edit mode

I think you may misunderstand the purpose of this support site. The goal is to help people with specific questions about the software rather than providing code to analyze other people's data. The vignette is intended to help with the latter type of question, so I would recommend starting there.

ADD REPLY • link 3.7 years ago James W. MacDonald 67k

0

Entering edit mode

Hello James I understood the purpose of the site, I edited my question provided the code and the error while running can you please help me how to resolve

ADD REPLY • link 3.7 years ago Lucky • 0

score 1 · Answer 1 · 2021-03-31

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

So when you tried what you tried, you got this message:

Can't find mm10/ensGene length data in genLenDataBase...
A TxDb annotation package exists for mm10. Consider installing it to get the gene lengths.
Trying to download from UCSC. This might take a couple of minutes. 
Error in value[[3L]](cond) : 
Length information for genome mm10 and gene ID ensGene is not available. You will have to specify bias.data manually.

It's important to read and try to understand the error and warning messages that you get. Basically this is saying that the attempt to get the gene lengths automatically has failed and that you will have to specify these data manually. In general I think this is a pretty clear message?

If you haven't read the vignette, you should do so, particularly section 5.3, which covers this issue. Anyway, it says to do this

> tx <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.ensGene)
> ld <- median(width(tx))
> head(ld)
ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028 ENSMUSG00000000031 
           38867.0            15676.0            31174.0             1589.0 
ENSMUSG00000000037 ENSMUSG00000000049 
           95425.0            22637.5

And from ?nullp, there is this

bias.data: A numeric vector containing the data on which the DE may
          depend.  Usually this is the median transcript length of each
          gene in bp.  If set to 'NULL' 'nullp' will attempt to fetch
          length using 'getlength'.

So that's how you should figure out what you need, and how you get it. Knowing how to do that sort of thing is a necessary skill if you want to be able to use R or Bioconductor well, so it's worth learning.

ADD COMMENT • link 3.7 years ago James W. MacDonald 67k

0

Entering edit mode

Thank for the suggestions i will follow it up,

library(goseq)

library(GenomicFeatures)

library(TxDb.Mmusculus.UCSC.mm10.ensGene)

gene=read.table("data", header=T)

gene=gene$gene_ID

tx <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.ensGene)

ld<- median(width(tx))

length=ld[gene]

pwf=nullp(gene,bias.data=length)

again i got the error when I run the above code
Error in sum(y[ww][1:size]) : invalid 'type' (character) of argument

ADD REPLY • link 3.7 years ago Lucky • 0

0

Entering edit mode

I already pointed you to the help page for nullp, so that's what I am going to do, again. If you ever expect to get anywhere as a user of R, or with any Open Source software, you will need to get better at reading and understanding the help pages. So a couple of questions:

1.) What are you passing in as your 'gene' vector?

2.) Does it conform to the expectations for this function?

Do note that the idea behind a GO hypergeometric test is to have a set of genes, some of which are 'significant', where by significant I mean that you have decided, using some criteria, that those genes are of interest to you and all other genes (which comprise the 'universe' of such genes that you could have chosen, but didn't) are considered 'non-significant'. So when you read about the DEgenes argument, keep that in mind, because you probably don't want to be subsetting your ld object like that.

ADD REPLY • link 3.7 years ago James W. MacDonald 67k

0

Entering edit mode

for your question

only i have gene ids for that I have run gene=gene$gene_ID
I don't think so, data is not reaching the function expectation because of that I have raised the question

In goseq vignette its taking input from the deseq2. Taking < 0.05 fdr values consider as 1 and and rest as 0 in my data I can't do like that because i have list of Ids so raised the question how can I run the goseq. is it clear

ADD REPLY • link 3.7 years ago Lucky • 0

0

Entering edit mode

Right. So for DESeq2, they ran a test to get p-values and said anything with a p<0.05 is DE and anything else is not DE. And so they make this named vector of zeros and ones, where the names are the Ensembl Gene IDs, and it's 0 if the gene is not DE and 1 if it is.

You are saying that you have this set of genes that you think are interesting for some reason, right? And by definition that means that all the other murine genes are not interesting. That's essentially the same exact thing as what they do in the vignette, other than the criterion you used to say if a gene is interesting or not. So you need a named vector where the names are Ensembl Gene IDs and the values are 0 or 1 depending on whether or not you think the gene is interesting or not.