goseq code after DESeq2 -NO IDEA!
1
1
Entering edit mode
amyfm ▴ 10
@amyfm-9084
Last seen 8.3 years ago
Ireland

Hi,

After DESeq2 analysis of my RNAseq data  in order to obtain differentially expressed genes between 2 cell types, I have a csv file with approximatelly 26000 genes, of which around 6000 genes are differentially expressed (padjustedvalue < 0.05). Now I want to use goseq in order to study pathways.

I am a very amateur R user and I am starting to learn about all this, so I have no idea what I have to do now. By checking this website http://www.bioconductor.org/packages/release/bioc/vignettes/goseq/inst/doc/goseq.pdf, I understand that I have to create 2 vectors, one with all my genes, and other with the DEGs. After that, I get lost. What I have done until now in order to create the vectors is the following, although I am not sure if it is correct:

d <- read.csv("myfile.csv", header=T, row.names=1)
all_genes <- row.names(d)
DE_genes <- all_genes[which(d$padj<0.05)]
head(DE_genes)

I will really appreciate if someone can give me the scrips/code in order to do goseq analysis, starting from my file.csv with all the genes.

 

Thank you

goseq DESEQ2 RSTUDIO CODE • 4.5k views
ADD COMMENT
2
Entering edit mode
@steve-lianoglou-2771
Last seen 13 months ago
United States

It's helpful if you showed us the output of your call to `head`. So you have this:

d <- read.csv("myfile.csv", header=T, row.names=1)
all_genes <- row.names(d)
DE_genes <- all_genes[which(d$padj<0.05)]
head(DE_genes)

And what you really need is for DE_genes to be a vector of 0 and 1's, and its `names()` set to the geneIDs. You'd get that by doing something like this:

DE_genes <- as.integer(d$padj <= 0.05)
names(DE_genes) <- rownames(d)

Then you can take it from there ...

ADD COMMENT
0
Entering edit mode

thank you very much!!!

but actually I don't really know how to continue now. As I said I have just started to use R and I don't know how to continue to do the goseq analysis. Could you help me with the code? In my previous analysis I aligned my RNA sequences against the bovine ensembl82 genome

ADD REPLY
0
Entering edit mode

The goseq vignette you linked to is pretty thorough, so you should read that very closely.

I'm not sure what type of help I can provide that's better than what is in there in the absence of more specific questions from you.

Is your organism listed in the output from `goseq::supportedGenomes`?

If not, you need to get more acquainted with how goseq is working, in which case you should read the sections entitled  "Non-native Gene Identifier or category test" (section 5) and "Understanding goseq internals" (section 6.5.7) with particular care.

 

ADD REPLY
0
Entering edit mode

I am using the bovine Ensembl 82, I think it is not listed. But how can I supply the length data and category mappings? Which is the code?

The help I would like to have if that someone can provide me the code in order to perform the goseq analysis starting from an excel with the DEGs (in ensembl transcript ID format).

ADD REPLY
0
Entering edit mode

I'm sorry, but I just don't have the time to provide you with code to do your entire analysis. I already pointed you to the relevant parts of the vignette that instructs you how to enter custom gene lengths and gene <-> GO mappings.

The vignette assumes that you have this information "somewhere." Is the problem that you think the vignette is unclear on how to proceed given that you have this information, or that you need help in figuring out where to even get the gene lengths and the GO mappings to begin with?

ADD REPLY
0
Entering edit mode

Thank you very much! I am not very familiar with creating vectors or data frames but that should be easy to learn. What I don't understand is how I am supposed to write the length of each gene, as I have thousands!

However, the annotation of the genome that I am using is available on Ensembl, so I could do this: "Any organism for which there is an annotation on either Ensembl or the UCSC, can be easily turned into length data using the GenomicFeatures package. To do this, first create a TranscriptDb object using either makeTxDbFromBiomart or makeTxDbFromUCSC (see the help in the GenomicFeatures package on using these commands). Once you have a transcriptDb object, you can get a vector named by gene ID containing the median transcript length of each gene simply by using the command.

> txsByGene=transcriptsBy(txdb,"gene")

> lengthData=median(width(txsByGene)) "

I am just stuck at creating a TranscriptDb object.

ADD REPLY
0
Entering edit mode

Well, you must have used some gene models from somewhere in order to count the number of reads that land to each gene. Those same models will give you the first approximation of how long your genes are.

Since you are so new to both R and bioinformatics, I think you'll find your time better served by reading up on some R tutorials for basic programming (or take an online class/MOOC ... there are so many, now).

After mastering the basics of R programming (or, at least, feeling comfortable with it), I'd then spend time to work on your specific task at hand.

Given your immediate need to analyze your current data, you might think that this is an unnecessary long road to take, but hear me now and believe me later: you will most assuredly get to a proper result by taking your time to initially understand basic R programming concepts first, rather then trying to take random stabs-at-the-dark to get some result for your immediate question (which, no offense, will likely have it's own errors, and therefore be wrong).

You'll also have the added bonus of being able to explain your results and the exact details of how you got them to whoever you are sharing your insights with.

 

ADD REPLY

Login before adding your answer.

Traffic: 634 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6