Question: goseq code after DESeq2 -NO IDEA!
0
3.5 years ago by
amyfm0
Ireland
amyfm0 wrote:

Hi,

After DESeq2 analysis of my RNAseq data  in order to obtain differentially expressed genes between 2 cell types, I have a csv file with approximatelly 26000 genes, of which around 6000 genes are differentially expressed (padjustedvalue < 0.05). Now I want to use goseq in order to study pathways.

I am a very amateur R user and I am starting to learn about all this, so I have no idea what I have to do now. By checking this website http://www.bioconductor.org/packages/release/bioc/vignettes/goseq/inst/doc/goseq.pdf, I understand that I have to create 2 vectors, one with all my genes, and other with the DEGs. After that, I get lost. What I have done until now in order to create the vectors is the following, although I am not sure if it is correct:

all_genes <- row.names(d)
DE_genes <- all_genes[which(d$padj<0.05)] head(DE_genes) I will really appreciate if someone can give me the scrips/code in order to do goseq analysis, starting from my file.csv with all the genes. Thank you goseq deseq2 rstudio code • 1.7k views ADD COMMENTlink modified 3.5 years ago by Steve Lianoglou12k • written 3.5 years ago by amyfm0 Answer: goseq code after DESeq2 -NO IDEA! 1 3.5 years ago by Denali Steve Lianoglou12k wrote: It's helpful if you showed us the output of your call to head. So you have this: d <- read.csv("myfile.csv", header=T, row.names=1) all_genes <- row.names(d) DE_genes <- all_genes[which(d$padj<0.05)]
head(DE_genes)

And what you really need is for DE_genes to be a vector of 0 and 1's, and its names() set to the geneIDs. You'd get that by doing something like this:

DE_genes <- as.integer(d\$padj <= 0.05)
names(DE_genes) <- rownames(d)

Then you can take it from there ...

thank you very much!!!

but actually I don't really know how to continue now. As I said I have just started to use R and I don't know how to continue to do the goseq analysis. Could you help me with the code? In my previous analysis I aligned my RNA sequences against the bovine ensembl82 genome

The goseq vignette you linked to is pretty thorough, so you should read that very closely.

I'm not sure what type of help I can provide that's better than what is in there in the absence of more specific questions from you.

Is your organism listed in the output from goseq::supportedGenomes?

If not, you need to get more acquainted with how goseq is working, in which case you should read the sections entitled  "Non-native Gene Identifier or category test" (section 5) and "Understanding goseq internals" (section 6.5.7) with particular care.

I am using the bovine Ensembl 82, I think it is not listed. But how can I supply the length data and category mappings? Which is the code?

The help I would like to have if that someone can provide me the code in order to perform the goseq analysis starting from an excel with the DEGs (in ensembl transcript ID format).

I'm sorry, but I just don't have the time to provide you with code to do your entire analysis. I already pointed you to the relevant parts of the vignette that instructs you how to enter custom gene lengths and gene <-> GO mappings.

The vignette assumes that you have this information "somewhere." Is the problem that you think the vignette is unclear on how to proceed given that you have this information, or that you need help in figuring out where to even get the gene lengths and the GO mappings to begin with?

Thank you very much! I am not very familiar with creating vectors or data frames but that should be easy to learn. What I don't understand is how I am supposed to write the length of each gene, as I have thousands!

However, the annotation of the genome that I am using is available on Ensembl, so I could do this: "Any organism for which there is an annotation on either Ensembl or the UCSC, can be easily turned into length data using the GenomicFeatures package. To do this, first create a TranscriptDb object using either makeTxDbFromBiomart or makeTxDbFromUCSC (see the help in the GenomicFeatures package on using these commands). Once you have a transcriptDb object, you can get a vector named by gene ID containing the median transcript length of each gene simply by using the command.

> txsByGene=transcriptsBy(txdb,"gene")

> lengthData=median(width(txsByGene)) "

I am just stuck at creating a TranscriptDb object.

Well, you must have used some gene models from somewhere in order to count the number of reads that land to each gene. Those same models will give you the first approximation of how long your genes are.

Since you are so new to both R and bioinformatics, I think you'll find your time better served by reading up on some R tutorials for basic programming (or take an online class/MOOC ... there are so many, now).

After mastering the basics of R programming (or, at least, feeling comfortable with it), I'd then spend time to work on your specific task at hand.

Given your immediate need to analyze your current data, you might think that this is an unnecessary long road to take, but hear me now and believe me later: you will most assuredly get to a proper result by taking your time to initially understand basic R programming concepts first, rather then trying to take random stabs-at-the-dark to get some result for your immediate question (which, no offense, will likely have it's own errors, and therefore be wrong).

You'll also have the added bonus of being able to explain your results and the exact details of how you got them to whoever you are sharing your insights with.