Hello, I'm a new (<4months) R user working in an academic lab, and trying to do some RNAseq analysis for an experiment. I am using DESeq2 to compare expression levels.
What is the best way to integrate a DESeqDataSet with TopGO or another gene ontology enrichment package?
After creating the DESeqDataSet, I would like to use a list of the top differentially expressed genes from an RNAseq run to get a list of GO terms, and then to enrich that list of GO terms to the most likely parent. Ideally, I would be able to use accessions from the same assembly the libraries were aligned to. I am working with S. aureus USA300 and an S. cerevisiae S288C variant.
Does anyone have any pointers about how to approach this best, or resources that would be helpful in this project?
Thanks!
Just to add to James' suggestion, if you look at the goseq vignette, it says:
"In order to perform a GO analysis of your RNA-seq data, goseq only requires a simple named vector, which contains two pieces of information.
1. Measured genes: all genes for which RNA-seq data was gathered for your experiment. Each element of your vector should be named by a unique gene identifier.
2. Differentially expressed genes: each element of your vector should be either a 1 or a 0, where 1 indicates that the gene is differentially expressed and 0 that it is not."
So this can be obtained with:
From here you can continue, as in Section 2 for example.
Hi, Michael,
Thank you for the script here. I have a question: why did you suggest to turn off independent filtering when generating the result for GOseq analysis? Will this have any effect on the final list of DE genes and their p values?
Yes, filtering affects the final list of DE genes. I'd recommend you pick a threshold by looking at the MA plot. The x-axis is the mean of normalized counts across samples, so you may want to use row mean instead of row sum as used above.
My reason for not using the built-in filtering in the code example above is that the GOseq model is not perhaps expecting that some of the adjusted p-values be NA. And so It's perhaps safer for the hand-off between packages that all genes have the same chance at being a member of de.genes.
I use the standard DEseq2 pipeline to generate the DE list for follow-up study. What if I just simply remove the genes with p-value of NA and then run GOseq? Wouldn't this way be more consistent with the DE result that uses independent filtering automatically?
Sorry if my question does not make sense, I am a biologist trying to do all the informatics analysis myself.
Sure, that should work as well. The significant genes are those with adjusted p-value < x and the universe is the set of genes with adjusted p-value not NA.
Thank you for your fast reply!! It's very helpful!