I've used DESeq2 to identify Differentially Expressed genes in an RNA-Seq dataset from S. cerevisiae, budding yeast. I would like to identify Gene Ontology terms that are over represented in this set of DE genes. AnnotationDbi and its object-types are new to me, and I'm an R novice - the struggle is real. Thanks for any help.
My sequencing reads were mapped to the genome build/version
sacCer3. sacCer3 was released in 2014.
goseq, which has built in support for sacCer2 but not sacCer3 (oy vey!). I was able to run my dataset using sacCer2, but are GO Annotations for sacCer2 in goseq up to date? I've tried to make my own genome/object for use with goseq. I have a vector of gene names and lengths, but I cannot figure out how to incorporate GO Terms and then make a suitable 'object' that can be fed into
goseq. I've read the User Guide, but still haven't been able to figure this out.
I figured I'd try
clusterProfiler instead. However, I noticed that
org.Sc.sgd.db is showing ~1300 fewer genes than my list from sacCer3 (see code).
I assume that my provided gene list must have a matching name within
org.Sc.sgd.db in order for associated GO Terms to be assigned properly, so a bit concerned.
org.Sc.sgd.db going to assign GO Terms to all of my genes - or do I need to use a different naming convention?
2) Are there truly 1300 ENSEMBL gene names missing from
org.Sc.sgd.db, or am I looking at this the wrong way? If annotations are missing, I can email a list of annotations to be included in the next update. Looks like it's maintained by Bioconductor.
I've confirmed that the gene-name convention is consistent in my comparisons (see code).
#My list of Genes from sacCer3, downloaded from SGD (yeastgenome.org) > length(SGD_YeastGenes_NAr$Systematic_Name)  7143 > head(SGD_YeastGenes_NAr$Systematic_Name)  "YAL001C" "YAL002W" "YAL003W" "YAL005C" "YAL007C" "YAL008W" #Genes from goseq sacCer2 length table > length(sacCer2.ensGene.LENGTH$Gene)  7124 > head(sacCer2.ensGene.LENGTH$Gene)  "YAL012W" "YAL069W" "YAL068W-A" "YAL068C" "YAL067W-A" "YAL067C" #Genes from org.Sc.sgd.db > length(keys(org.Sc.sgd.db, keytype="ENSEMBL"))  5804 > head(keys(org.Sc.sgd.db, keytype="ENSEMBL"))  "YAL068C" "YGL261C" "YAL067W-A" "YAL067C" "YAL065C" "YAL064W-B" #How many genes are in my list, that are not present in the org.Sc.sgd.db ENSEMBL list? > length(setdiff(SGD_YeastGenes_NAr$Systematic_Name, keys(org.Sc.sgd.db, keytype="ENSEMBL")))  1339 #How many genes are in the org.Sc.sgd.db ENSEMBL list that are not present in my sacCer3 list? > length(setdiff(keys(org.Sc.sgd.db, keytype="ENSEMBL"), SGD_YeastGenes_NAr$Systematic_Name))  0 #Ran my DE Genes through clusterProfiler anyways... #Get a list of DE gene names results.df <- as.data.frame(resOrdered) %>% #resOrdered is a DESeqResults object rownames_to_column(var = "Systematic_Name") %>% filter(!is.na(log2FoldChange)) %>% filter(!is.na(padj)) gene_list <- results.df$Systematic_Name[results.df$padj < 0.05 & results.df$log2FoldChange != 0] > head(gene_list)  "YAL036C" "YAL061W" "YBL039C" "YBL042C" "YBL064C" "YBR072W" #Run the list with clusterProfiler::enrichGO > ego <- enrichGO(gene = gene_list, + OrgDb = org.Sc.sgd.db, + keyType = "ENSEMBL", + ont = "CC", + pAdjustMethod = "BH", + pvalueCutoff = 0.01, + qvalueCutoff = 0.05, + readable = FALSE) > head(ego) #I've only copied a portion. ID Description GeneRatio BgRatio pvalue p.adjust GO:0030684 GO:0030684 preribosome 168/4255 171/5794 2.137353e-19 4.484091e-17 GO:0005730 GO:0005730 nucleolus 273/4255 292/5794 2.281980e-19 4.484091e-17 GO:0022626 GO:0022626 cytosolic ribosome 146/4255 156/5794 6.381083e-11 8.359219e-09 GO:0030686 GO:0030686 90S preribosome 89/4255 92/5794 2.457709e-09 2.414699e-07 GO:0030687 GO:0030687 preribosome, large subunit precursor 62/4255 62/5794 4.319239e-09 3.394922e-07 GO:0032040 GO:0032040 small-subunit processome 53/4255 53/5794 7.181073e-08 4.703603e-06
sessionInfo( ) is ridiculous - 3 days of attempts/packages; omitting for now.