Question

Gene Ontology Analysis of DE Genes

0

Entering edit mode

vanbelj ▴ 30

@vanbelj-21216

Last seen 2.0 years ago

United States

I've used DESeq2 to identify Differentially Expressed genes in an RNA-Seq dataset from S. cerevisiae, budding yeast. I would like to identify Gene Ontology terms that are over represented in this set of DE genes. AnnotationDbi and its object-types are new to me, and I'm an R novice - the struggle is real. Thanks for any help.

My sequencing reads were mapped to the genome build/version sacCer3. sacCer3 was released in 2014.

I've tried goseq, which has built in support for sacCer2 but not sacCer3 (oy vey!). I was able to run my dataset using sacCer2, but are GO Annotations for sacCer2 in goseq up to date? I've tried to make my own genome/object for use with goseq. I have a vector of gene names and lengths, but I cannot figure out how to incorporate GO Terms and then make a suitable 'object' that can be fed into goseq. I've read the User Guide, but still haven't been able to figure this out.

I figured I'd try clusterProfiler instead. However, I noticed that org.Sc.sgd.db is showing ~1300 fewer genes than my list from sacCer3 (see code).
I assume that my provided gene list must have a matching name within org.Sc.sgd.db in order for associated GO Terms to be assigned properly, so a bit concerned.

1) Is org.Sc.sgd.db going to assign GO Terms to all of my genes - or do I need to use a different naming convention? 2) Are there truly 1300 ENSEMBL gene names missing from org.Sc.sgd.db, or am I looking at this the wrong way? If annotations are missing, I can email a list of annotations to be included in the next update. Looks like it's maintained by Bioconductor.

I've confirmed that the gene-name convention is consistent in my comparisons (see code).

#My list of Genes from sacCer3, downloaded from SGD (yeastgenome.org)
> length(SGD_YeastGenes_NAr$Systematic_Name) 
[1] 7143
> head(SGD_YeastGenes_NAr$Systematic_Name)
[1] "YAL001C" "YAL002W" "YAL003W" "YAL005C" "YAL007C" "YAL008W"

#Genes from goseq sacCer2 length table
> length(sacCer2.ensGene.LENGTH$Gene) 
[1] 7124
> head(sacCer2.ensGene.LENGTH$Gene)
[1] "YAL012W"   "YAL069W"   "YAL068W-A" "YAL068C"   "YAL067W-A" "YAL067C" 

#Genes from org.Sc.sgd.db
> length(keys(org.Sc.sgd.db, keytype="ENSEMBL")) 
[1] 5804
> head(keys(org.Sc.sgd.db, keytype="ENSEMBL"))
[1] "YAL068C"   "YGL261C"   "YAL067W-A" "YAL067C"   "YAL065C"   "YAL064W-B"

#How many genes are in my list, that are not present in the org.Sc.sgd.db ENSEMBL list?
> length(setdiff(SGD_YeastGenes_NAr$Systematic_Name, keys(org.Sc.sgd.db, keytype="ENSEMBL")))
[1] 1339

#How many genes are in the org.Sc.sgd.db ENSEMBL list that are not present in my sacCer3 list?
> length(setdiff(keys(org.Sc.sgd.db, keytype="ENSEMBL"), SGD_YeastGenes_NAr$Systematic_Name))
[1] 0

#Ran my DE Genes through clusterProfiler anyways...
#Get a list of DE gene names
results.df <- as.data.frame(resOrdered) %>% #resOrdered is a DESeqResults object
  rownames_to_column(var = "Systematic_Name") %>%
  filter(!is.na(log2FoldChange)) %>%
  filter(!is.na(padj))
gene_list <- results.df$Systematic_Name[results.df$padj < 0.05 & results.df$log2FoldChange != 0]
> head(gene_list)
[1] "YAL036C" "YAL061W" "YBL039C" "YBL042C" "YBL064C" "YBR072W"

#Run the list with clusterProfiler::enrichGO
> ego <- enrichGO(gene          = gene_list,
+                 OrgDb         = org.Sc.sgd.db,
+                 keyType       = "ENSEMBL",
+                 ont           = "CC",
+                 pAdjustMethod = "BH",
+                 pvalueCutoff  = 0.01,
+                 qvalueCutoff  = 0.05,
+                 readable      = FALSE)

> head(ego) #I've only copied a portion.
                   ID                          Description GeneRatio  BgRatio       pvalue     p.adjust
GO:0030684 GO:0030684                          preribosome  168/4255 171/5794 2.137353e-19 4.484091e-17
GO:0005730 GO:0005730                            nucleolus  273/4255 292/5794 2.281980e-19 4.484091e-17
GO:0022626 GO:0022626                   cytosolic ribosome  146/4255 156/5794 6.381083e-11 8.359219e-09
GO:0030686 GO:0030686                      90S preribosome   89/4255  92/5794 2.457709e-09 2.414699e-07
GO:0030687 GO:0030687 preribosome, large subunit precursor   62/4255  62/5794 4.319239e-09 3.394922e-07
GO:0032040 GO:0032040             small-subunit processome   53/4255  53/5794 7.181073e-08 4.703603e-06

My sessionInfo( ) is ridiculous - 3 days of attempts/packages; omitting for now.

clusterProfiler AnnotationDbi OrgDb goseq • 3.1k views

ADD COMMENT • link updated 5.1 years ago by James W. MacDonald 68k • written 5.1 years ago by vanbelj ▴ 30

1

Entering edit mode

Where did you get the genome you are using, and how did you get the gene counts (e.g., what annotation service did you use to get the gene identifiers and locations)? In general you want to stick to one annotation service, so that part matters. FWIW, the org.Sc.sgd.db package is based on data we download from yeastgenome.org, and the central ID for that package is the SGD ID. So if you want to use that to map to GO terms, your best bet is to start with SGD IDs.

ADD REPLY • link 5.1 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks James.

org.Sc.sgd.db contains identifiers within its SGD column for my entire list of genes, and I was able to successfully run clusterProfiler with these values. There are actually an additional ~9000 gene IDs that aren't in my gene list; I'll have to explore these later to see if I'm missing any relevant annotations.

I downloaded my gene table from SGD. It was the latest sacCer3 version at the time (R64-2-1). What link do you use to pull your data for the org.Sc.sgd.db package? I will speak with SGD and see if they can add/complete their Systematic Names in the resource you download so that the ENSEMBL column is complete in future iterations of the package.

Posted the code below for others' future reference.

> head(gene_list)
[1] "S000000034" "S000000057" "S000000135" "S000000138" "S000000160" "S000000276"

> head(keys(org.Sc.sgd.db, keytype="SGD"))
[1] "S000000001" "S000000002" "S000000003" "S000000004" "S000000005" "S000000006"

> ego <- enrichGO(gene          = gene_list,
+                 OrgDb         = org.Sc.sgd.db,
+                 keyType       = "SGD",
+                 ont           = "CC",
+                 pAdjustMethod = "BH",
+                 pvalueCutoff  = 0.01,
+                 qvalueCutoff  = 0.05,
+                 readable      = FALSE)

ADD REPLY • link 5.1 years ago vanbelj ▴ 30

score 1 · Answer 1 · 2021-01-29

The data for the main table in org.Sc.sgd.db comes from the gene_association.sgd.gz file. You could assume that SGD is somehow missing the Ensembl Gene ID in that file, but that is likely not correct. They probably do have all the Ensembl Gene IDs that they can get from Ensembl; it's way more likely that SGD says 'This right here is a gene' and Ensembl says 'Not to us it isn't', in which case there won't be an Ensembl Gene ID.

This is a relatively common phenomenon that can arise when two groups are trying to define (for example) what all the genes are, and where they are located in the genome. Unless you have the same exact criteria for making those distinctions, you should expect disagreements.