I recently run into the problem that the gene annotation in MSigDB is not up to date (for example, the alias C14orf129 is used as a gene name in various gene sets, while the official HGNC symbol would be GSKIP). To fix this I have written a function to update all gene symbols:
alias.to.hgnc.gene.set.collection <- function(gene.set.collection, verbose = FALSE) { if (class(gene.set.collection) != "GeneSetCollection") { stop("Input is not a GeneSetCollection") } ## Get alias list using Duff's answer at https://www.biostars.org/p/14971/ library(org.Hs.eg.db) dbCon <- org.Hs.eg_dbconn() sqlQuery <- "SELECT * FROM alias, gene_info WHERE alias._id == gene_info._id;" library(DBI) aliasSymbol <- dbGetQuery(dbCon, sqlQuery) ## Make a lookup table (aliasSymbol) with messages included ## Alias_symbol column ALWAYS contains also official HGNC symbol for every gene aliasSymbol <- aliasSymbol[, c("alias_symbol", "symbol")] aliasSymbol <- aliasSymbol[!duplicated(aliasSymbol), ] aliasSymbol[, "message"] <- "" duplicated.aliases <- unique(aliasSymbol$alias_symbol[duplicated(aliasSymbol$alias_symbol)]) # Remove second line for following example: # alias_symbol symbol # ACTBP2 ACTBP2 # ACTBP2 ACTBP8 indices <- unlist(lapply(intersect(duplicated.aliases, aliasSymbol$symbol), function(x) { tmp.index <- which(aliasSymbol$alias_symbol == x) return(tmp.index[aliasSymbol$symbol[tmp.index] != x]) })) aliasSymbol <- aliasSymbol[-indices, ] # Remove second line for following example: # alias_symbol symbol # DUX4L DUX4 # DUX4L LOC112268343 indices <- NULL for (gene in setdiff(duplicated.aliases, aliasSymbol$symbol)) { tmp.index <- which(aliasSymbol$alias_symbol == gene) aliasSymbol$message[tmp.index[1]] <- paste0("Ambiguous alias name detected: ", gene, "; NA introduced") aliasSymbol$symbol[tmp.index[1]] <- NA indices <- c(indices, tmp.index[2:length(tmp.index)]) } aliasSymbol <- aliasSymbol[-indices, ] ## Use lookup table to create newly annotated gene set collection new.gene.set.collection <- lapply(gene.set.collection, function(x) { genes <- geneIds(x) matched <- match(genes, aliasSymbol$alias_symbol) if (verbose & sum(is.na(matched)) > 0) { message(paste0("Missing alias / HGNC name detected: ", genes[which(is.na(matched))], "; NA introduced", collapse = "\n")) } if (verbose) { messages <- aliasSymbol$message[matched] messages <- messages[messages != ""] messages <- messages[!is.na(messages)] if (length(messages) > 0) { message(paste0(messages, collapse = "\n")) } } new.genes <- aliasSymbol$symbol[matched] if (verbose & !all(table(new.genes, useNA = "ifany") == 1)) { tab <- table(new.genes, useNA = "ifany") message(paste0("Duplicate HGNC symbol detected for ", names(tab[tab > 1]), " (", tab[tab > 1], " times)")) } geneIds(x) <- unique(as.character(new.genes)) x }) return(GeneSetCollection(new.gene.set.collection)) }
Arguments are gene.set.collection
, which needs to be of class GeneSetCollection
(library: GSEABase
; Bioconductor), as well as verbose
(default: false). The function requires libraries org.Hs.eg.db
(Bioconductor) and DBI
(CRAN). While creating the lookup table might not be the fastest way for conversion of smaller GeneSetCollections, it dramatically speeds up performance on bigger GeneSetCollections. I hope this is helpful to you.