Tutorial:Update gene symbols in GeneSetCollection
0
1
Entering edit mode
t.kuilman ▴ 170
@tkuilman-6868
Last seen 2.5 years ago
Netherlands

I recently run into the problem that the gene annotation in MSigDB is not up to date (for example, the alias C14orf129 is used as a gene name in various gene sets, while the official HGNC symbol would be GSKIP). To fix this I have written a function to update all gene symbols:

 

alias.to.hgnc.gene.set.collection <- function(gene.set.collection, verbose = FALSE) {
  if (class(gene.set.collection) != "GeneSetCollection") {
    stop("Input is not a GeneSetCollection")
  }

  ## Get alias list using Duff's answer at https://www.biostars.org/p/14971/
  library(org.Hs.eg.db)
  dbCon <- org.Hs.eg_dbconn()
  sqlQuery <- "SELECT * FROM alias, gene_info WHERE alias._id == gene_info._id;"
  library(DBI)
  aliasSymbol <- dbGetQuery(dbCon, sqlQuery)

  ## Make a lookup table (aliasSymbol) with messages included
  ## Alias_symbol column ALWAYS contains also official HGNC symbol for every gene
  aliasSymbol <- aliasSymbol[, c("alias_symbol", "symbol")]
  aliasSymbol <- aliasSymbol[!duplicated(aliasSymbol), ]
  aliasSymbol[, "message"] <- ""
  duplicated.aliases <- unique(aliasSymbol$alias_symbol[duplicated(aliasSymbol$alias_symbol)])
  # Remove second line for following example:
  # alias_symbol  symbol
  # ACTBP2        ACTBP2
  # ACTBP2        ACTBP8
  indices <- unlist(lapply(intersect(duplicated.aliases, aliasSymbol$symbol), function(x) {
    tmp.index <- which(aliasSymbol$alias_symbol == x)
    return(tmp.index[aliasSymbol$symbol[tmp.index] != x])
  }))
  aliasSymbol <- aliasSymbol[-indices, ]
  # Remove second line for following example:
  # alias_symbol symbol
  # DUX4L        DUX4
  # DUX4L        LOC112268343
  indices <- NULL
  for (gene in setdiff(duplicated.aliases, aliasSymbol$symbol)) {
    tmp.index <- which(aliasSymbol$alias_symbol == gene)
    aliasSymbol$message[tmp.index[1]] <- paste0("Ambiguous alias name detected: ", gene,
                             "; NA introduced")
    aliasSymbol$symbol[tmp.index[1]] <- NA
    indices <- c(indices, tmp.index[2:length(tmp.index)])
  }
  aliasSymbol <- aliasSymbol[-indices, ]

  ## Use lookup table to create newly annotated gene set collection
  new.gene.set.collection <- lapply(gene.set.collection, function(x) {
    genes <- geneIds(x)
    matched <- match(genes, aliasSymbol$alias_symbol)
    if (verbose & sum(is.na(matched)) > 0) {
      message(paste0("Missing alias / HGNC name detected: ", genes[which(is.na(matched))],
                     "; NA introduced", collapse = "\n"))
    }
    if (verbose) {
      messages <- aliasSymbol$message[matched]
      messages <- messages[messages != ""]
      messages <- messages[!is.na(messages)]
      if (length(messages) > 0) {
        message(paste0(messages, collapse = "\n"))
      }
    }
    new.genes <- aliasSymbol$symbol[matched]
    if (verbose & !all(table(new.genes, useNA = "ifany") == 1)) {
      tab <- table(new.genes, useNA = "ifany")
      message(paste0("Duplicate HGNC symbol detected for ", names(tab[tab > 1]),
                     " (", tab[tab > 1], " times)"))
    }
    geneIds(x) <- unique(as.character(new.genes))
    x
  })
  return(GeneSetCollection(new.gene.set.collection))
}

 

Arguments are gene.set.collection, which needs to be of class GeneSetCollection (library: GSEABase; Bioconductor), as well as verbose (default: false). The function requires libraries org.Hs.eg.db (Bioconductor) and DBI (CRAN). While creating the lookup table might not be the fastest way for conversion of smaller GeneSetCollections, it dramatically speeds up performance on bigger GeneSetCollections. I hope this is helpful to you.

gseabase msigdb genesetcollection Tutorial • 2.2k views
ADD COMMENT

Login before adding your answer.

Traffic: 472 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6