Search
Tutorial: Update gene symbols in GeneSetCollection
0
gravatar for t.kuilman
3 months ago by
t.kuilman140
Netherlands
t.kuilman140 wrote:

I recently run into the problem that the gene annotation in MSigDB is not up to date (for example, the alias C14orf129 is used as a gene name in various gene sets, while the official HGNC symbol would be GSKIP). To fix this I have written a function to update all gene symbols:

 

alias.to.hgnc.gene.set.collection <- function(gene.set.collection, verbose = FALSE) {
  if (class(gene.set.collection) != "GeneSetCollection") {
    stop("Input is not a GeneSetCollection")
  }

  ## Get alias list using Duff's answer at https://www.biostars.org/p/14971/
  library(org.Hs.eg.db)
  dbCon <- org.Hs.eg_dbconn()
  sqlQuery <- "SELECT * FROM alias, gene_info WHERE alias._id == gene_info._id;"
  library(DBI)
  aliasSymbol <- dbGetQuery(dbCon, sqlQuery)

  ## Make a lookup table (aliasSymbol) with messages included
  ## Alias_symbol column ALWAYS contains also official HGNC symbol for every gene
  aliasSymbol <- aliasSymbol[, c("alias_symbol", "symbol")]
  aliasSymbol <- aliasSymbol[!duplicated(aliasSymbol), ]
  aliasSymbol[, "message"] <- ""
  duplicated.aliases <- unique(aliasSymbol$alias_symbol[duplicated(aliasSymbol$alias_symbol)])
  # Remove second line for following example:
  # alias_symbol  symbol
  # ACTBP2        ACTBP2
  # ACTBP2        ACTBP8
  indices <- unlist(lapply(intersect(duplicated.aliases, aliasSymbol$symbol), function(x) {
    tmp.index <- which(aliasSymbol$alias_symbol == x)
    return(tmp.index[aliasSymbol$symbol[tmp.index] != x])
  }))
  aliasSymbol <- aliasSymbol[-indices, ]
  # Remove second line for following example:
  # alias_symbol symbol
  # DUX4L        DUX4
  # DUX4L        LOC112268343
  indices <- NULL
  for (gene in setdiff(duplicated.aliases, aliasSymbol$symbol)) {
    tmp.index <- which(aliasSymbol$alias_symbol == gene)
    aliasSymbol$message[tmp.index[1]] <- paste0("Ambiguous alias name detected: ", gene,
                             "; NA introduced")
    aliasSymbol$symbol[tmp.index[1]] <- NA
    indices <- c(indices, tmp.index[2:length(tmp.index)])
  }
  aliasSymbol <- aliasSymbol[-indices, ]

  ## Use lookup table to create newly annotated gene set collection
  new.gene.set.collection <- lapply(gene.set.collection, function(x) {
    genes <- geneIds(x)
    matched <- match(genes, aliasSymbol$alias_symbol)
    if (verbose & sum(is.na(matched)) > 0) {
      message(paste0("Missing alias / HGNC name detected: ", genes[which(is.na(matched))],
                     "; NA introduced", collapse = "\n"))
    }
    if (verbose) {
      messages <- aliasSymbol$message[matched]
      messages <- messages[messages != ""]
      messages <- messages[!is.na(messages)]
      if (length(messages) > 0) {
        message(paste0(messages, collapse = "\n"))
      }
    }
    new.genes <- aliasSymbol$symbol[matched]
    if (verbose & !all(table(new.genes, useNA = "ifany") == 1)) {
      tab <- table(new.genes, useNA = "ifany")
      message(paste0("Duplicate HGNC symbol detected for ", names(tab[tab > 1]),
                     " (", tab[tab > 1], " times)"))
    }
    geneIds(x) <- unique(as.character(new.genes))
    x
  })
  return(GeneSetCollection(new.gene.set.collection))
}

 

Arguments are gene.set.collection, which needs to be of class GeneSetCollection (library: GSEABase; Bioconductor), as well as verbose (default: false). The function requires libraries org.Hs.eg.db (Bioconductor) and DBI (CRAN). While creating the lookup table might not be the fastest way for conversion of smaller GeneSetCollections, it dramatically speeds up performance on bigger GeneSetCollections. I hope this is helpful to you.

ADD COMMENTlink modified 3 months ago • written 3 months ago by t.kuilman140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 143 users visited in the last hour