I want to map probe IDs of the Affymetrix HG-U133_Plus_2 array to Ensembl gene IDs using the package hgu133plus2.db. Many genes have multiple probe IDs assigned to them, and some probe IDs map to multiple Ensembl gene IDs (I am removing those for now). What is the best approach for picking the single probe whose expression value best represents a gene? Or is aggregating all probes of a gene by their mean the better way to go? I guess one could make the selection using the probe ID suffixes.
Suffixes included in hgu133plus2.db: "s_at" "at" "g_at" "i_at" "f_at" "a_at" "x_at" "r_at" "3_at" "5_at" "M_at" "MA_at" "MB_at" "alu_at"
library(hgu133plus2.db)
library(stringi)
# Map every probe ID on the array to its Ensembl gene ID(s)
anno = AnnotationDbi::select(hgu133plus2.db,
                             keys = keys(hgu133plus2.db, keytype = "PROBEID"),
                             keytype = "PROBEID",
                             columns = c("ENSEMBL"))
# Suffix = everything after the first underscore of the probe ID
# (stri_split_fixed is vectorized, so no lapply is needed)
suffixes = unique(stringi::stri_split_fixed(str = anno$PROBEID, pattern = "_", n = 2, simplify = TRUE)[, 2])
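
To make the two options from the question concrete, here is a minimal base R sketch on made-up data. The probe IDs, gene IDs, and expression values are all hypothetical, and "keep the probe with the highest mean expression" is only one possible selection rule that I am assuming for illustration, not necessarily the best one.

```r
# Toy expression matrix: 5 hypothetical probes x 3 samples (values are made up)
expr <- matrix(
  c(5, 6, 7,
    9, 9, 9,
    2, 3, 4,
    8, 7, 9,
    1, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("p1_at", "p2_s_at", "p3_x_at", "p4_at", "p5_at"),
                  c("s1", "s2", "s3"))
)

# Hypothetical probe-to-gene map in the same shape as `anno` above;
# p5_at maps to two genes and should be dropped
anno <- data.frame(
  PROBEID = c("p1_at", "p2_s_at", "p3_x_at", "p4_at", "p5_at", "p5_at"),
  ENSEMBL = c("geneA", "geneA", "geneB", "geneB", "geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Drop unmapped probes and probes that map to more than one gene
anno <- unique(anno[!is.na(anno$ENSEMBL), ])
n_genes <- table(anno$PROBEID)
anno <- anno[anno$PROBEID %in% names(n_genes)[n_genes == 1], ]
anno <- anno[anno$PROBEID %in% rownames(expr), ]

sub  <- expr[anno$PROBEID, , drop = FALSE]
gene <- anno$ENSEMBL

# Option 1: average all probes of a gene, separately for each sample
sums    <- rowsum(sub, group = gene)
counts  <- table(gene)[rownames(sums)]
by_mean <- sums / as.vector(counts)

# Option 2: keep, per gene, the probe with the highest mean across samples
avg  <- rowMeans(sub)
best <- vapply(split(names(avg), gene),
               function(ids) ids[which.max(avg[ids])],
               character(1))
by_max <- sub[best, , drop = FALSE]
rownames(by_max) <- names(best)

print(by_mean)
print(best)
```

Both results are gene-by-sample matrices, so they can be compared directly. A suffix-based rule would replace the `which.max` step with a preference order over the suffixes.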