I wrote the parser that was used to generate that ChipDb package. The basic idea is that you first need a text file that has the probeset ID in one column and another ID in the second column. I chose to use the GenBank and RefSeq IDs that are provided in the mrna_assignment column. After generating that file, I used makeDBPackage from the AnnotationForge package to make the clariomdhumantranscriptcluster.db package. Part of that process involves mapping the GenBank and RefSeq IDs to NCBI Gene IDs, which is probably where the differences arose. IIRC, the last time we actually generated these packages was maybe 2015 or so, and in the intervening period we have simply been updating the version number. This is mainly because A.) Affy have not updated their files since 2013, and B.) almost nobody uses microarrays any longer, so our efforts have been directed towards more modern methods.
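To make the mrna_assignment step concrete, here is a minimal sketch of pulling GenBank and RefSeq accessions out of one value of that column. It is not the original parser. The input string is made up, and the layout (entries separated by " /// ", fields by " // ", accession first and source second) is an assumption you should check against your own file.

```r
# Toy mrna_assignment value (made up); the field layout is an assumption:
# entries separated by " /// ", fields by " // ", accession first, source second.
x <- paste(
  "NM_000001 // RefSeq // toy description",
  "ENST00000000001 // ENSEMBL // toy description",
  "AB000001 // GenBank // toy description",
  sep = " /// "
)

# Split into entries, then into fields
entries <- strsplit(x, " /// ", fixed = TRUE)[[1]]
fields <- strsplit(entries, " // ", fixed = TRUE)

# Accession and source for each entry
acc <- vapply(fields, function(f) f[1], character(1))
src <- vapply(fields, function(f) f[2], character(1))

# Keep only the GenBank and RefSeq accessions
ids <- unique(acc[src %in% c("RefSeq", "GenBank")])
ids
```

Each probeset ID would then be paired with every accession kept for it, giving the two-column text file described above.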
Looking back at the code I used, I originally parsed out the NCBI Gene IDs from the gene_assignment column of the CSV file, but then switched to using the mrna_assignment column. To use the gene_assignment column instead, you could get the CSV file from ThermoFisher and use this code:
parseCsvFiles <- function(csv, fname){
    ## column names in the annotation CSV, as mentioned above
    rna <- "mrna_assignment"
    dna <- "gene_assignment"
    dat <- read.csv(csv, comment.char = "#", stringsAsFactors = FALSE, na.strings = "---")
    if(!all(c(rna, dna) %in% names(dat)))
        stop("Check the headers for file ", csv, "; they don't include ", rna, " and ", dna, "!")
    ## entries are separated by " /// " and fields by " // "; keep the last field (the Gene ID)
    egs <- lapply(strsplit(dat[,dna], " /// "), function(x) sapply(strsplit(x, " // "), function(y) y[length(y)]))
    ## drop duplicate IDs and placeholder values within each probeset
    egs <- lapply(egs, function(x) x[!duplicated(x) & x != "---"])
    ## one row per probeset/Gene ID pair
    egs <- data.frame(probeids = rep(dat[,1], sapply(egs, length)), egids = unlist(egs))
    ## add back missing probesets
    toadd <- data.frame(probeids = dat[!dat[,1] %in% egs[,1],1], egids = rep(NA, sum(!dat[,1] %in% egs[,1])))
    egs <- rbind(egs, toadd)
    ## tab-delimited, no header, empty string for missing IDs
    write.table(egs, fname, sep = "\t", na = "", row.names = FALSE, col.names = FALSE, quote = FALSE)
}
library(BiocManager)
install("human.db0")
parseCsvFiles("Clariom_D_Human.na36.hg38.transcript.csv", "text.txt")
library(AnnotationForge)
makeDBPackage("HUMANCHIP_DB", affy = FALSE, prefix = "clariomdhumantranscriptcluster", fileName = "text.txt", baseMapType = "eg", version = "0.0.1", manufacturer = "Affymetrix", chipName = "clariomdhuman")
install.packages("clariomdhumantranscriptcluster.db", repos = NULL) ## if you are on Windows, add type = "source"
I didn't test that code, so caveat emptor. You may need to play around with it, but it should be close.
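One way to play around with it before touching the real file is to run the same parsing steps on a tiny stand-in CSV. The sketch below is a sanity check under stated assumptions, not a test of the actual annotation file: the rows, IDs, and header name for the first column are all made up, and it assumes the Gene ID is the last " // " field of each " /// " entry in gene_assignment.

```r
# Write a toy annotation CSV (all values made up) to a temporary file
csv <- tempfile(fileext = ".csv")
writeLines(c(
  "#%toy-header=ignored",
  '"probeset_id","gene_assignment"',
  '"TC01","NM_A // GENEA // toy gene A // 1p1 // 111 /// ENST_A // GENEA // toy gene A // 1p1 // 111"',
  '"TC02","NM_B // GENEB // toy gene B // 2q2 // 222 /// NM_C // GENEC // toy gene C // 3p3 // 333"',
  '"TC03",---'
), csv)

# Same read.csv settings as in the code above
dat <- read.csv(csv, comment.char = "#", stringsAsFactors = FALSE, na.strings = "---")

# Last " // " field of every " /// " entry, de-duplicated; nothing for missing values
last_field <- function(x) {
  if (is.na(x)) return(character(0))
  parts <- strsplit(strsplit(x, " /// ", fixed = TRUE)[[1]], " // ", fixed = TRUE)
  unique(vapply(parts, function(p) p[length(p)], character(1)))
}
ids <- lapply(dat$gene_assignment, last_field)

# One row per probeset/Gene ID pair, then add back probesets without any ID
tab <- data.frame(probeids = rep(dat[, 1], lengths(ids)), egids = unlist(ids),
                  stringsAsFactors = FALSE)
missing <- setdiff(dat[, 1], tab$probeids)
tab <- rbind(tab, data.frame(probeids = missing,
                             egids = rep(NA_character_, length(missing)),
                             stringsAsFactors = FALSE))
tab
```

If the table has one row per probeset and Gene ID pair, plus one row with a missing ID for each unannotated probeset, the parsing logic is behaving as intended on this layout.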
Thank you very much! I've also received the annotation file from ThermoFisher, so I can just merge my expression set with that file.
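A minimal sketch of that kind of merge in base R, assuming the expression values have already been pulled into a data frame with the probeset ID as a column. The data, the column names (probeset, symbol), and the sample names are all hypothetical.

```r
# Toy expression values, one row per probeset (made up)
expr <- data.frame(probeset = c("TC01", "TC02", "TC03"),
                   sample1 = c(5.1, 7.3, 6.0),
                   sample2 = c(5.4, 7.0, 6.2),
                   stringsAsFactors = FALSE)

# Toy annotation table (made up); TC03 has no annotation
annot <- data.frame(probeset = c("TC01", "TC02"),
                    symbol = c("GENEA", "GENEB"),
                    stringsAsFactors = FALSE)

# Left join: keep every probeset from the expression data
merged <- merge(expr, annot, by = "probeset", all.x = TRUE)
merged
```

Using all.x = TRUE keeps probesets that have no annotation, with NA in the annotation columns, rather than silently dropping them.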