Question: Help: obtain Entrez ID's from limma top table with Affymetrix microarray (pd.zebgene.1.1.st)
1
3.0 years ago by
mat14940
mat14940 wrote:

Hello,

I am working with Affymetrix 1.1 ST zebrafish gene arrays.  The script I have attached works but the pd.zebgene.1.1.st annotation package for these particular arrays do not contain Entrez ID's for each probeset on the platform.  I have seen that expression sets can be annotated using, for instance, library(affycoretools) and library(hugene11sttranscriptcluster.db).  My problem is that the zebrafish Affy 1.1 ST gene arrays do not contain transcriptcluster.db or probesetcluster.db packages that are available for the human/mouse/rat analogues.  My question is: Can anyone explain/provide a reference as to how I can obtain Entrez ID's for my limma topTable results? I fear that I am not good enough at "R" to create my own annotation packages for this array.

When I apply the "annotateEset" function, the Entrez ID's are not returned in my topTable results which look like:

 PROBEID ID SYMBOL GENENAME logFC AveExpr t P.Value adj.P.Val B 13305218 AF109369 opn1mw1 opsin 1 (cone pigments), medium-wave-sensitive, 1 -6.1018 8.754473 -10.8657 1.08E-08 1.98E-05 10.13624 13187576 AF109373 opn1sw1 opsin 1 (cone pigments), short-wave-sensitive 1 -5.19865 8.99628 -9.02191 1.36E-07 9.19E-05 7.81309 13276068 NA NA NA -4.76802 7.529454 -7.27885 2.11E-06 0.00053 5.207056 12988871 BC134193 gngt2b guanine nucleotide binding protein (G protein), gamma transducing activity polypeptide 2b -4.75024 7.834161 -11.5697 4.49E-09 1.21E-05 10.92152 13282126 BC059650 arr3b arrestin 3b, retinal (X-arrestin) -4.297 6.427947 -6.55812 7.38E-06 0.001151 3.99979

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
CELlist <- list.celfiles("C:\\Users\\mat149\\Desktop\\GG\\CEL", full.names=TRUE, pattern=NULL, all.file=FALSE)

ph = CELdat@phenoData
ph@data[ ,1] = c("WT1","WT2","WT3","WT4","WT5","WT6","WT7","WT8","lepA_MO1","lepA_MO2","lepA_MO3","lepA_MO4","lepA_rescue1","lepA_rescue2","lepA_rescue3","lepA_rescue4")
ph@data[ ,2] = c("CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","morphant","morphant","morphant","morphant","rescue","rescue","rescue","rescue")
colnames(ph@data)[2]="Treatment"
colnames(ph@data)[1]="Sample"
groups = ph@data$Treatment f = factor(groups,levels=c("CONTROL","morphant","rescue")) eset<-rma(CELdat, background=TRUE, normalize=TRUE, subset=NULL, target="core") featureData(eset)<-getNetAffx(eset, type = "transcript") data.matrix = exprs(eset) library(affycoretools) librarypd.zebgene.1.1.st) eset.annot <- annotateEset(eset, annotation(eset)) estt <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multivals = filter,annocols=1, probecol=2) library(limma) design <- model.matrix(~ 0+factor(c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3))) colnames(design) <- c("CONTROL","lepA_MO","lepA_RS") fit <- lmFit(estt, design) contrast.matrix <- makeContrasts(lepA_MO-CONTROL, lepA_RS-CONTROL, lepA_MO-lepA_RS, levels=design) fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2) results <- decideTests(fit2,p.value=0.01,adjust="BH",lfc=2) MOWT<-topTable(fit2, coef=1,adjust="BH",p.value=0.01,lfc=2,sort.by="P",n=Inf) ADD COMMENTlink modified 3.0 years ago by James W. MacDonald51k • written 3.0 years ago by mat14940 Answer: Help: obtain Entrez ID's from limma top table with Affymetrix microarray (pd.zeb 1 3.0 years ago by United States James W. MacDonald51k wrote: Or you could use mapIds directly fd <- fData(eset) fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL") fData(eset) <- fd and then proceed. ADD COMMENTlink written 3.0 years ago by James W. MacDonald51k Thanks for the reply. I have added your script but I am still getting an error message: "Error in $<-.data.frame(*tmp*, "ENTREZID", value = list(30037 = 14730L,  :
replacement has 37237 rows, data has 75212"

The script I ran was:

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
CELlist <- list.celfiles("C:\\Users\\mat149\\Desktop\\GG\\CEL", full.names=TRUE, pattern=NULL, all.file=FALSE)

eset<-rma(CELdat, background=TRUE, normalize=TRUE, subset=NULL, target="core")
featureData(eset)<-getNetAffx(eset, type = "transcript")
data.matrix = exprs(eset)
library(affycoretools)
librarypd.zebgene.1.1.st)
eset.annot <- annotateEset(eset, annotation(eset))
eset1 <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ZFIN","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multiVals = filter,annocols=1, probecol=2)

library(org.Dr.eg.db)
keys <- keys(org.Dr.eg.db)
fd <- fData(eset1)
fd$ENTREZID <- mapIds(org.Dr.eg.db,keys=keys,fd$SYMBOL,column="ENTREZID",keytype="ENTREZID", multiVals = list) -----> error after this line
fData(eset) <- fd

featureData(eset)<-getNetAffx(eset, type = "transcript")

and this

eset.annot <- annotateEset(eset, annotation(eset))

do very similar things. Is there a particular rationale to do both? And then you do this:

eset1 <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ZFIN","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multiVals = filter,annocols=1, probecol=2)

which is a mixture of arguments for two different methods, and probably doesn't do what you think it is doing. In the end you have three (!) different ExpressionSets that are annotated using three closely related methods. You can really only use one at a time, so having three is unnecessarily complex.

The final error comes from the fact that you are doing something that makes no sense at all, and doesn't even remotely resemble what I recommended you do. You are trying to find the Entrez Gene ID for all the genes on your array, right? But then you do this:

keys <- keys(org.Dr.eg.db)

That gets all the Entrez Gene IDs in the org.Dr.eg.db package, which corresponds to all known D. rerio Entrez Gene IDs. You then do

mapIds(org.Dr.eg.db,keys=keys,fd$SYMBOL,column="ENTREZID",keytype="ENTREZID", multiVals = list) Which, translated to English is saying 'give me all the Entrez Gene IDs for D. rerio that are in the org.Dr.eg.db, based on this set of Entrez Gene IDs that I just extracted from that package'. The fact that you put fd$SYMBOL in that function call, without using an argument, means that it is in essence ignored.

You can use positional argument matching, where you put the arguments in the correct position, or you can use direct argument matching with the equals sign, but using a mixture of both will often have unintended consequences.

I gave you three lines of code that would have done exactly what you wanted to do, but you didn't use it! This doesn't have to be as complicated as you are making it, and you should NEVER be accessing S4 slots directly using the @ symbol! Rather than doing all this extra stuff that is not helpful, you can do what you want in very few lines of code.

This is what your script should look like:

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
eset<-rma(CELdat)
library(affycoretools)
eset <- annotateEset(eset, annotation(eset))
## extra annotation
fd <- fData(eset)
fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL")
fData(eset) <- fd
## modeling
library(limma)
grps <- factor(rep(c("WT","lepA_MO","lepA_rescue"), c(8,4,4)))
design <- model.matrix(~ 0+grps)
colnames(design) <- gsub("grps", "", colnames(design))
fit <- lmFit(eset, design)
contrast.matrix <- makeContrasts(lepA_MO-CONTROL, lepA_RS-CONTROL, lepA_MO-lepA_RS, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

MOWT<-topTable(fit2, coef=1, adjust="BH", p.value=0.01, lfc=2, sort.by="P", n=Inf)

Note that adding in all the phenoData isn't strictly necessary, ESPECIALLY if you add it all in and then never use it for anything! Your goal should be to simplify what you do, limit extra steps, and understand what you are doing and why.

Hello James, thanks again for your help. I am not a computer scientist by any means so forgive my ignorance.  I have followed the lines of code you have listed but I ran into this error when I tried to run the line:

fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL")

Error in .testForValidKeys(x, keys, keytype, fks) :
'keys' must be a character vector

I think I need to tell AnnotationDbi which keys/keytype from the expression set feature data to map to Entrez ID's available in org.Dre.eg.db, but I dont know how to do that (the eset contains "PROBEID" identifiers in the first column).  To answer your question, yes I am trying to obtain Entrez ID's for all genes on the array.

You are misinterpreting the problem. The error you get is

Error in .testForValidKeys(x, keys, keytype, fks) :
'keys' must be a character vector

Which is telling you that the 'keys' argument needs to be a character vector. If we look at the arguments for mapIds, we have:

> args(mapIds)
function (x, keys, column, keytype, ..., multiVals)

So the 'keys' argument is the second argument, for which you are passing fd$SYMBOL. And the error message is telling you that this must be character. You could hypothetically test to see what class that is, using class(fd$SYMBOL) (it's a factor), but that won't tell you how to 'fix' the situation. BUT Google will, and Google is almost always your best friend! Part of being an informed programmer is knowing how to answer questions. A simple Google search for 'R convert factor to character' will turn up any number of links that will tell you to use as.character.

At which point you will know that the answer is

fd$ENTREZID <- mapIds(org.Dr.eg.db, as.character(fd$SYMBOL), "ENTREZID","SYMBOL")

Thank you very much; this worked and I also learned a few neat things in the process of you explaining my fails.

Vouch for James!