Question

Help: obtain Entrez ID's from limma top table with Affymetrix microarray (pd.zebgene.1.1.st)

1

Entering edit mode

mat149 ▴ 80

@mat149-11450

Last seen 7 weeks ago

United States

Hello,

I am working with Affymetrix 1.1 ST zebrafish gene arrays. The script I have attached works but the pd.zebgene.1.1.st annotation package for these particular arrays do not contain Entrez ID's for each probeset on the platform. I have seen that expression sets can be annotated using, for instance, library(affycoretools) and library(hugene11sttranscriptcluster.db). My problem is that the zebrafish Affy 1.1 ST gene arrays do not contain transcriptcluster.db or probesetcluster.db packages that are available for the human/mouse/rat analogues. My question is: Can anyone explain/provide a reference as to how I can obtain Entrez ID's for my limma topTable results? I fear that I am not good enough at "R" to create my own annotation packages for this array.

When I apply the "annotateEset" function, the Entrez ID's are not returned in my topTable results which look like:

PROBEID	ID	SYMBOL	GENENAME	logFC	AveExpr	t	P.Value	adj.P.Val	B
13305218	AF109369	opn1mw1	opsin 1 (cone pigments), medium-wave-sensitive, 1	-6.1018	8.754473	-10.8657	1.08E-08	1.98E-05	10.13624
13187576	AF109373	opn1sw1	opsin 1 (cone pigments), short-wave-sensitive 1	-5.19865	8.99628	-9.02191	1.36E-07	9.19E-05	7.81309
13276068	NA	NA	NA	-4.76802	7.529454	-7.27885	2.11E-06	0.00053	5.207056
12988871	BC134193	gngt2b	guanine nucleotide binding protein (G protein), gamma transducing activity polypeptide 2b	-4.75024	7.834161	-11.5697	4.49E-09	1.21E-05	10.92152
13282126	BC059650	arr3b	arrestin 3b, retinal (X-arrestin)	-4.297	6.427947	-6.55812	7.38E-06	0.001151	3.99979

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
CELlist <- list.celfiles("C:\\Users\\mat149\\Desktop\\GG\\CEL", full.names=TRUE, pattern=NULL, all.file=FALSE)
pdat<-read.AnnotatedDataFrame("C:\\Users\\mat149\\Desktop\\GG\\CEL\\pheno16.txt",header=TRUE,row.name="Filename",sep="\t")
CELdat <- read.celfiles(filenames = CELlist,experimentData=TRUE,phenoData=pdat,verbose=TRUE)

ph = CELdat@phenoData
ph@data[ ,1] = c("WT1","WT2","WT3","WT4","WT5","WT6","WT7","WT8","lepA_MO1","lepA_MO2","lepA_MO3","lepA_MO4","lepA_rescue1","lepA_rescue2","lepA_rescue3","lepA_rescue4")
ph@data[ ,2] = c("CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","CONTROL","morphant","morphant","morphant","morphant","rescue","rescue","rescue","rescue")
colnames(ph@data)[2]="Treatment"
colnames(ph@data)[1]="Sample"
groups = ph@data$Treatment
f = factor(groups,levels=c("CONTROL","morphant","rescue"))

eset<-rma(CELdat, background=TRUE, normalize=TRUE, subset=NULL, target="core")
featureData(eset)<-getNetAffx(eset, type = "transcript")
data.matrix = exprs(eset)
library(affycoretools)
librarypd.zebgene.1.1.st)
eset.annot <- annotateEset(eset, annotation(eset))
estt <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multivals = filter,annocols=1, probecol=2)

library(limma)
design <- model.matrix(~ 0+factor(c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)))
colnames(design) <- c("CONTROL","lepA_MO","lepA_RS")
fit <- lmFit(estt, design)
contrast.matrix <- makeContrasts(lepA_MO-CONTROL, lepA_RS-CONTROL, lepA_MO-lepA_RS, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
results <- decideTests(fit2,p.value=0.01,adjust="BH",lfc=2)

MOWT<-topTable(fit2, coef=1,adjust="BH",p.value=0.01,lfc=2,sort.by="P",n=Inf)

affmetrix microarray limma oligo • 2.3k views

ADD COMMENT • link updated 9.0 years ago by James W. MacDonald 68k • written 9.0 years ago by mat149 ▴ 80

score 1 · Answer 1 · 2016-11-04

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Or you could use mapIds directly

fd <- fData(eset)

fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL")

fData(eset) <- fd

and then proceed.

ADD COMMENT • link 9.0 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks for the reply. I have added your script but I am still getting an error message:

"Error in `$<-.data.frame`(`*tmp*`, "ENTREZID", value = list(`30037` = 14730L, :
replacement has 37237 rows, data has 75212"

The script I ran was:

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
CELlist <- list.celfiles("C:\\Users\\mat149\\Desktop\\GG\\CEL", full.names=TRUE, pattern=NULL, all.file=FALSE)
pdat<-read.AnnotatedDataFrame("C:\\Users\\mat149\\Desktop\\GG\\CEL\\pheno16.txt",header=TRUE,row.name="Filename",sep="\t")
CELdat <- read.celfiles(filenames = CELlist,experimentData=TRUE,phenoData=pdat,verbose=TRUE)

eset<-rma(CELdat, background=TRUE, normalize=TRUE, subset=NULL, target="core")
featureData(eset)<-getNetAffx(eset, type = "transcript")
data.matrix = exprs(eset)
library(affycoretools)
librarypd.zebgene.1.1.st)
eset.annot <- annotateEset(eset, annotation(eset))
eset1 <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ZFIN","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multiVals = filter,annocols=1, probecol=2)

library(org.Dr.eg.db)
keys <- keys(org.Dr.eg.db)
fd <- fData(eset1)
fd$ENTREZID <- mapIds(org.Dr.eg.db,keys=keys,fd$SYMBOL,column="ENTREZID",keytype="ENTREZID", multiVals = list) -----> error after this line
fData(eset) <- fd

ADD REPLY • link 9.0 years ago mat149 ▴ 80

0

Entering edit mode

You are doing a bunch of extra things that don't really help your cause. This:

featureData(eset)<-getNetAffx(eset, type = "transcript")

and this

eset.annot <- annotateEset(eset, annotation(eset))

do very similar things. Is there a particular rationale to do both? And then you do this:

eset1 <- annotateEset(eset.annot, pd.zebgene.1.1.st,columns = c("PROBEID","ZFIN","ENSEMBL","ALIAS","ENTREZID","SYMBOL","GENENAME"),multiVals = filter,annocols=1, probecol=2)

which is a mixture of arguments for two different methods, and probably doesn't do what you think it is doing. In the end you have three (!) different ExpressionSets that are annotated using three closely related methods. You can really only use one at a time, so having three is unnecessarily complex.

The final error comes from the fact that you are doing something that makes no sense at all, and doesn't even remotely resemble what I recommended you do. You are trying to find the Entrez Gene ID for all the genes on your array, right? But then you do this:

keys <- keys(org.Dr.eg.db)

That gets all the Entrez Gene IDs in the org.Dr.eg.db package, which corresponds to all known D. rerio Entrez Gene IDs. You then do

mapIds(org.Dr.eg.db,keys=keys,fd$SYMBOL,column="ENTREZID",keytype="ENTREZID", multiVals = list)

Which, translated to English is saying 'give me all the Entrez Gene IDs for D. rerio that are in the org.Dr.eg.db, based on this set of Entrez Gene IDs that I just extracted from that package'. The fact that you put fd$SYMBOL in that function call, without using an argument, means that it is in essence ignored.

You can use positional argument matching, where you put the arguments in the correct position, or you can use direct argument matching with the equals sign, but using a mixture of both will often have unintended consequences.

I gave you three lines of code that would have done exactly what you wanted to do, but you didn't use it! This doesn't have to be as complicated as you are making it, and you should NEVER be accessing S4 slots directly using the @ symbol! Rather than doing all this extra stuff that is not helpful, you can do what you want in very few lines of code.

This is what your script should look like:

setwd("C:\\Users\\mat149\\Desktop\\GG")
library(oligo)
CELdat <- read.celfiles(list.celfiles())
eset<-rma(CELdat)
library(affycoretools)
eset <- annotateEset(eset, annotation(eset))
## extra annotation
fd <- fData(eset)
fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL")
fData(eset) <- fd
## modeling
library(limma)
grps <- factor(rep(c("WT","lepA_MO","lepA_rescue"), c(8,4,4)))
design <- model.matrix(~ 0+grps)
colnames(design) <- gsub("grps", "", colnames(design))
fit <- lmFit(eset, design)
contrast.matrix <- makeContrasts(lepA_MO-CONTROL, lepA_RS-CONTROL, lepA_MO-lepA_RS, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
results <- decideTests(fit2,p.value=0.01,adjust="BH",lfc=2)

MOWT<-topTable(fit2, coef=1, adjust="BH", p.value=0.01, lfc=2, sort.by="P", n=Inf)

Note that adding in all the phenoData isn't strictly necessary, ESPECIALLY if you add it all in and then never use it for anything! Your goal should be to simplify what you do, limit extra steps, and understand what you are doing and why.

ADD REPLY • link 9.0 years ago James W. MacDonald 68k

0

Entering edit mode

Hello James, thanks again for your help. I am not a computer scientist by any means so forgive my ignorance. I have followed the lines of code you have listed but I ran into this error when I tried to run the line:

fd$ENTREZID <- mapIds(org.Dr.eg.db, fd$SYMBOL, "ENTREZID","SYMBOL")

Error in .testForValidKeys(x, keys, keytype, fks) :
'keys' must be a character vector

I think I need to tell AnnotationDbi which keys/keytype from the expression set feature data to map to Entrez ID's available in org.Dre.eg.db, but I dont know how to do that (the eset contains "PROBEID" identifiers in the first column). To answer your question, yes I am trying to obtain Entrez ID's for all genes on the array.

ADD REPLY • link 9.0 years ago mat149 ▴ 80

0

Entering edit mode

You are misinterpreting the problem. The error you get is

Error in .testForValidKeys(x, keys, keytype, fks) : 
  'keys' must be a character vector

Which is telling you that the 'keys' argument needs to be a character vector. If we look at the arguments for mapIds, we have:

> args(mapIds)
function (x, keys, column, keytype, ..., multiVals)

So the 'keys' argument is the second argument, for which you are passing fd$SYMBOL. And the error message is telling you that this must be character. You could hypothetically test to see what class that is, using class(fd$SYMBOL) (it's a factor), but that won't tell you how to 'fix' the situation. BUT Google will, and Google is almost always your best friend! Part of being an informed programmer is knowing how to answer questions. A simple Google search for 'R convert factor to character' will turn up any number of links that will tell you to use as.character.

At which point you will know that the answer is

fd$ENTREZID <- mapIds(org.Dr.eg.db, as.character(fd$SYMBOL), "ENTREZID","SYMBOL")

ADD REPLY • link 9.0 years ago James W. MacDonald 68k

0

Entering edit mode

Thank you very much; this worked and I also learned a few neat things in the process of you explaining my fails.

Vouch for James!

ADD REPLY • link 9.0 years ago mat149 ▴ 80