Question

How to filter probesets from the expression set that do not contain Entrez/Gene Symbol identifiers

mat149 ▴ 80

United States

Hello,

I am running a limma linear model analysis on expression data from Affymetrix Zebrafish Gene 1.1 ST arrays. I am testing for differentially expressed genes in three contrasts, and when I call the topTable function, many probesets with NULL/NA identifiers are returned (see below). I would like to filter all probesets annotated as N/A out of the expression set object. I have read about using the genefilter package for this, but when I run the script I get an error message that I am having trouble interpreting.

library(oligo)
eset <- rma(CELdat, background=TRUE, normalize=TRUE, subset=NULL, target="core")
library(affycoretools)
eset <- annotateEset(eset, annotation(eset))

library(org.Dr.eg.db)
fd <- fData(eset)
fd$ENTREZID <- mapIds(org.Dr.eg.db, as.character(fd$SYMBOL), "ENTREZID","SYMBOL",multiVals="first")
fData(eset) <- fd

eset<- nsFilter(eset, require.entrez=TRUE,remove.dupEntrez=TRUE,feature.exclude="^AFFX")$eset
##ERROR MESSAGE HERE:
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'columns' for signature '"AffyGenePDInfo"'

library(limma)
design = model.matrix(~ 0 + f)
colnames(design)=c("control","morphant","rescue")
design
data.fit = lmFit(eset,design)
contrast.matrix = makeContrasts(morphant-control,rescue-control,morphant-rescue,levels=design)
data.fit.con = contrasts.fit(data.fit,contrast.matrix)
data.fit.eb = eBayes(data.fit.con)

TT<-topTable(data.fit.eb,coef=1,number=Inf,adjust="BH",p.value=0.01,lfc=1.5)

	PROBEID	ID	SYMBOL	GENENAME	logFC	AveExpr	t	P.Value	adj.P.Val	B
13298616	13298616	#N/A	#N/A	#N/A	3.709468	3.126387	26.0062	2.7E-14	2.03E-09	21.27417
13032560	13032560	ENSDART00000149240	SI:DKEY-24I24.3	---	3.599239	2.981615	24.849	5.42E-14	2.04E-09	20.78372
12930895	12930895	#N/A	#N/A	#N/A	4.583036	4.352965	22.63093	2.25E-13	4.24E-09	19.73525
12934103	12934103	#N/A	#N/A	#N/A	4.583036	4.352965	22.63093	2.25E-13	4.24E-09	19.73525
13281654	13281654	ENSDART00000129395	uvrag	UV radiation resistance associated gene	4.483284	5.017916	20.09213	1.37E-12	2.06E-08	18.32921
13243932	13243932	#N/A	#N/A	#N/A	2.114799	2.408192	19.44796	2.23E-12	2.8E-08	17.93177
13004206	13004206	#N/A	#N/A	#N/A	3.201574	5.338119	16.65424	2.27E-11	2.2E-07	15.98024

Secondary to filtering N/A probesets from the expression set, I cannot figure out how to write out my topTable after mapping the gene symbols to Entrez IDs.

TT<-topTable(data.fit.eb,coef=1,number=Inf,adjust="BH",p.value=0.01,lfc=1.5)

library(xlsx)

write.xlsx(TT,file="TT.xlsx",showNA=FALSE)

I get the error:

Error in if (is.na(value)) { : argument is of length zero
In addition: Warning message:
In is.na(value) : is.na() applied to non-(list or vector) of type 'NULL'

I hope my questions are clear; if not, I can try to state them more clearly. (#1: How do I remove N/A probesets from the limma analysis? #2: How do I write the topTable object to a file that contains the probeset Entrez IDs? Mapping the probeset gene symbols to Entrez IDs prevents me from doing this, and I don't know why.) Thanks for any help you can provide. Best,

Matt

oligo • 2.5k views

Updated 8.3 years ago by James W. MacDonald • written 8.3 years ago by mat149

score 0 · Answer 1 · 2016-12-19

Gordon Smyth 52k

WEHI, Melbourne, Australia

Please don't use the genefilter package. It just takes something that is very simple and turns it into something hard.

To remove NA values, you just use standard subsetting in R. For example,

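# TRUE for every probeset that has no gene symbol
i <- is.na(fData(eset)$SYMBOL)
# Fit the linear model to the annotated probesets only
data.fit <- lmFit(eset[!i,], design)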

Learning how to use standard R operations like this will pay dividends throughout your projects.

If there are still problems with this data, then you might ask a question of the annotateEset() author.

Thanks, Gordon, this code worked for removing the N/As as I hoped. I am still considering removing duplicate identifiers, but I am uncertain how removing duplicate Symbol/Entrez IDs may influence the analysis.

Reply by mat149, 8.3 years ago

Unless you have special needs, there is no need to remove multiple probes for the same gene. It's easy to do, however, if you need to.
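One common way to do this, sketched here on a toy table rather than the poster's data, is to sort by average expression and keep the first probeset seen for each gene. The data frame `tt` and its values are invented for illustration; only the column names SYMBOL and AveExpr mirror the topTable output shown in the question.

```r
# Toy stand-in for a topTable() result; all values are invented.
tt <- data.frame(
  PROBEID = c("p1", "p2", "p3", "p4", "p5"),
  SYMBOL  = c("uvrag", "uvrag", "tp53", "mdm2", "tp53"),
  AveExpr = c(5.0, 6.2, 3.1, 4.4, 2.9),
  stringsAsFactors = FALSE
)

# Sort so that the most highly expressed probeset of each gene comes first.
o <- order(tt$AveExpr, decreasing = TRUE)
tt_sorted <- tt[o, ]

# duplicated() flags the second and later occurrences of each symbol,
# so negating it keeps one probeset per gene.
tt_unique <- tt_sorted[!duplicated(tt_sorted$SYMBOL), ]
print(tt_unique)
```

The same pattern should carry over to a fitted model object by ordering on `data.fit$Amean` and testing `duplicated(data.fit$genes$SYMBOL)`, assuming the annotation columns were carried into the fit. Choosing the highest-expressed probeset is only one possible rule, and it discards information from the other probesets.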

Reply by Gordon Smyth, 8.3 years ago

score 0 · Answer 2 · 2016-12-20

You should probably use getMainProbes (from affycoretools) first, as the top probeset that you show is a 'rescue' probeset, which by definition isn't particularly interesting. As Gordon notes, the genefilter package is rather complex, and it could be argued that some of its defaults, particularly for nsFilter, are not really that useful and are not intended for random-primer style arrays (such as the one you are using).

As an aside, the Affymetrix CSV files have been changed and now contain the Entrez Gene ID in an easily parseable format. I have updated annotateEset to reflect those changes; it will take a day or two for the update to propagate through the build servers.

While it's tempting to want to remove all the NA probesets (after removing the controls), I am not sure this is really the best way to proceed. It certainly makes interpretation easier, but it assumes that all those apparently differentially expressed probesets that are not readily annotated are uninteresting. If you have filtered out any low-expressing probesets, and you still have lots of unannotated probesets in your top table, it might be worthwhile to try to figure out what you are measuring.
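The low-expression filtering mentioned above can be done with ordinary subsetting as well. The sketch below uses a toy log2 expression matrix; the cutoff of 4 and the minimum of 2 samples are hypothetical values chosen for illustration, not recommendations from this thread.

```r
# Toy log2 expression matrix: 3 probesets x 6 samples (invented values).
expr <- matrix(
  c(2.1, 2.3, 2.0, 2.2, 2.4, 2.1,   # low in every sample
    2.2, 2.1, 6.5, 6.8, 2.3, 2.0,   # expressed in two samples only
    7.9, 8.1, 8.0, 7.7, 8.2, 8.0),  # expressed in every sample
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ps1", "ps2", "ps3"), paste0("s", 1:6))
)

cutoff      <- 4  # hypothetical log2 expression threshold
min_samples <- 2  # hypothetical: size of the smallest experimental group

# Keep probesets that exceed the cutoff in at least min_samples samples.
keep <- rowSums(expr > cutoff) >= min_samples
expr_filtered <- expr[keep, , drop = FALSE]
print(rownames(expr_filtered))
```

With an ExpressionSet, the same logical vector could be computed from `exprs(eset)` and applied as `eset[keep, ]` before calling lmFit. A sensible cutoff depends on the platform and the data, so it is worth looking at the intensity distribution before fixing one.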