Question

How to annotate the [MoGene-2_0-st] Affymetrix Mouse Gene 2.0 ST Array chip

llkxiaolan ▴ 10

@llkxiaolan-13767

I am a Chinese student and my English is not very good, so some parts may be unclear. I hope you can understand.

Recently, I've been using the oligo package to analyze Affymetrix Mouse Gene 2.0 ST Array chips, but I have not been able to convert the probe IDs into gene IDs. This problem has been bothering me for a long time. I checked some information but could not solve it. Is there anyone who can help me? Thank you very much.

Here's the code I'm using:

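library(oligo)

# Read all CEL files found in the working directory
celFiles <- list.celfiles()

affyRaw <- read.celfiles(celFiles)

library(pd.mogene.2.0.st)

# RMA background correction, normalization and summarization
eset <- rma(affyRaw)

library(limma)

# Two groups of three samples each, no intercept
design <- model.matrix(~ 0 + factor(c(1, 1, 1, 2, 2, 2)))
colnames(design) <- c("group1", "group2")
contrast.matrix <- makeContrasts(contrasts = "group2-group1", levels = design)
design
fit <- lmFit(eset, design)
fit1 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit1)

# Keep probesets with at least a 2-fold change and adjusted p-value < 0.05
dif <- topTable(fit2, coef = "group2-group1", number = nrow(fit2), lfc = log2(2))
dif <- dif[dif[, "adj.P.Val"] < 0.05, ]
head(dif)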

This is as far as I can get. I do not know how to do the ID conversion. Can anyone help me? Thank you again.

annotation oligo • 4.3k views

ADD COMMENT • link updated 6.4 years ago by Guido Hooiveld ★ 3.9k • written 6.4 years ago by llkxiaolan ▴ 10

score 0 · Answer 1 · 2017-11-16

Guido Hooiveld ★ 3.9k

@guido-hooiveld-2020

Wageningen University, Wageningen, the …

The most convenient way would be to use the function annotateEset() from the package affycoretools. Use your normalized object eset as input. You can then annotate your dataset using either the corresponding PdInfo package or the ChipDb package.

The annotation info available in the PdInfo package is basically a 1:1 copy of the info made available by Affymetrix on their support pages (in e.g. the file MoGene-2_0-st-v1.na36.mm10.transcript.csv). The latter (ChipDb) is fully generated using the Bioconductor infrastructure; only the mapping probeset -> gene ID is extracted from the aforementioned csv file. Thus:

library(affycoretools)

# using the PdInfo package
eset.anno1 <- annotateEset(eset, pd.mogene.2.0.st)

# using the ChipDb package
library(mogene20sttranscriptcluster.db)
eset.anno2 <- annotateEset(eset, mogene20sttranscriptcluster.db)

Then continue with the analysis in limma using the annotated object (eset.anno1 or eset.anno2); the annotation info will be automagically added to the limma output.
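To illustrate why the annotation ends up in the limma output, here is a minimal sketch on simulated data. It assumes only that limma is installed. Instead of a real annotated ExpressionSet, it uses a random matrix and fills fit$genes by hand with made-up probe IDs and symbols; this stands in for what lmFit() does with the feature data of an annotated ExpressionSet. All object names and values below are hypothetical.

```r
library(limma)

set.seed(1)
n_probes <- 50

# Simulated log2 expression matrix: 50 probesets x 6 samples (made-up data)
expr <- matrix(rnorm(n_probes * 6), nrow = n_probes,
               dimnames = list(paste0("probe", seq_len(n_probes)),
                               paste0("sample", 1:6)))

# Same two-group layout as in the question
design <- model.matrix(~ 0 + factor(c(1, 1, 1, 2, 2, 2)))
colnames(design) <- c("group1", "group2")

fit <- lmFit(expr, design)

# Hypothetical annotation table; with an annotated ExpressionSet,
# lmFit() takes this from the feature data instead
fit$genes <- data.frame(PROBEID = rownames(expr),
                        SYMBOL  = paste0("Gene", seq_len(n_probes)),
                        stringsAsFactors = FALSE)

contrast.matrix <- makeContrasts(group2 - group1, levels = design)
fit2 <- eBayes(contrasts.fit(fit, contrast.matrix))

# The annotation columns appear in front of the statistics
tab <- topTable(fit2, coef = 1, number = Inf)
head(tab)
```

The point of the sketch is only that whatever sits in fit$genes travels along through contrasts.fit(), eBayes() and topTable(), so no separate merge step is needed afterwards.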

ADD COMMENT • link 6.4 years ago Guido Hooiveld ★ 3.9k

First of all, thank you very much for your answer. I followed your method, but the problem does not seem to be solved: there are many NA values, and I don't know what causes them.

	PROBEID	ID	SYMBOL	GENENAME	logFC	AveExpr	t	P.Value	adj.P.Val	B
17203807	17203807	NA	NA	NA	-0.63256	1.160775	-8.28233	2.47E-05	0.6607	-2.37527
17201831	17201831	NA	NA	NA	-0.81393	3.886475	-7.03522	8.36E-05	0.6607	-2.50944
17278777	17278777	NR_046306	DQ267102	snoRNA DQ267102	0.848883	2.678946	6.954444	9.10E-05	0.6607	-2.52004
17207623	17207623	NA	NA	NA	-1.18329	2.450527	-6.82897	0.000104	0.6607	-2.53705
17507910	17507910	NM_007844	Defa-rs1	defensin, alpha, related sequence 1	0.699768	5.049872	6.689652	0.000121	0.6607	-2.55676
17207769	17207769	NA	NA	NA	0.940615	2.023429	6.606085	0.000132	0.6607	-2.56902
17202349	17202349	NA	NA	NA	0.838046	5.041423	6.378744	0.00017	0.6607	-2.6041
17205531	17205531	NA	NA	NA	1.06037	2.07063	6.301779	0.000185	0.6607	-2.61658
17548311	17548311	AK002956	Edv	endogenous sequence related to the Duplan murine retrovirus	0.522954	10.88666	6.270509	0.000192	0.6607	-2.62174
17548313	17548313	AK002956	Edv	endogenous sequence related to the Duplan murine retrovirus	0.522954	10.88666	6.270509	0.000192	0.6607	-2.62174
17548642	17548642	AK002956	Edv	endogenous sequence related to the Duplan murine retrovirus	0.522954	10.88666	6.270509	0.000192	0.6607	-2.62174
17548644	17548644	AK002956	Edv	endogenous sequence related to the Duplan murine retrovirus	0.522954	10.88666	6.270509	0.000192	0.6607	-2.62174
17357560	17357560	NA	NA	NA	-2.66352	3.567527	-6.10183	0.000232	0.708912	-2.65052

ADD REPLY • link 6.4 years ago llkxiaolan ▴ 10

Well, I don't fully agree with you. Your annotation 'problem' HAS been solved, because SYMBOLs and GENENAMEs were retrieved and added to your output. I agree with you regarding the many NAs that are present. However, this has (solely) to do with the limited annotation information Affymetrix provides for this array. In other words, you have to 'blame' Affymetrix for providing such a poorly annotated csv file... (which is the basis of all annotation files).

In the thread "A: affycoretools annotateEset problem using Clariom D arrays", James MacDonald provides an informative line of code that will show you the fraction of your data that could be annotated:

apply(fData(eset.anno2), 2, function(x) sum(!is.na(x))/length(x))
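As a small illustration of what that line computes, the sketch below applies it to a toy data frame in place of fData(eset.anno2). The four rows are copied from the output posted above; the name fdat is hypothetical.

```r
# Toy stand-in for fData(eset.anno2), using four rows from the posted table
fdat <- data.frame(
  PROBEID = c("17203807", "17201831", "17278777", "17507910"),
  ID      = c(NA, NA, "NR_046306", "NM_007844"),
  SYMBOL  = c(NA, NA, "DQ267102", "Defa-rs1"),
  stringsAsFactors = FALSE
)

# Fraction of non-missing entries per annotation column
frac_annotated <- apply(fdat, 2, function(x) sum(!is.na(x)) / length(x))
print(frac_annotated)

# Equivalent, slightly shorter formulation
frac_alt <- colMeans(!is.na(fdat))
```

Every probeset has a PROBEID, so that column always gives 1; the other columns show how much of the array actually received an annotation.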

To reduce the number of non-annotated probe IDs you might consider using the so-called custom-defined array definitions made by Manhong Dai from the Brain Array group. Manhong remaps all probes present on the array to a current genome build available at e.g. the NCBI or ENSEMBL databases. In addition to filtering out probes that are not specific, another advantage is that (almost) all probe IDs are annotated. If you would like to go that way, below is some code to get you started (note: this code uses the remapped probes based on the ENTREZG database from NCBI):

#Install required packages, assuming you are using Windows
install.packages("http://mbni.org/customcdf/22.0.0/entrezg.download/pd.mogene20st.mm.entrezg_22.0.0.zip", repos = NULL)
install.packages("http://mbni.org/customcdf/22.0.0/entrezg.download/mogene20stmmentrezg.db_22.0.0.zip", repos = NULL)

library(pd.mogene20st.mm.entrezg)
celFiles <- list.celfiles()
affyRaw <- read.celfiles(celFiles, pkgname = "pd.mogene20st.mm.entrezg")
eset <- rma(affyRaw)

library(mogene20stmmentrezg.db)
eset.anno3 <- annotateEset(eset, mogene20stmmentrezg.db)

ADD REPLY • link 6.4 years ago Guido Hooiveld ★ 3.9k

Thank you very much for your reply. I'll take a closer look at it.

ADD REPLY • link 6.4 years ago llkxiaolan ▴ 10

Sorry, there's another question I'd like to ask you. I used the code above to annotate the data, but there are some small problems in the result.

	PROBEID	ID	SYMBOL	GENENAME	logFC	AveExpr	t	P.Value	adj.P.Val	B
17210850	17210850	ENSMUST00000082908	Gm26206	predicted gene, 26206	0.018637	1.100376	0.180266	0.861197	0.996585	-4.9008
17210852	17210852	XR_398539	LOC102640548	uncharacterized LOC102640548	-0.02858	1.205122	-0.20729	0.840699	0.995097	-4.89808
17210855	17210855	NM_008866	Lypla1	lysophospholipase 1	0.008326	9.665614	0.050704	0.960741	0.998727	-4.90858
17210869	17210869	NM_001159750	Tcea1	transcription elongation factor A (SII) 1	0.210958	8.376269	1.464098	0.179386	0.969223	-4.4408
17210883	17210883	XR_373197	LOC102631647	uncharacterized LOC102631647	0.08286	2.004754	0.841665	0.423177	0.972998	-4.7356
17210887	17210887	NM_133826	Atp6v1h	ATPase, H+ transporting, lysosomal V1 subunit H	0.01326	7.819807	0.180675	0.860886	0.996585	-4.90076

What does XR_398539 in the ID column mean? Also, some probesets have a gene name in this result but no name in the GPL annotation file. What is the reason?

Sorry, my English is not very good. I hope my description of the problem is understandable.

ADD REPLY • link 6.4 years ago llkxiaolan ▴ 10

Mmm, you also need to explore things yourself a bit...

XR is one of the 9 RefSeq annotation categories; the abbreviation XR is used to describe a 'predicted ncRNA model' that has been given the (numerical) ID 398539. Please note that this is a computational prediction, so no experimental evidence (yet) exists for this gene (model). See also: https://en.wikipedia.org/wiki/RefSeq
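As a quick way to read such IDs, the sketch below maps a few common RefSeq RNA prefixes to their meaning. It is a hedged illustration only: the lookup table covers just four prefixes, and the function name describe_accession is hypothetical. IDs without a known prefix, such as GenBank or Ensembl identifiers, are simply flagged as not covered.

```r
# Meaning of four common RefSeq RNA accession prefixes (not the full list)
refseq_prefix_meaning <- c(
  NM = "mRNA",
  NR = "non-coding RNA",
  XM = "predicted mRNA model",
  XR = "predicted non-coding RNA model"
)

# Hypothetical helper: look up the category from the part before the underscore
describe_accession <- function(ids) {
  prefix <- sub("_.*$", "", ids)
  out <- unname(refseq_prefix_meaning[prefix])
  out[is.na(out)] <- "not covered by this sketch"
  out
}

# IDs taken from the tables posted in this thread
ids <- c("XR_398539", "NM_008866", "NR_046306", "AK002956", "ENSMUST00000082908")
print(data.frame(ID = ids, category = describe_accession(ids)))
```

In the posted output, the XR_ and ENSMUST entries are the ones to treat with some caution, since they are model-based or come from a different database than the curated NM_ and NR_ records.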

Regarding the absence of info in the GPL annotation file: I think this has to do with the fact that the annotation info at GEO was last updated in 2013 (Jan 30, 2013: annotation table updated with netaffx build 33), whereas the PdInfo package has been created with the latest Affymetrix information available, which is from January 2017 (netaffx build 36). In other words, the annotation info available at GEO is outdated.

ADD REPLY • link 6.4 years ago Guido Hooiveld ★ 3.9k

Thank you very much for your answer. I am teaching myself bioinformatics, and the teachers and students at my school are not very familiar with it, so there are many problems I cannot solve and I can only ask for help online. I will find some material to learn from. Thank you very much for your help.

ADD REPLY • link 6.4 years ago llkxiaolan ▴ 10