Question: How to annotate the [MoGene-2_0-st] Affymetrix Mouse Gene 2.0 ST Array chip
12 months ago by
llkxiaolan0 wrote:

I am a Chinese student and my English is not very good, so some parts may be unclear. I hope you can understand.

Recently, I have been using the oligo package to analyze Affymetrix Mouse Gene 2.0 ST Array chips, but I have not been able to convert the probe IDs into gene IDs. This problem has been bothering me for a long time. I looked for information but could not solve it. Can anyone help me? Thank you very much.

Here's the code I'm using:

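library(oligo)
library(pd.mogene.2.0.st)

# Read all CEL files in the working directory
celFiles <- list.celfiles()
affyRaw <- read.celfiles(celFiles)

# Background-correct, normalize and summarize with RMA
eset <- rma(affyRaw)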

library(limma)

design <- model.matrix(~ 0+factor(c(1,1,1,2,2,2)))
colnames(design) <- c("group1", "group2")
contrast.matrix <- makeContrasts(contrasts="group2-group1",levels=design)
design
fit <- lmFit(eset, design)
fit1<- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit1)

dif<-topTable(fit2,coef="group2-group1",n=nrow(fit2),lfc=log2(2))

This is as far as I can get. I do not know how to do the ID conversion. Can anyone help me? Thank you again.

modified 12 months ago by Guido Hooiveld2.3k • written 12 months ago by llkxiaolan0
12 months ago by
Guido Hooiveld
Wageningen University, Wageningen, the Netherlands
Guido Hooiveld wrote:

The most convenient way would be to use the function annotateEset() from the package affycoretools. Use your normalized object eset as input. You can then annotate your dataset using either the corresponding PdInfo package or the ChipDb package.

The annotation info available in the PdInfo package is basically a 1:1 copy of the info made available by Affymetrix on their support pages (e.g. in the file MoGene-2_0-st-v1.na36.mm10.transcript.csv). The latter (ChipDb) is fully generated using the Bioconductor infrastructure; only the probeset -> gene ID mapping is extracted from the aforementioned csv file. Thus:

library(affycoretools)

# using the PdInfo package
eset.anno1 <- annotateEset(eset, pd.mogene.2.0.st)

# using the ChipDb package
library(mogene20sttranscriptcluster.db)
eset.anno2 <- annotateEset(eset, mogene20sttranscriptcluster.db)

Then continue with the analysis in limma using the annotated object (eset.anno1 or eset.anno2); the annotation info will be automagically added to the limma output.
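For intuition, the effect of such an annotation step can be mimicked in base R: each probeset ID is looked up in a mapping table, and the matching gene columns are attached to the results. The sketch below is only a toy illustration and not what annotateEset() does internally; the data frames `results` and `mapping` and all values in them are made up for the example.

```r
# Toy limma-style results, keyed by probeset ID (illustrative values only)
results <- data.frame(
  PROBEID = c("17507910", "17278777", "17203807"),
  logFC   = c(0.70, 0.85, -0.63),
  stringsAsFactors = FALSE
)

# Toy probeset -> gene mapping; the third probeset is deliberately missing
mapping <- data.frame(
  PROBEID = c("17278777", "17507910"),
  SYMBOL  = c("DQ267102", "Defa-rs1"),
  stringsAsFactors = FALSE
)

# match() gives the row of `mapping` for each probeset, or NA when absent
idx <- match(results$PROBEID, mapping$PROBEID)
results$SYMBOL <- mapping$SYMBOL[idx]

print(results)
```

A probeset that is absent from the mapping table ends up with NA in the SYMBOL column, which is also how missing annotation shows up in real output.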

First of all, thank you very much for your answer. I followed your method, but the problem does not seem to be solved: there are many NA values, and I don't know what causes them.

          PROBEID   ID         SYMBOL    GENENAME                                                      logFC     AveExpr   t         P.Value   adj.P.Val  B
17203807  17203807  NA         NA        NA                                                            -0.63256  1.160775  -8.28233  2.47E-05  0.6607     -2.37527
17201831  17201831  NA         NA        NA                                                            -0.81393  3.886475  -7.03522  8.36E-05  0.6607     -2.50944
17278777  17278777  NR_046306  DQ267102  snoRNA DQ267102                                               0.848883  2.678946  6.954444  9.10E-05  0.6607     -2.52004
17207623  17207623  NA         NA        NA                                                            -1.18329  2.450527  -6.82897  0.000104  0.6607     -2.53705
17507910  17507910  NM_007844  Defa-rs1  defensin, alpha, related sequence 1                           0.699768  5.049872  6.689652  0.000121  0.6607     -2.55676
17207769  17207769  NA         NA        NA                                                            0.940615  2.023429  6.606085  0.000132  0.6607     -2.56902
17202349  17202349  NA         NA        NA                                                            0.838046  5.041423  6.378744  0.00017   0.6607     -2.6041
17205531  17205531  NA         NA        NA                                                            1.06037   2.07063   6.301779  0.000185  0.6607     -2.61658
17548311  17548311  AK002956   Edv       endogenous sequence related to the Duplan murine retrovirus  0.522954  10.88666  6.270509  0.000192  0.6607     -2.62174
17548313  17548313  AK002956   Edv       endogenous sequence related to the Duplan murine retrovirus  0.522954  10.88666  6.270509  0.000192  0.6607     -2.62174
17548642  17548642  AK002956   Edv       endogenous sequence related to the Duplan murine retrovirus  0.522954  10.88666  6.270509  0.000192  0.6607     -2.62174
17548644  17548644  AK002956   Edv       endogenous sequence related to the Duplan murine retrovirus  0.522954  10.88666  6.270509  0.000192  0.6607     -2.62174
17357560  17357560  NA         NA        NA                                                            -2.66352  3.567527  -6.10183  0.000232  0.708912   -2.65052

Well, I don't fully agree with you. Your annotation 'problem' HAS been solved, because SYMBOLs and GENENAMEs were retrieved and added to your output. I do agree with you that many NAs are present. However, this has (solely) to do with the limited annotation information Affymetrix provides for this array. In other words, you have to 'blame' Affymetrix for providing such a poorly annotated csv file... (which is the basis of all annotation files).

In the thread "A: affycoretools annotateEset problem using Clariom D arrays", James MacDonald provides an informative line of code that shows the fraction of your data that could be annotated:

apply(fData(eset.anno2), 2, function(x) sum(!is.na(x))/length(x))
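To see what that line computes without loading any array data, here is a minimal sketch on a made-up feature table. The data frame `fd` and its probeset labels are hypothetical stand-ins for fData(eset.anno2); only the apply() call is the same.

```r
# Hypothetical feature data: two of the four probesets lack annotation
fd <- data.frame(
  PROBEID = c("p1", "p2", "p3", "p4"),
  ID      = c("NM_007844", NA, NA, "AK002956"),
  SYMBOL  = c("Defa-rs1", NA, NA, "Edv"),
  stringsAsFactors = FALSE
)

# Per column: number of non-missing entries divided by the number of rows
frac <- apply(fd, 2, function(x) sum(!is.na(x)) / length(x))
print(frac)

# The same fractions, written with colMeans() on the logical NA mask
frac2 <- colMeans(!is.na(fd))
```

Here PROBEID is fully populated, while ID and SYMBOL are annotated for half of the probesets.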

To reduce the number of non-annotated probe IDs, you might consider using the so-called custom-defined array definitions made by Manhong Dai from the Brain Array group here. Manhong remaps all probes present on the array to a current genome build available at e.g. the NCBI or ENSEMBL databases. Besides filtering out probes that are not specific, this has the advantage that (almost) all probe IDs are annotated. If you would like to go that way, below is some code to get you started (note: this code uses the remapped probes based on the ENTREZG database from NCBI):

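# Install the required packages first, assuming you are using Windows

library(oligo)
library(affycoretools)

# Read the CEL files using the remapped (ENTREZG) platform design package
library(pd.mogene20st.mm.entrezg)
celFiles <- list.celfiles()
affyRaw <- read.celfiles(celFiles, pkgname = "pd.mogene20st.mm.entrezg")
eset <- rma(affyRaw)

# Annotate with the matching custom annotation package
library(mogene20stmmentrezg.db)
eset.anno3 <- annotateEset(eset, mogene20stmmentrezg.db)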

Thank you very much for your reply. I'll take a closer look at it.

Sorry, there's another question I'd like to ask you. I used the code above to annotate the data, but there are some small problems in the result.

          PROBEID   ID                  SYMBOL        GENENAME                                         logFC     AveExpr   t         P.Value   adj.P.Val  B
17210850  17210850  ENSMUST00000082908  Gm26206       predicted gene, 26206                            0.018637  1.100376  0.180266  0.861197  0.996585   -4.9008
17210852  17210852  XR_398539           LOC102640548  uncharacterized LOC102640548                     -0.02858  1.205122  -0.20729  0.840699  0.995097   -4.89808
17210855  17210855  NM_008866           Lypla1        lysophospholipase 1                              0.008326  9.665614  0.050704  0.960741  0.998727   -4.90858
17210869  17210869  NM_001159750        Tcea1         transcription elongation factor A (SII) 1        0.210958  8.376269  1.464098  0.179386  0.969223   -4.4408
17210883  17210883  XR_373197           LOC102631647  uncharacterized LOC102631647                     0.08286   2.004754  0.841665  0.423177  0.972998   -4.7356
17210887  17210887  NM_133826           Atp6v1h       ATPase, H+ transporting, lysosomal V1 subunit H  0.01326   7.819807  0.180675  0.860886  0.996585   -4.90076

What does XR_398539 in the ID column mean? Also, some genes in the result have annotated names, but those names are missing from the GPL annotation file. What is the reason?

Sorry, my English is not very good. Were you able to understand my description of the problem?

Mmm, you also need to explore things yourself a bit...

XR is one of the 9 RefSeq annotation categories; the prefix XR is used to describe a 'predicted ncRNA model', here one that has been given the (numerical) ID 398539. Please note that this is a computational prediction, so no experimental evidence (yet) exists for this gene (model). See also: https://en.wikipedia.org/wiki/RefSeq (or, if that link does not work for you, here or here).
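If you want to check quickly which kind of accession each entry in the ID column is, a small lookup on the prefix can help. This is just a sketch: the helper `classify_id()` and the table `refseq_prefix` are made up for illustration, and the table covers only four common RefSeq transcript prefixes, not the complete list.

```r
# Meaning of a few common RefSeq transcript prefixes (deliberately incomplete)
refseq_prefix <- c(
  NM = "protein-coding transcript (mRNA)",
  NR = "non-coding RNA",
  XM = "predicted mRNA model",
  XR = "predicted ncRNA model"
)

# Hypothetical helper: classify IDs by the two letters before the underscore;
# IDs without such a prefix (e.g. Ensembl or GenBank accessions) give NA
classify_id <- function(ids) {
  prefix <- ifelse(grepl("^[A-Z]{2}_", ids), substr(ids, 1, 2), NA_character_)
  unname(refseq_prefix[prefix])
}

ids <- c("XR_398539", "NM_008866", "NR_046306", "ENSMUST00000082908", "AK002956")
print(data.frame(ID = ids, type = classify_id(ids)))
```

IDs that do not start with one of the listed prefixes, such as ENSMUST00000082908 (Ensembl) or AK002956 (GenBank), are left as NA rather than guessed.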

Regarding the absence of info in the GPL annotation file: I think this has to do with the fact that the annotation info at GEO was last updated in 2013 (Jan 30, 2013: annotation table updated with netaffx build 33), whereas the PdInfo package has been created with the latest Affymetrix information available, which is from January 2017 (netaffx build 36). In other words, the annotation info available at GEO is outdated.