Question

Questions about analysis of GeneChip Rat Gene 2.0 ST Array

0

Entering edit mode

khosoda • 0

@khosoda-7307

Last seen 9.2 years ago

Japan

Dear all,

I am quite new to Bioconductor and have tried to process data from Affymetrix GeneChip Rat Gene 2.0 ST Array. This is my first time investigating data from this array. I have a couple of questions. I would appreciate any advice in advance.

1) I have realized "affy" package does not work for data from this array, and found "oligo" package does work. My normalizing processes are the followings:

library(oligo)
CEL_list <- list.celfiles("./data", full.names=TRUE)
celfiles <- read.celfiles(CEL_list)
celfiles.rma <- rma(celfiles, target="core")
celfiles.rma
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 36685 features, 10 samples
.......

# Use getMainProbes to remove control probesets from ST arrays

library(affycoretools)
celfiles.main <- getMainProbes(celfiles.rma)
celfiles.main
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 30472 features, 10 samples
.......

I guess I removed about 6000 control probsets. Is this right way?

2) I used "ragene20sttranscriptcluster.db" for the Genechip Rat Gene 2.0 ST Array. Is this right?

3) My annotation processes with "ragene20sttranscriptcluster.db" are the followings:

library(ragene20sttranscriptcluster.db)
eset.main <- exprs(celfiles.main)
dim(eset.main)
## [1] 30472 10
affyid <- rownames(eset.main)
egids2 <- ragene20sttranscriptclusterENTREZID[affyid]
annots <- toTable(egids2)
str(annots)
## 'data.frame': 15058 obs. of 2 variables:
## $ probe_id: chr "17610314" "17610335" "17610349" "17610410" ...
## $ gene_id : chr "100910373" "502213" "308257" "308265" ...
nrow(annots) ####
## [1] 15058
eset.main.affyid <- eset.main[annots$probe_id, ]
dim(eset.main.affyid)
## [1] 15058 10

At this stage, I have deleted more than 50% of the original data. Is this the right way? Or am I wrong?

Thank you very much for your help.

Kohkichi

microarray affy oligo annotation • 2.5k views

ADD COMMENT • link 9.3 years ago • updated 9.2 years ago khosoda • 0

0

Entering edit mode

You might try the SCAN function in the SCAN.UPC package. It's designed to handle this type of array (uses the oligo package behind the scenes).

ADD REPLY • link 9.3 years ago Stephen Piccolo ▴ 590

score 1 · Answer 1 · 2015-02-02

1.) I wouldn't use getMainProbes() to filter out the controls on the Gene 2.0 ST arrays. Affy added a bunch of different probe types on the 2.0 series that pdInfoBuilder doesn't currently know about. It's not that hard to filter though; you can use the netaffxTranscript.rda that comes with that package.

load(paste0(path.package("pd.ragene.2.0.st"), "/extdata/netaffxTranscript.rda"))

annot <- pData(netaffxTranscript)

ind <- annot$Category %in% "main"

eset <- eset[ind,]

2.) Yes, that is correct.

3.) Filtering down to just those probesets with an Entrez ID is something you can do, but it's not something I would normally do. If there is some less-well annotated content with a big difference, wouldn't you want to follow up on that and figure out what it is? What if it is an Ensembl gene that Entrez hasn't annotated yet?

Also, the way you are accessing the annotation data is ignoring any probes with multiple mapping. You should be using select() instead.

annot <- select(ragene20sttranscriptcluster.db, featureNames(eset.main), c("ENTREZID","SYMBOL","GENENAME"))

This will return a message that there are one-to-many mappings, meaning that some probes either measure more than one thing, or that some probes have duplicate Entrez IDs or Symbols that haven't been reconciled yet. How you deal with that is up to you. I just do the most naive thing:

annot <- annot[!duplicated(annot[,1]),]

Anyway, extracting the data from your ExpressionSet is neither necessary nor useful. Just leave it there. You should be making comparisons next, and pretty much any of the tools in Bioconductor are happy to use an ExpressionSet for that.

score 0 · Answer 2 · 2015-02-08

Hi, Jim.

Thank you very much for your detailed comments, and sorry for delayed response. It took several days to understand what you wrote. I also found your wonderful vignette, "Creating annotated output with affyycoretools and ReportingTools".

I used netaffxTranscript to filter out the controls, according to your advice, and I got;

celfiles.main
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 29489 features, 10 samples

In my previous analysis with getMainProbes, I had gotten;

celfiles.main
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 30472 features, 10 samples

So, netaffxtranscript removed more controls than getMainprobes. Is this right?

Next, I performed annotation with select() according to your advice.

annot2 <- select(ragene20sttranscriptcluster.db, featureNames(celfiles.main), c("ENTREZID","SYMBOL","GENENAME"))
annot2 <- annot2[!duplicated(annot2[,1]),]
eset.main <- exprs(celfiles.main)
eset.main <- cbind(eset.main, annot2) ## I got some annotations.
dim(eset.main)
## [1] 29489 14

Then, I used limma.

library(limma)
fit <- lmFit(celfiles.main, design)
contrast.matrix <- makeContrasts(MCAO_control=MCAO-control, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2_eB <- eBayes(fit2)
fit2_eB_JM <- fit2_eB
all.equal(row.names(fit2_eB_JM$coef), annot2[,1])
## [1] TRUE
fit2_eB_JM$genes <- annot2 ## according to your vignette.

For probeset ID conversion,

## if multiple probe sets map to a gene, select the one with maximal IQR.
## I'm not sure whether this method is right or not.
iqrs=apply(eset.main[, 1:10], 1, IQR)
sel.rn=tapply(1:nrow(eset.main), eset.main$ENTREZID, function(x){ x[which.max(iqrs[x])] })
eset.main.egid=eset.main[sel.rn,]
rownames(eset.main.egid) <- names(sel.rn)
dim(eset.main.egid)
## [1] 12733 14
dim(eset.main)
## [1] 29489 14

So, I have deleted more than 50% of the original data again to obtain ENTREZID without NULL ID. I think that ENTREZID is necessary for gene set enrichment analysis. However, I guess those genes(?) without ENTREZID may be some transcripts such as miRNA etc and are also important. Is there any annotation method for them?

My last question is about HTML table generation. According to your vignette, I could make afile.html and afile3.html. However, afile2.html generation caused error.

htab <- HTMLReport("afile2", "MCAO model results, ReportingTools style")
publish(fit2_eB_JM, htab, eSet=celfiles.main, factor = pData(celfiles.main)$Target, coef = 1, n = nrow(topTable(fit2_eB_JM, coef=1, number=10000, p.value=0.05)))
Error in .make.gene.plots(df, eSet, factor, figure.directory, ...) : Can't find expression data for some features

celfiles.main is:
ExpressionSet (storageMode: lockedEnvironment)
assayData: 29489 features, 10 samples element names: exprs

Therefore, I'm not sure what was wrong.
I googled this error, but I could not find any solutions.

Thank you very much for your help in advance.

Kohkichi