Questions about analysis of GeneChip Rat Gene 2.0 ST Array
2
0
Entering edit mode
khosoda • 0
@khosoda-7307
Last seen 9.2 years ago
Japan

Dear all,

I am quite new to Bioconductor and have tried to process data from Affymetrix GeneChip Rat Gene 2.0 ST Array. This is my first time investigating data from this array. I have a couple of questions. I would appreciate any advice in advance.

1) I have realized "affy" package does not work for data from this array, and found "oligo" package does work. My normalizing processes are the followings:

library(oligo)
CEL_list <- list.celfiles("./data", full.names=TRUE)
celfiles <- read.celfiles(CEL_list)
celfiles.rma <- rma(celfiles, target="core")
celfiles.rma
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 36685 features, 10 samples
.......

# Use getMainProbes to remove control probesets from ST arrays

library(affycoretools)
celfiles.main <- getMainProbes(celfiles.rma)
celfiles.main
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 30472 features, 10 samples
.......

I guess I removed about 6000 control probsets. Is this right way?

2) I used "ragene20sttranscriptcluster.db" for the Genechip Rat Gene 2.0 ST Array. Is this right?

3) My annotation processes with "ragene20sttranscriptcluster.db" are the followings:

library(ragene20sttranscriptcluster.db)
eset.main <- exprs(celfiles.main)
dim(eset.main)
## [1] 30472    10
affyid <- rownames(eset.main)
egids2 <- ragene20sttranscriptclusterENTREZID[affyid]
annots <- toTable(egids2)
str(annots)
## 'data.frame':    15058 obs. of  2 variables:
##  $ probe_id: chr  "17610314" "17610335" "17610349" "17610410" ...
##  $ gene_id : chr  "100910373" "502213" "308257" "308265" ...
nrow(annots) ####
## [1] 15058
eset.main.affyid <- eset.main[annots$probe_id, ]
dim(eset.main.affyid)
## [1] 15058    10

At this stage, I have deleted more than 50% of the original data. Is this the right way? Or am I wrong?

Thank you very much for your help.

Kohkichi

microarray affy oligo annotation • 2.5k views
ADD COMMENT
0
Entering edit mode

You might try the SCAN function in the SCAN.UPC package. It's designed to handle this type of array (uses the oligo package behind the scenes).

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 4 hours ago
United States

1.) I wouldn't use getMainProbes() to filter out the controls on the Gene 2.0 ST arrays. Affy added a bunch of different probe types on the 2.0 series that pdInfoBuilder doesn't currently know about. It's not that hard to filter though; you can use the netaffxTranscript.rda that comes with that package.

load(paste0(path.package("pd.ragene.2.0.st"), "/extdata/netaffxTranscript.rda"))

annot <- pData(netaffxTranscript)

ind <- annot$Category %in% "main"

eset <- eset[ind,]

2.) Yes, that is correct.

3.) Filtering down to just those probesets with an Entrez ID is something you can do, but it's not something I would normally do. If there is some less-well annotated content with a big difference, wouldn't you want to follow up on that and figure out what it is? What if it is an Ensembl gene that Entrez hasn't annotated yet?

Also, the way you are accessing the annotation data is ignoring any probes with multiple mapping. You should be using select() instead.

annot <- select(ragene20sttranscriptcluster.db, featureNames(eset.main), c("ENTREZID","SYMBOL","GENENAME"))

This will return a message that there are one-to-many mappings, meaning that some probes either measure more than one thing, or that some probes have duplicate Entrez IDs or Symbols that haven't been reconciled yet. How you deal with that is up to you. I just do the most naive thing:

annot <- annot[!duplicated(annot[,1]),]

Anyway, extracting the data from your ExpressionSet is neither necessary nor useful. Just leave it there. You should be making comparisons next, and pretty much any of the tools in Bioconductor are happy to use an ExpressionSet for that.

 

ADD COMMENT
0
Entering edit mode
khosoda • 0
@khosoda-7307
Last seen 9.2 years ago
Japan

Hi, Jim.

Thank you very much for your detailed comments, and sorry for delayed response. It took several days to understand what you wrote. I also found your wonderful vignette, "Creating annotated output with affyycoretools and ReportingTools".

I used netaffxTranscript to filter out the controls, according to your advice, and I got;

celfiles.main
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 29489 features, 10 samples

In my previous analysis with getMainProbes, I had gotten;

celfiles.main
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 30472 features, 10 samples

So, netaffxtranscript removed more controls than getMainprobes. Is this right?

Next, I performed annotation with select() according to your advice.

annot2 <- select(ragene20sttranscriptcluster.db, featureNames(celfiles.main), c("ENTREZID","SYMBOL","GENENAME"))
annot2 <- annot2[!duplicated(annot2[,1]),]
eset.main <- exprs(celfiles.main)
eset.main <- cbind(eset.main, annot2) ## I got some annotations.
dim(eset.main)
## [1] 29489    14

Then, I used limma.

library(limma)
fit <- lmFit(celfiles.main, design)
contrast.matrix <- makeContrasts(MCAO_control=MCAO-control, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2_eB <- eBayes(fit2)
fit2_eB_JM <- fit2_eB
all.equal(row.names(fit2_eB_JM$coef), annot2[,1])
## [1] TRUE
fit2_eB_JM$genes <- annot2  ## according to your vignette.

For probeset ID conversion,

## if multiple probe sets map to a gene, select the one with maximal IQR.
## I'm not sure whether this method is right or not.
iqrs=apply(eset.main[, 1:10], 1, IQR)
sel.rn=tapply(1:nrow(eset.main), eset.main$ENTREZID, function(x){ x[which.max(iqrs[x])] })
eset.main.egid=eset.main[sel.rn,]
rownames(eset.main.egid) <- names(sel.rn)
dim(eset.main.egid)
## [1] 12733    14
dim(eset.main)
## [1] 29489    14

So, I have deleted more than 50% of the original data again to obtain ENTREZID without NULL ID. I think that ENTREZID is necessary for gene set enrichment analysis. However, I guess those genes(?) without ENTREZID may be some transcripts such as miRNA etc and are also important. Is there any annotation method for them?

My last question is about HTML table generation. According to your vignette, I could make afile.html and afile3.html. However, afile2.html generation caused error.

htab <- HTMLReport("afile2", "MCAO model results, ReportingTools style")
publish(fit2_eB_JM, htab, eSet=celfiles.main, factor = pData(celfiles.main)$Target, coef = 1, n = nrow(topTable(fit2_eB_JM, coef=1, number=10000, p.value=0.05)))
 Error in .make.gene.plots(df, eSet, factor, figure.directory, ...) : Can't find expression data for some features

celfiles.main is:
ExpressionSet (storageMode: lockedEnvironment)
assayData: 29489 features, 10 samples element names: exprs

Therefore, I'm not sure what was wrong.
I googled this error, but I could not find any solutions.

Thank you very much for your help in advance.

Kohkichi

ADD COMMENT
1
Entering edit mode

1.) Yes, you should have fewer probesets after using the netaffxTranscript.rda. The reason being that Affy has added yet more probeset types that are not currently recognized by pdInfoBuilder, so they get mis-labeled as 'main' in the pdInfo package.

2.) I almost never subset out the duplicated genes, as I don't see the upside to doing that. The only instance where you really need to do that is when you do a GO hypergeometric or a GSEA. At that point you could argue that using IQR is a reasonable thing to do. You could also argue for filtering by t-statistic as well. There is a cost/benefit to any filtering criterion you might want to use, and it is best to try to think that through and decide what costs you are willing to incur for a given benefit.

In addition, you are right that there are lots of things that are real transcripts (or are considered real at this point in time), many of which are untranslated transcripts (lincRNAs, miRNAs, snoRNAs, etc). You can use the netaffxTranscript.rda data.frame to get at that content, but be prepared for some fun parsing all the stuff that Affy jams into that file.

3.) The error gives you a big hint. It can't find expression data for some features. That means that one of the things you passed into publish() has more features than the other thing you passed in. Or possibly different features. I can't really help with that one, but you can always do

debug(ReportingTools:::.make.gene.plots)

and step through that function to see where things went wrong.

ADD REPLY
0
Entering edit mode

Hi, Jim.

Thank you very much for your reply. 
I realized that there are a lot more things I have to understand before analyzing my data. 

Regarding the HTML generation errors, I could work around the trouble by doing:

htab <- HTMLReport("afile2", "MCAO model results, ReportingTools style")
finish(htab)
publish(fit2_eB_JM, htab, eSet=celfiles.main, coef = 1, factor = pData(celfiles.main)$Target, n = nrow(topTable(fit2_eB_JM, coef=1, number=10000, p.value=0.05)), .modifyDF = list(affyLinks, entrezLinks))

This solution came from some post by yourself somewhere in the net. :-)

Anyway, I really appreciate your kind and detailed advices.

Kohkichi

ADD REPLY
0
Entering edit mode

Hi 

How to use netaffxTranscript to remove control probesets? I'm working on Human Gene 2.0 ST array.

celfiles = list.files(path = ".", pattern = ".CEL$", all.files = FALSE,
                      full.names = FALSE, recursive = FALSE, ignore.case = FALSE)
Data <- read.celfiles(celfiles)
celfiles.rma <- rma(Data, target="core")

celfiles.rma

ExpressionSet (storageMode: lockedEnvironment)
assayData: 53617 features, 6 samples 
  element names: exprs 
protocolData
  rowNames: #8_(HuGene-2_0-st).CEL E_(HuGene-2_0-st).CEL ... T_(HuGene-2_0-st).CEL (6 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: #8_(HuGene-2_0-st).CEL E_(HuGene-2_0-st).CEL ... T_(HuGene-2_0-st).CEL (6 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.2.0.st

So now how to remove control probesets?

ADD REPLY

Login before adding your answer.

Traffic: 540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6