how to annotate arrays
1
0
Entering edit mode
@ben_cossins-8532
Last seen 3.1 years ago
United Kingdom

So I’ve loaded data from the .CEL files into expression/feature/formal class sets (also have an annotated dataframe)

I've been able to do some basic visualisation on it through the oligo library, I can also create heatplots of the top # number of results (though not accounting for multiple testing) (trying to run it on full array resulted in R uing over 15GB of memory and PC requiring restart to become responsive)

however, using the command geneName=getNetAffx(celfiles, type="probeset") whilst creating an annotated dataframe, contains no useful data (see structure pasted below, hopefully I got my formatting right)

R identifies the arrays as pd.mogene.2.0.st

So my question is what am I doing wrong with regards getting gene names for the data (to see what expression changes there are)

affy annotation microarray oligo • 1.6k views
3
Entering edit mode
@james-w-macdonald-5106
Last seen 16 minutes ago
United States

Annotating your data that way is probably not that useful. The raw Affy annotation is, shall we say, messy.  Also note that the default summarization for Gene ST arrays (and the level that IMO you should only ever summarize the Gene ST arrays) is the 'core' or 'transcript' level, and you asked for the 'probeset' level annotation, so you have a mismatch. In addition, there are any number of things on these arrays that won't have any annotation per se. The first N probesets on the array are classified as normgene->intron, which is a background control probe that is supposed to be a background because they are complementary to intronic sequences. Because Affy somehow seems to think you wouldn't get signal from introns when using random primers, which just boggles the mind, but whatever.

Anyway, don't do that. Instead, you want to use the hugene20sttranscriptcluster.db package. You most likely want to use the select() function:

library(hugene20sttranscriptcluster.db)

gns <- select(hugene20sttranscriptcluster.db, featureNames(celfiles), c("ENTREZID","SYMBOL","GENENAME"))

But do note that a.) this will return one-to-many mappings, because a given probeset may interrogate multiple things, and b.) for this release, it will do that silently - you won't get a warning, so you have to remember this! I usually do the most naive thing possible, which is

gns <- gns[!duplicated(gns[,1]),]

And then you could stuff that into an AnnotationDataFrame and put it in your ExpressionSet. If you forget to subset and try to put that into your ExpressionSet, you will get a warning, so at least there is some protection when using that route.

0
Entering edit mode

first off, the reason I had probesets was that the PhD student who I'm working with (I'm doing my masters thesis) copied that line of code from his script for me to use.

secondly, this is being done with mouse genes, so I assume that I replace the hugene20st with mogene20st

trying to use the commands I get the message

gns <- select(mogene20sttranscriptcluster.db, featureNames(celfiles), c("ENTREZID","SYMBOL","GENENAME"))
Error in .testForValidKeys(x, keys, keytype) :
None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.

Tried initially with the hugene, swapped to mogene thinking it was just wrong IDs and got same output

thanks for your help so far

1
Entering edit mode

Oh yeah, I forgot. MoGene...

I made the assumption (given the sparsity of your code) that you had an ExpressionSet called 'celfiles' that you generated doing something like

firstobj <- read.celfiles(list.celfiles())
celfiles <- rma(firstobj)

In which case featureNames(celfiles) would give you all the Affy IDs for that array. But you have evidently done something else, so you should probably give us some more code showing what the 'celfiles' object is, and any modifications you have made to it.

0
Entering edit mode

that is what I did,

celfiles is an expression set, whereas geneName is an annotated dataframe generated using geneName=getNetAffx(celfiles, type="probeset")

I posted that because that was the structure of the annotation I was getting at the time

as for my code, so far I have this (I've left of the graphs created as those aren’t relevant here

source("http://bioconductor.org/biocLite.R")
biocLite()
library(Biobase)
library(oligo)
library(limma)
#library(affy)                # needed wih oligo library??
library(Heatplus)

setwd('/Pancreas_CEL_Files/')

norm = rma(celfiles)            # normalise expression data

design = model.matrix(~ index, data=pData(norm))
design[1:4,2] = 1            # WT,
design[5:8,2] = 2            # KO,
design[9:12,2] = 3            # 2X,
lm1 = lmFit(exprs(norm), design)
lm1 = eBayes(lm1)
geneID = rownames(topTable(lm1, coef=2, num=200, adjust="none", p.value=0.05))
length(geneID)
norm2 = norm[geneID,]
1
Entering edit mode

This part

design = model.matrix(~ index, data=pData(norm))
design[1:4,2] = 1            # WT,
design[5:8,2] = 2            # KO,
design[9:12,2] = 3            # 2X,

is almost surely not what you want to be doing. Can you say why you are doing that?

Your ExpressionSet is called 'norm', so you would annotate by doing something like

gns <- select(mogene20sttranscriptcluster.db, featureNames(norm), c("ENTREZID","SYMBOL","GENENAME"))

I use those three annotations - see columns(mogene20sttranscriptcluster.db) for more choices. Then

gns <- gns[!duplicated(gns[,1]),]
if(isTRUE(all.equal(row.names(lm1$coef), gns[,1])) lm1$genes <- gns

Then

topTable(lm1, 2)

will have all the annotations included.

0
Entering edit mode

the reason for putting those ranges in was me trying to replicate what I'd seen using example data for the "Heatplus" tutoria (which had case/control, and coded those 0/1, having rerun without I see I dont need it to get the clustering, (though the heatmap isn't quite as petty and lacks the clear cut differentiation between them)

having run that code it works perfectly (was initially worried as the first 1000 lines shown were all NA) but I now have the names showing, amazed that its around ~14k NA probes there though.

Thank you for helping me with this

0
Entering edit mode

James W. MacDonald Sorry to comment on an old thread.

But I have been trying to use hugene20sttranscriptcluster.db to get entrezids for the transcript cluster ids and I get an error message which i am really not sure what it means. Following is the code:

# Details of the platform:

GPL5175 [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version]

oligobatch=oligo::read.celfiles(filenames = file.path(myd,SDRF\$ID),verbose = FALSE, phenoData = SDRF)

# Normalized data

RMAmyData=oligo::rma(oligobatch, target = "core") # I assume that I have summarize the probes at transcript level

# gene symbols using getNetAffy

featureData(RMAmyData)= getNetAffx(RMAmyData, "transcript") # This works but I would like to have entrezids and getNetAffx doesn't annotate them.

# Input file for annotation looks like this

head(featureNames(RMAmyData)) [1] "2315554" "2315633" "2315674" "2315739" "2315894" "2315918" # I assume that these are transcript ids. Not sure though!

# Annotation

EntrezIDmyData <- AnnotationDbi::select(hugene20sttranscriptcluster.db, featureNames(RMAmyData), c("ENTREZID","SYMBOL"))

Error in .testForValidKeys(x, keys, keytype, fks) : None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.

I am confused with how the error reads. Does that mean that my data has not been summarized from exon-level to transcript level? Or it's just a generic error and where I should just focus on trouble shooting annotation step using 'hugene20sttranscriptcluster.db'. If yes, where am I going wrong?

I would really appreciate any response. Thanks.

0
Entering edit mode

Don't comment on old posts. Make a new one instead