For the HTA2.0 arrays should not use hgu95av2.db
, which is the annotation library for the, ehh, human U95 set of Affymetrix arrays! You should rather use the hta20transcriptcluster.db
. Link here.
After modifying this, and provided you use the correct (thus HTA2.0) probeset ids, your code above should work.
Also check this thread that links so several very relevant posts on the support site dealing with the difference between the analyses of the HTA2.0 array on the level of probesets or transcripts! How to annotate the Affymetrix data HTA-2.0 with oligo
BTW: the function annotateEset
from the package affycoretools
(link) allows to easy annotate your (normalized) data object; if you do annotate the data set before 'transferring' it to limma
, then these annotation will be automagically transferred to the limma
object as well.
Some example code to get you started (I downloaded 6 CEL files (= raw data) from a random GEO submission (GSE54937).
After extracting the files in the GEO archive GSE54937_RAW.tar
into the folder GSE54937_RAW:
library(oligo)
library(affycoretools)
library(hta20transcriptcluster.db)
#set path
path <- './GSE54937_RAW'
#load and normalize (RMA) the data
raw.affy.data <- read.celfiles(filenames = list.celfiles(path, full.names=TRUE, listGzipped=TRUE) )
normalized.data <- rma(raw.affy.data, target = "core")
# add annotation data
library(hta20transcriptcluster.db)
normalized.data <- annotateEset(normalized.data, hta20transcriptcluster.db)
# check
fData(normalized.data)[5000:5010,]
PROBEID ENTREZID SYMBOL
TC01002046.hg.1 TC01002046.hg.1 <NA> <NA>
TC01002047.hg.1 TC01002047.hg.1 <NA> <NA>
TC01002048.hg.1 TC01002048.hg.1 <NA> <NA>
TC01002049.hg.1 TC01002049.hg.1 267002 PGBD2
TC01002050.hg.1 TC01002050.hg.1 653635 WASH7P
TC01002051.hg.1 TC01002051.hg.1 645520 FAM138A
TC01002052.hg.1 TC01002052.hg.1 <NA> <NA>
TC01002053.hg.1 TC01002053.hg.1 729737 LOC729737
TC01002054.hg.1 TC01002054.hg.1 <NA> <NA>
TC01002055.hg.1 TC01002055.hg.1 <NA> <NA>
TC01002056.hg.1 TC01002056.hg.1 <NA> <NA>
GENENAME
TC01002046.hg.1 <NA>
TC01002047.hg.1 <NA>
TC01002048.hg.1 <NA>
TC01002049.hg.1 piggyBac transposable element derived 2
TC01002050.hg.1 WASP family homolog 7, pseudogene
TC01002051.hg.1 family with sequence similarity 138 member A
TC01002052.hg.1 <NA>
TC01002053.hg.1 uncharacterized LOC729737
TC01002054.hg.1 <NA>
TC01002055.hg.1 <NA>
TC01002056.hg.1 <NA>
Be sure the check the help page for the function annotateEset
(by typing ?annotateEset
) for more info which annotation info can be retrieved.
Thank you so much for your explanation! I have a question about how can I get the probeIDs and are they different from the probe set ID?
The probe IDs are different from the probeset IDs, and are usually not of interest. Each probeset is made up of 4 or more probes (depending on the level of summarization you are using), and they are summarized to estimate the overall PSR or transcript expression level.
The IDs can be extracted from the pmfeature table of the
pd.hta.2.0
package:I should also point out that the fid isn't really the probe ID. For this array (and IIRC all of the random-primer arrays), there really isn't a probe ID, but instead it's just the index position of the probe when you read the array. So for example, the first probe ID for TC01000001.hg.1 is 5481205, and if you were to get the raw data for an array, it would be the 5481205th value (the data from an Affy array is just a vector of numbers).
So for a given PSR, say 18670006, the fids are
and to summarize we would take the three values at those positions in the matrix of array data and run through the RMA algorithm, which is what
rma
does for you, but for all of the probesets instead of just one.In addition to James's answer (maybe this was actually confusing you): as far as I know all annotation libraries use the term
"PROBEID"
to refer to the identifier given by the array manufacturer to a (set of) DNA sequences (=probes) that is/are used to measure the expression of a transcript. FYI: The Affymetrix platform was designed such way that a set of individual probes is used to estimate the expression of a probeSET, that in turn reflects the expression of a transcript. In contrast, on Agilent and Illumina arrays only a single DNA probe is used to measure the expression of a transcript. In other words, Agilent and Illumina arrays don't make use of SETS of probes, and hence the DNA sequences on the glass surface are [just] called probes. The length of the Affymetrix probes are relatively short (~25 nt; hence their rationale to combine multiple probes for a reliable probeset measurement), but the Agilent (60 nt) and Illumina (50 nt) probes are much longer.Oh, wait. That's probably a more accurate interpretation of the OP's question... Thanks Guido!
Thank you!
Thank you for your help!
I was wondering if you can explain how to analyze the results from the gProfiler
I don't have experience with g:Profiler myselves, but if I recall correctly it is a website that can be used for functional analyses of lists of genes. Please note that this is not (!) a Bioconducor package, so I believe this question is better asked at e.g. Biostars. Yet, if you would like to perform your analyses in R, you may want to check the CRAN package
gProfiler2
(link). I also know of the Cytoscape plugin EnrichmentMap that can be used for visualization of g:Profiler results; check for example this Nature Protocols paper (here) to get you started with that. Good luck!