Question

difference between pd.hugene.2.1.st and hugene21sttranscriptcluster.db

Irene • 0

What is the difference in annotation between the Bioconductor packages pd.hugene.2.1.st and hugene21sttranscriptcluster.db?

I have Affymetrix data (HuGene 2.1 ST (sense target) 16-array plate). Before annotation, I performed RMA preprocessing (RMA background correction, quantile normalization, and summarization with target = "core") using the Bioconductor package oligo (Bioconductor version 3.7). With pd.hugene.2.1.st, 38598 of the 53617 features have a valid geneassignment. With hugene21sttranscriptcluster.db, an Entrez ID can be assigned to 29224 of the 53617 features; of these, 1810 have no valid geneassignment when annotated with pd.hugene.2.1.st.

How does this discrepancy occur?

Which package would you recommend for annotation?

Is it possible to get the Entrez ID from the "geneassignment" variable of the pd.hugene.2.1.st package?

(See R-Code below)

Thanks in advance!

Best regards,
Irene

R-Code:

library(oligo)
library(pd.hugene.2.1.st)
library(hugene21sttranscriptcluster.db)

#data import
#celFiles: character vector of paths to the CEL files (definition not shown)
rawData<-read.celfiles(celFiles)
dim(exprs(rawData))
#[1] 1416100      16

#rma-preprocessing
ppData<-rma(rawData, target ="core")
#Background correcting
#Normalizing
#Calculating Expression

dim(ppData)
#Features  Samples
#   53617       16

#annotation with the package pd.hugene.2.1.st
annotation_pd<-getNetAffx(ppData,"transcript")
table(is.na(annotation_pd$geneassignment))
#FALSE  TRUE
#38598 15019

#annotation with the package hugene21sttranscriptcluster.db
x<-hugene21sttranscriptclusterENTREZID
#get the probe identifiers that are mapped to an Entrez Gene ID
mapped_probes<-mappedkeys(x)
#Convert to a list
xx<-as.list(x[mapped_probes])

table(row.names(exprs(ppData)) %in% names(xx))
#FALSE  TRUE
#24393 29224

table(row.names(exprs(ppData)) %in% names(xx) & is.na(annotation_pd$geneassignment))
#FALSE  TRUE
#51807  1810

annotation microarray bioconductor pd.hugene.2.1.st hugene21sttranscriptcluster.db

Updated 5.6 years ago by James W. MacDonald • written 5.6 years ago by Irene

score 3 · Answer 1 · 2018-09-12

The difference between the pdInfo package (pd.hugene.2.1.st) and the ChipDb package (hugene21sttranscriptcluster.db) is that the former contains the relatively unprocessed annotation data from Affymetrix, and the latter contains just the annotations that could be mapped to NCBI Gene IDs from whatever is in the geneassignment column of the Affymetrix annotation data.

In other words, Affy uses a whole bunch of input to decide what is and isn't a gene, and then bases their probes on the set of 'genes' that they think exist. But the ChipDb packages use NCBI Gene IDs as the central ID, so if Affy says that there is a gene, based on some other annotation service, and NCBI says 'nope', then for the ChipDb package, that annotation is dropped because there isn't a Gene ID.

In general, I do:

library(hugene21sttranscriptcluster.db)
library(affycoretools)
ppData <- rma(rawData)
ppData <- annotateEset(ppData, hugene21sttranscriptcluster.db)

And then ppData can be analyzed using limma, and the topTable will then be annotated with (what I think are) the most useful annotations.

But that's because affycoretools is my own package, and I am the one who generated the ChipDb packages. You could instead say that you want Ensembl IDs and use biomaRt to get them. Either method is easier than trying to parse the Affy data in the pdInfo package - that's a non-trivial endeavor, and I am not sure what you gain will be worth the effort.
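
For anyone who still wants to try pulling Entrez IDs out of the geneassignment strings, here is a rough base R sketch. It assumes the usual NetAffx layout, in which " /// " separates individual assignments and " // " separates the fields of each assignment, with the Entrez Gene ID as the fifth field. The helper name extract_entrez and the toy strings are hypothetical and only there for illustration; real records can be messier, which is part of why this is not the recommended route.

```r
# Hypothetical helper: pull Entrez Gene IDs out of NetAffx-style
# geneassignment strings. Assumes " /// " separates assignments and
# " // " separates fields, with the Entrez Gene ID as the fifth field.
extract_entrez <- function(geneassignment) {
  lapply(geneassignment, function(s) {
    # Probesets without any assignment yield no IDs
    if (is.na(s)) return(character(0))
    # One element per assignment
    records <- strsplit(s, " /// ", fixed = TRUE)[[1]]
    # Split each assignment into its fields
    fields <- strsplit(records, " // ", fixed = TRUE)
    # Take the fifth field where it exists
    ids <- vapply(
      fields,
      function(f) if (length(f) >= 5) trimws(f[5]) else NA_character_,
      character(1)
    )
    # Keep only purely numeric values, dropping placeholders such as "---"
    unique(ids[!is.na(ids) & grepl("^[0-9]+$", ids)])
  })
}

# Made-up strings for illustration only (not real annotation records)
toy <- c(
  "NM_000001 // GENEA // hypothetical gene A // 1p36.33 // 1111 /// ENST00000000001 // GENEA // hypothetical gene A // 1p36.33 // 1111",
  "AK000002 // --- // hypothetical cDNA clone // --- // ---",
  NA
)
entrez <- extract_entrez(toy)
# Number of distinct Entrez IDs found per probeset
lengths(entrez)
```

On real data this would be called as extract_entrez(annotation_pd$geneassignment), using the object from the question. A probeset can come back with zero, one, or several IDs, so you would still have to decide how to handle the ambiguous cases, and a description that happens to contain " // " would break the field positions.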