Question: difference between the pdInfo package and hugene21sttranscriptcluster.db
15 months ago, Irene wrote:

What is the difference in annotation between the Bioconductor pdInfo package and hugene21sttranscriptcluster.db?

I have Affymetrix data (HuGene (sense target) 16-Array Plate). Before annotation, I performed RMA preprocessing (RMA background correction, quantile normalization, and summarization with target = "core") using the Bioconductor package oligo (Bioconductor version 3.7). With the pdInfo package, 38598 of the 53617 features have a valid geneassignment. With hugene21sttranscriptcluster.db, an Entrez ID can be assigned to 29224 of the 53617 features; of these, 1810 features do not have a valid geneassignment in the pdInfo package annotation.

How does this discrepancy occur?

Which package would you recommend for annotation?

Is it possible to get the Entrez ID from the "geneassignment" variable of the pdInfo package?

(See R-Code below)

Thanks in advance!

Best regards,



#data import
#[1] 1416100      16

ppData<-rma(rawData, target ="core")
#Background correcting
#Calculating Expression

#Features  Samples
#   53617       16

#annotation with the pdInfo package
#38598 15019

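#annotation with the package hugene21sttranscriptcluster.db
library(hugene21sttranscriptcluster.db)
x <- hugene21sttranscriptclusterENTREZID
#get the probe identifiers that are mapped to an Entrez Gene ID
mapped_probes <- mappedkeys(x)
#Convert to a list
xx <- as.list(x[mapped_probes])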

table(row.names(exprs(ppData)) %in% names(xx))
#24393 29224

#features with an Entrez ID but without a valid geneassignment
table(row.names(exprs(ppData)) %in% names(xx) & is.na(fData(ppData)$geneassignment))
#51807  1810



Answer: difference between the pdInfo package and hugene21sttranscriptcluster.db
15 months ago, James W. MacDonald (United States) wrote:

The difference between the pdInfo package and the ChipDb package (hugene20sttranscriptcluster.db) is that the former contains the relatively unprocessed annotation data from Affymetrix, whereas the latter contains only the annotations that could be mapped to NCBI Gene IDs from whatever is in the geneassignment column of the Affymetrix annotation data.

In other words, Affy uses a whole bunch of input to decide what is and isn't a gene, and then bases their probes on the set of 'genes' that they think exist. But the ChipDb packages use NCBI Gene IDs as the central ID, so if Affy says that there is a gene, based on some other annotation service, and NCBI says 'nope', then for the ChipDb package, that annotation is dropped because there isn't a Gene ID.
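The bookkeeping behind such counts can be illustrated with a toy sketch in base R. The probeset IDs, annotation strings, and Gene IDs below are all made up for illustration and are not taken from either package; the point is only that one set of features can split four ways depending on which source has an entry.

```r
# Toy data (hypothetical IDs): the Affymetrix-style annotation per feature.
# NA means there is no gene assignment for that feature.
affy <- c(ps1 = "NM_A // GENE1", ps2 = "ENST_B // GENE2",
          ps3 = NA, ps4 = "NM_D // GENE4", ps5 = NA)

# Toy ChipDb-style mapping: only features that resolve to an NCBI Gene ID.
# ps2 is absent: annotated as a gene on the Affymetrix side, but no Gene ID.
# ps3 mirrors the features in the question that have a Gene ID
# but no valid geneassignment.
chipdb <- list(ps1 = "101", ps3 = "303", ps4 = "404")

has_affy   <- !is.na(affy)
has_chipdb <- names(affy) %in% names(chipdb)

# Cross-tabulate the two sources, in the spirit of the table() calls above
counts <- table(affy = has_affy, chipdb = has_chipdb)
print(counts)
```

The off-diagonal cells are the interesting ones: features annotated by only one of the two sources.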

In general I do

library(affycoretools)  # provides annotateEset()
library(hugene20sttranscriptcluster.db)
ppData <- rma(rawData)
ppData <- annotateEset(ppData, hugene20sttranscriptcluster.db)

And then ppData can be analyzed using limma, and the topTable will then be annotated with (what I think are) the most useful annotations.

But that's because affycoretools is my own package, and I am the one who generated the ChipDb packages. You could instead say that you want Ensembl IDs and use biomaRt to get them. Either method is easier than trying to parse the Affy data in the pdInfo package - that's a non-trivial endeavor, and I am not sure what you gain will be worth the effort.
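For anyone who still wants to try the parsing route, here is a minimal base R sketch. It assumes that each geneassignment entry is laid out as "accession // symbol // title // cytoband // Entrez Gene ID", with multiple entries joined by " /// ", and that missing values appear as NA or "---". That layout is an assumption and should be checked against the actual column. The helper name extract_entrez and the example strings are hypothetical. Real data has more edge cases than this handles, which is part of why the task is non-trivial.

```r
# Assumed layout of one entry (verify against your own data):
#   accession // symbol // title // cytoband // Entrez Gene ID
# Multiple entries are assumed to be joined by " /// ".
extract_entrez <- function(x) {
  # No assignment at all: return an empty character vector
  if (is.na(x) || x == "---") return(character(0))
  entries <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  fields <- strsplit(entries, " // ", fixed = TRUE)
  # Take the fifth field of each entry, if the entry has one
  ids <- vapply(fields, function(f) {
    if (length(f) >= 5) trimws(f[5]) else NA_character_
  }, character(1))
  # Keep unique, purely numeric IDs
  unique(ids[!is.na(ids) & grepl("^[0-9]+$", ids)])
}

# Made-up example strings, not real annotation data
ga <- c(
  "NM_000001 // GENEA // gene A // 1p36.33 // 1111 /// ENST00000000001 // GENEA // gene A // 1p36.33 // 1111",
  "NR_000002 // GENEB // gene B // 2q11.2 // 2222 /// NM_000003 // GENEC // gene C // 2q11.2 // 3333",
  NA
)
entrez <- lapply(ga, extract_entrez)
```

A feature can yield zero, one, or several Gene IDs this way, so the result is kept as a list rather than forced into a single column.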
