Transcript to gene in clariom d human affymetrix data
Entering edit mode
Last seen 4.9 years ago

I am currently working with Clariom D Human Microarray. After normalization of the data with RMA{oligo}, I get  138745   features which I need to annotate for the downstream analysis that includes differential expression and gene enrichment steps. I would like to assign a gene identifier to a given probeid using the select() function of {AnnotationDbi}:

>annotation_testdata <- AnnotationDbi::select(clariomdhumantranscriptcluster.db, featureNames(all.eset), c("SYMBOL","GENENAME","ENTREZID", "ENSEMBL"))

We have only 31575 features of which we have gene identifiers:

> length(which(!$SYMBOL)))
[1] 31575
> length(which(!$GENENAME)))
[1] 31575
> length(which(!$ENTREZID)))
[1] 31575
> length(which(!$ENSEMBL)))
[1] 29625
> length(unique(annotation_testdata$ENTREZID))

I know that there are some features that just serve to control but I imagine that not as much ??

Thank you in advance for your help

annotation clariomdhumantranscriptcluster.db oligo • 3.3k views
Entering edit mode
Last seen 41 minutes ago
United States

The Clariom D array contains lots of content that is either speculative or non-translated. This includes things like lincRNAs and ESTs, neither of which, as a rule, have Entrez Gene IDs. ESTs and other inferred transcripts are just that - things that might be real, but maybe not. The same thing is true for lincRNAs there is some evidence that these things are transcribed and might actually do something, but it is somewhat preliminary at this time.

The annotation packages are based on Entrez Gene IDs, and to get into that database, there has to be convincing evidence for the existence of a gene (right now I think there are 20-30K such genes). So it's not that surprising that about 108K of the stuff Affy put on that array doesn't have an Entrez Gene ID.

You might do better to annotate using the annotation csv that Affy distributes. The easiest way to do that would be to use my affycoretools package.


all.eset <- annotateEset(all.eset, pd.clariom.d.human)

And now the annotation is in the featureData slot of your ExpressionSet, and will be automatically used by tools like limma, to annotate your topTable results. You can also access the featureData using the fData accessor:

annotation_testdata <- fData(all.eset)
Entering edit mode
Last seen 4.9 years ago
Thank you very much for your help. There is another question that I do not understand,  is the difference between number of gene (ID), choosing target = "core" or "probeset" in the RMA normalization step?

> Probeset.eset=oligo::rma(oligo::read.celfiles(list.files()),target="probeset")
> Trancript.eset=oligo::rma(oligo::read.celfiles(list.files()),target="core")
> Transcript.eset <- annotateEset(Trancript.eset, pd.clariom.d.human)
> Probeset.eset <- annotateEset(Probeset.eset, pd.clariom.d.human)

Erreur : There appears to be a mismatch between the ExpressionSet and the annotation data.
Please ensure that the summarization level for the ExpressionSet and the 'type' argument are the same. See ?annotateEset for more information on the type argument.

Of the blow, I annotated whith clariomdhumanprobeset.db :

>Probeset.eset <- annotateEset(Probeset.eset, clariomdhumanprobeset.db)

Here's what I get  for Target="probeset",  knowing that numbre of probeset obtained after rma 1562457 :

> annotation_Probeset <- fData(Probeset.eset)
> length(which(!$ENTREZID)))
[1] 695228
> length(unique(annotation_Probeset$ENTREZID))
[1] 26192

Here's what I get  for Target="core",  knowing that numbre of transcript obtained after rma 138745  :

> annotation_Trancript <- fData(Transcript.eset)
> length(which(!$ID)))
[1] 86161
> length(unique(annotation_Trancript$ID))
[1] 86152

What I do not understand is why so much difference on the numbre of unique geneID  when we work on "Probeset" or "transcripts". According to what I understood: The summarization is as follows:

probes ---> probesets ---> transcritps ---> genes.


Probes ---> Transcripts ---> genes

So in the end, whatever the target we choose at the beginning ("core", "transcript"), we get the same number of gene (ID) ??


Entering edit mode

Please use the ADD COMMENT link to add further comments or questions. If you use the Add your answer box, it looks like you are answering a question, which clearly you are not.

The summarization doesn't work the way you think, really. There are probes, and Affy has collected them into groups (probe set regions, or PSRs) that are intended to measure portions of an exon, or exon-exon junctions. This is the probeset level summarization, and you only get some information about the particular PSR being interrogated, but you can then hypothetically use that information to model differential exon usage.

Affy has also defined a collection of PSRs that measure all the exons for a set of known transcripts. The expression values for this summarization level are intended to give an estimate of the transcript abundance, which may differ between transcripts for a given gene (or may not - these days Affy recycles the probes quite a bit, so two transcripts for a given gene might only differ by a couple of PSR probesets).

But at no time is there any summarization at the gene level, because there is no such thing. Genes are regions defined on a genome, which give rise to transcripts of (possibly) varying lengths, and that is what we measure. It just happens that it is easier to think about 'gene' expression because it's easier to interpret higher/lower expression for a given gene, and much more difficult to interpret a combination of higher/lower expression for the N different transcripts of that gene, not to mention accurately estimating the transcription levels for all those transcripts.

You could hypothetically summarize all the transcript level probesets to a 'gene' level, by for example computing the mean of all the transcript probesets for that gene, or maybe picking the 'best' one, or just randomly picking one transcript per gene. But that's up to you.

So if you summarize at the probeset level, you get estimates for all the PSR and JUC probesets, and if you summarize at the core level you get estimates for all the various transcripts that Affy say are being measured. You should get lots more values if you summarize at the probeset level as compared to the transcript level.

Because of this, there are two different annotation packages. There are the clariomdhumanprobeset.db and the clariomdhumantranscript.db packages, and if you use the internal Affy data that comes with the pd.clariom.d.human package (with annotateEset), you specify using the 'type' argument, specifying either 'core' (the default) or 'probeset'.

Entering edit mode

Hi James, thank you again for your help. So the level-transcript is estimated from JUCs or PSRs probes?


Login before adding your answer.

Traffic: 702 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6