
Question: HuGene20st - lincRNA annotation
3.7 years ago by
European Union
maria.maqueda0 wrote:

Hello all,

My name is Maria Maqueda and I am working with some data from HuGene20st microarrays (at transcript cluster level). This is not my first time working with these arrays, but I am once again struggling with the annotation. I have two main questions:

1) Regarding lincRNA annotation. I obtain around 730 lincRNA-related transcripts through hugene20sttranscriptcluster.db (v8.3.0), while the annotation file from Affymetrix lists around 12k (mrna assignment category). Some time ago (late 2013) I already asked about this difference in lincRNA annotation (https://support.bioconductor.org/p/56347/#56349). Do you foresee any better alignment between the two?

2) Regarding cross-hybridization category. I have obtained 2613 transcripts from hugene20sttranscriptcluster.db (v8.3.0) which have a "Mixed" cross-hybridization value in the Affymetrix annotation file. My initial idea was to keep only "main" and "unique" (X-hyb) transcripts for further analysis, but based on this result I have my doubts. Could it be an error in the Affymetrix annotation files? Does anyone have any suggestions on how to deal with these "mixed" X-hyb transcripts?

Kind Regards,

Maria

sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.1 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hugene20sttranscriptcluster.db_8.3.0 org.Hs.eg.db_3.1.2
[3] RSQLite_1.0.0                        DBI_0.3.1
[5] AnnotationDbi_1.30.1                 GenomeInfoDb_1.4.0
[7] IRanges_2.2.1                        S4Vectors_0.6.0
[9] Biobase_2.28.0                       BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] tools_3.2.0

3.7 years ago by
United States
James W. MacDonald49k wrote:

1.) Hypothetically, but it will likely take some time and further integration between the Broad's data, Noncode, and NCBI, and perhaps some extra work on my part. The annotation database packages we supply are based on mappings between Affy's ID and either RefSeq/GenBank or Gene. So I parse the annotation files to get out all such IDs, and then run through the pipeline to create the annotation package.

But Affy uses more than just those three databases to assign probesets to (especially non-coding) RNA transcripts. For example, here is one:

n363783 // NONCODE // accn=NULL class=lncRNA name=Human lincRNA ref=BodyMapLinc transcriptId=TCONS_00000796 cpcScore=-0.9037550 cnci=-0.1542333 // chr1 // 100 // 100 // 23 // 23 // 0 /// TCONS_00000796-XLOC_000010 // Rinn lincRNA // linc-SAMD11-2 chr1:+:840486-841186 // chr1 // 100 // 100 // 23 // 23 // 0

So that has a noncode ID, and a Broad TCONS ID, but nothing else, so it is invisible to the existing process. And the noncode site doesn't have any information that is useful for annotation (other than chromosomal location), so I don't know what we would put in the annotation package anyway.
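To make the structure of that field concrete, here is a minimal base R sketch that splits the string quoted above into its individual assignments and checks which source databases it mentions. The `mappable_sources` vector is an assumption made for illustration (the exact source labels used in the Affy file may differ), and this is not the code used to build the annotation packages.

```r
# The mrna_assignment string quoted above, copied verbatim.
assignment <- paste0(
  "n363783 // NONCODE // accn=NULL class=lncRNA name=Human lincRNA ",
  "ref=BodyMapLinc transcriptId=TCONS_00000796 cpcScore=-0.9037550 ",
  "cnci=-0.1542333 // chr1 // 100 // 100 // 23 // 23 // 0 /// ",
  "TCONS_00000796-XLOC_000010 // Rinn lincRNA // linc-SAMD11-2 ",
  "chr1:+:840486-841186 // chr1 // 100 // 100 // 23 // 23 // 0"
)

# Individual assignments are separated by " /// ",
# and the fields within one assignment by " // ".
entries <- strsplit(assignment, " /// ", fixed = TRUE)[[1]]
fields  <- strsplit(entries, " // ", fixed = TRUE)

# Keep the first two fields of each assignment: accession and source database.
parsed <- data.frame(
  accession = vapply(fields, `[`, character(1), 1),
  source    = vapply(fields, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
print(parsed)

# ASSUMPTION: labels for the sources an ID-based mapping could pick up.
mappable_sources <- c("RefSeq", "GenBank")

# FALSE here means no assignment comes from a mappable source.
has_mappable_id <- any(parsed$source %in% mappable_sources)
print(has_mappable_id)
```

Applied to a whole column of such strings, the same split would let you tabulate how many transcript clusters are supported only by sources like NONCODE or the Broad lincRNA catalog.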

Many others have Havana IDs, an example being OTTHUMT00000036896, which also has no real annotatable information. At this point we seem to be at the same stage as the EST era, circa 2000, when we had lots of people saying 'look, I found something', but not much more than that.

2.) How you deal with the mixed probesets is up to you. It is certainly not possible to determine exactly what you are measuring with these probesets, so that is an argument for excluding.
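If you do decide to exclude them, the filtering itself is simple. The sketch below uses a made-up toy table: the IDs, the column names (`transcript_cluster_id`, `category`, `crosshyb_type`) and the way the values are coded are all assumptions, so check them against your own copy of the Affy annotation file before reusing any of this.

```r
# Toy stand-in for the Affy annotation table (all values are hypothetical).
annot <- data.frame(
  transcript_cluster_id = c("tc_1", "tc_2", "tc_3", "tc_4"),
  category              = c("main", "main", "main", "control"),
  crosshyb_type         = c("unique", "mixed", "similar", "unique"),
  stringsAsFactors = FALSE
)

# Toy expression matrix with one row per transcript cluster.
expr <- matrix(
  seq_len(8), nrow = 4,
  dimnames = list(annot$transcript_cluster_id, c("sample1", "sample2"))
)

# See how many transcript clusters fall in each cross-hybridization class.
print(table(annot$crosshyb_type))

# Keep only "main" transcript clusters whose probes hybridize uniquely.
keep     <- annot$category == "main" & annot$crosshyb_type == "unique"
filtered <- annot[keep, ]

# Subset the expression data to the retained transcript clusters.
expr_kept <- expr[filtered$transcript_cluster_id, , drop = FALSE]
print(expr_kept)
```

Tabulating the classes first is worthwhile, since it shows how much data the filter would discard before you commit to it.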

As for whether or not there are errors in the Affy annotation files, the answer is 'of course'. You cannot create a file based on the rapidly evolving annotation data without making some mistakes. Or a lot of mistakes, perhaps.

A co-worker of mine just re-ran an analysis of some Exon ST arrays that he had originally annotated with the na31 file, but using the updated na34 annotations. Much to his surprise, most of the annotation data had vanished. Upon contacting Affy, he was told that they knew there was a problem with the na34 data like a year ago, and that he should use the recently released na35 data. Because, you know, it's better and stuff.

1) Understood. I fully agree that I will most probably end up with prioritized transcripts that have no information beyond an ID.

2) Thanks for sharing your experience, it's somehow... disturbing.

So, my personal conclusion is that I will have to be very careful with the transcripts that are annotated only through the Affy annotation file, which will basically be the non-coding ones.

Cheers,

María