Question

HuGene20st - lincRNA annotation

maria.maqueda • 0

Hello all,

My name is Maria Maqueda and I am working with some data from HuGene20st microarrays (at the transcript cluster level). This is not my first time working with these arrays, but it seems I am struggling with the annotation again. Mainly, I have two questions:

1) Regarding lincRNA annotation. I am obtaining around 730 lincRNA-related transcripts through hugene20sttranscriptcluster.db (v8.3.0), while in the annotation file from Affymetrix there are around 12k (mrna assignment category). Some time ago (late 2013) I already asked about this difference in lincRNA annotation (https://support.bioconductor.org/p/56347/#56349). Do you foresee any better alignment between the two?

2) Regarding the cross-hybridization category. I have obtained 2613 transcripts from hugene20sttranscriptcluster.db (v8.3.0) which have a "Mixed" cross-hybridization value in the Affymetrix annotation file. My initial idea was to keep only "main" and "unique" (X-hyb) transcripts for further analysis, but based on this result I have my doubts. Could it be an error in the Affymetrix annotation files? Does anyone have any suggestions on how to deal with these "mixed" X-hyb transcripts?

Many thanks in advance for any help you can provide.

Kind Regards,

Maria

sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.1 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] hugene20sttranscriptcluster.db_8.3.0 org.Hs.eg.db_3.1.2
[3] RSQLite_1.0.0 DBI_0.3.1
[5] AnnotationDbi_1.30.1 GenomeInfoDb_1.4.0
[7] IRanges_2.2.1 S4Vectors_0.6.0
[9] Biobase_2.28.0 BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] tools_3.2.0


Updated 9.2 years ago by James W. MacDonald • written 9.2 years ago by maria.maqueda

score 0 · Answer 1 · 2015-06-02

1.) Hypothetically yes, but it will likely take some time and further integration between the Broad's data, NONCODE, and NCBI, and perhaps some extra work on my part. The annotation database packages we supply are based on mappings between Affy's IDs and either RefSeq/GenBank or Gene IDs. So I parse the annotation files to get out all such IDs, and then run them through the pipeline to create the annotation package.

But Affy uses more than just those three databases to assign probesets to (especially non-coding) RNA transcripts. For example, here is one:

n363783 // NONCODE // accn=NULL class=lncRNA name=Human lincRNA ref=BodyMapLinc transcriptId=TCONS_00000796 cpcScore=-0.9037550 cnci=-0.1542333 // chr1 // 100 // 100 // 23 // 23 // 0 /// TCONS_00000796-XLOC_000010 // Rinn lincRNA // linc-SAMD11-2 chr1:+:840486-841186 // chr1 // 100 // 100 // 23 // 23 // 0

So that probeset has a NONCODE ID and a Broad TCONS ID, but nothing else, so it is invisible to the existing process. And the NONCODE site doesn't have any information that is useful for annotation (other than chromosomal location), so I don't know what we would put in the annotation package anyway.
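To make the point concrete, here is a minimal sketch (in Python, standard library only) of how one might split an mrna assignment value like the one quoted above into its individual entries and check which source databases it mentions. This is not the pipeline used to build the annotation package. The function name `parse_mrna_assignment`, the field names, and the `MAPPABLE_SOURCES` labels are assumptions made for illustration, and the exact source labels in your own annotation file may differ.

```python
from collections import Counter

# The mrna assignment value quoted in the answer above.
MRNA_ASSIGNMENT = (
    "n363783 // NONCODE // accn=NULL class=lncRNA name=Human lincRNA "
    "ref=BodyMapLinc transcriptId=TCONS_00000796 cpcScore=-0.9037550 "
    "cnci=-0.1542333 // chr1 // 100 // 100 // 23 // 23 // 0 /// "
    "TCONS_00000796-XLOC_000010 // Rinn lincRNA // linc-SAMD11-2 "
    "chr1:+:840486-841186 // chr1 // 100 // 100 // 23 // 23 // 0"
)

# Assumption: labels of the sources that can be mapped to RefSeq/GenBank IDs.
# These strings are illustrative; check them against your own file.
MAPPABLE_SOURCES = {"RefSeq", "GenBank"}


def parse_mrna_assignment(value):
    """Split one mrna assignment value into a list of entry dicts.

    Entries are separated by ' /// ' and fields within an entry by ' // '.
    Only the first three fields (accession, source, description) are kept.
    """
    entries = []
    for record in value.split(" /// "):
        fields = [field.strip() for field in record.split(" // ")]
        if len(fields) < 3:
            continue  # skip empty or malformed records such as '---'
        entries.append(
            {
                "accession": fields[0],
                "source": fields[1],
                "description": fields[2],
            }
        )
    return entries


entries = parse_mrna_assignment(MRNA_ASSIGNMENT)

# Count how many entries come from each source database.
sources = Counter(entry["source"] for entry in entries)

# A probeset is "visible" to an ID-based mapping only if at least one
# entry comes from a mappable source.
is_visible = any(entry["source"] in MAPPABLE_SOURCES for entry in entries)

for entry in entries:
    print(entry["accession"], "|", entry["source"])
print("visible to a RefSeq/GenBank-based mapping:", is_visible)
```

Running the same tally over every row of the annotation file would give a rough idea of how many probesets carry only NONCODE or lincRNA catalog entries, which is one way to see where the gap between roughly 730 and roughly 12k comes from.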

Many others have Havana IDs, an example being OTTHUMT00000036896, which also has no real annotatable information. At this point we seem to be at the same stage as the EST era, circa 2000, where we have lots of people saying 'look, I found something', but not much more than that.

2.) How you deal with the mixed probesets is up to you. It is certainly not possible to determine exactly what you are measuring with these probesets, so that is an argument for excluding.
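If you do decide to exclude them, a filter along the following lines is one way to go about it. This is only a sketch: the column names `category` and `crosshyb_type`, the labels `main`, `unique` and `mixed`, and the transcript cluster IDs in the sample are placeholders for illustration, so check them against the header and values of your own annotation file before relying on it.

```python
import csv
import io

# Hypothetical excerpt of an annotation table. The IDs are made up and the
# column names and labels are assumptions; a real file has many more columns.
SAMPLE_CSV = """transcript_cluster_id,category,crosshyb_type
TC_A,main,unique
TC_B,main,mixed
TC_C,control,unique
"""


def split_by_crosshyb(rows, keep_category="main", keep_crosshyb="unique"):
    """Return (kept, flagged) lists of transcript cluster IDs.

    A row is kept only if both its category and its cross-hybridization
    label match the requested values (case-insensitive).
    """
    kept, flagged = [], []
    for row in rows:
        matches = (
            row["category"].strip().lower() == keep_category
            and row["crosshyb_type"].strip().lower() == keep_crosshyb
        )
        (kept if matches else flagged).append(row["transcript_cluster_id"])
    return kept, flagged


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept, flagged = split_by_crosshyb(rows)

print("kept for analysis:", kept)
print("set aside:", flagged)
```

Keeping the flagged IDs in a separate list, rather than dropping them silently, makes it easy to check later how many of the transcripts of interest were affected by the filter.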

As for whether or not there are errors in the Affy annotation files, the answer is 'of course'. You cannot create a file based on the rapidly evolving annotation data without making some mistakes. Or a lot of mistakes, perhaps.

A co-worker of mine just re-ran an analysis of some Exon ST arrays that he had originally annotated with the na31 file, but using the updated na34 annotations. Much to his surprise, most of the annotation data had vanished. Upon contacting Affy, he was told that they knew there was a problem with the na34 data like a year ago, and that he should use the recently released na35 data. Because, you know, it's better and stuff.