Question

illuminaHumanv4.db versus illuminaHumanv2.db beadchip annotation differences

chris86 ▴ 420

UCL, United Kingdom

Hi,

I have data annotated with illuminaHumanv4.db and I am trying to analyse it together with data annotated with illuminaHumanv2.db. Some of the DE genes from the illuminaHumanv2.db data are not present in the newer illuminaHumanv4.db annotation.

About 1000 gene IDs have no match in my newer v4 array data. How do I match them, and is it even possible? Thanks.

These are some of the genes:

"C6orf21"       "LOC644952"     "FLJ20035"      "LOC26010"      "TRIM6"         "LOC129607"
"DKFZp564K142" "C15orf51"      "LOC731865"     "LOC641704"     "LOC727778"     "TAAR2"
"DMRTC1"        "ACCN4"         "LOC139516"     "LOC644872"     "ZNF180"        "LOC392222"
"UNQ846"        "WDR32"         "CCL7"          "ELSPBP1"       "KIAA1239"      "DMRT1"
"NAT8L"         "LOC389844"     "LOC440669"

limma microarray • 547 views

updated 7.8 years ago by James W. MacDonald 65k • written 7.8 years ago by chris86 ▴ 420

score 2 · Accepted Answer · 2016-07-06

That's a tough thing to do, as that list is a mixture of ID types. For example, C6orf21 is an open reading frame ID, and can be mapped using the org.Hs.eg.db package:

> library(org.Hs.eg.db)
> select(org.Hs.eg.db, "C6orf21", "SYMBOL", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
    ALIAS SYMBOL
1 C6orf21 LY6G6F

But LOC644952 is an identifier for a 'gene of uncertain function', and like all such things, is simply LOC prepended to the Entrez Gene ID. If you go to NCBI and search on that ID, it will tell you that they have decided it's not a thing after all. And FLJ20035 is a hypothetical protein ID, according to NCBI.
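To make the LOC point concrete, here is a minimal base R sketch that picks out the LOC-style identifiers and recovers the Entrez Gene ID underneath. The variable names are just for illustration, and the regular expression assumes that a LOC identifier is always `LOC` followed only by digits.

```r
# A few of the IDs from the question
ids <- c("C6orf21", "LOC644952", "FLJ20035", "LOC26010", "TRIM6")

# Assumption: LOC identifiers are "LOC" followed by digits only
is_loc <- grepl("^LOC[0-9]+$", ids)

# Drop the prefix to get the Entrez Gene ID as a character string
entrez_from_loc <- sub("^LOC", "", ids[is_loc])
entrez_from_loc
# [1] "644952" "26010"
```

Whether those Entrez Gene IDs still point at a live record is a separate question, which is the problem with LOC644952.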

It's hard to programmatically query a set of mixed IDs, and given the age of the Illumina Human WG-6 V2 array, it's not surprising that some of the things it was designed to measure are no longer thought to be real. You could iteratively query the vector of non-matching IDs: first using ALIAS, then stripping off the LOC and querying on the ENTREZID, then trying biomaRt or the UniProt site to get the protein IDs mapped. Or you could just accept that the upside is sometimes not worth the extra effort, and use whatever genes are in the intersection. Trying to make comparisons using two different arrays (from, I assume, very different times) is already difficult enough without adding this extra complication.
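The first two passes of that iterative approach could look roughly like the sketch below. It assumes AnnotationDbi and org.Hs.eg.db are installed. `map_valid()` is a hypothetical helper written for this example, not part of any package. It filters the keys first because `mapIds()` stops with an error when none of the supplied keys are valid for the key type. The biomaRt/UniProt pass is left out.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# A few of the unmatched v2 IDs from the question
unmatched <- c("C6orf21", "LOC644952", "FLJ20035", "TRIM6", "CCL7")

# Hypothetical helper: map IDs of one key type to current gene symbols,
# returning NA for anything the annotation package does not know about
map_valid <- function(ids, keytype) {
  out <- setNames(rep(NA_character_, length(ids)), ids)
  ok <- ids %in% keys(org.Hs.eg.db, keytype = keytype)
  if (any(ok)) {
    # multiVals = "first" keeps a single symbol when an ID is ambiguous
    out[ok] <- mapIds(org.Hs.eg.db, keys = ids[ok], column = "SYMBOL",
                      keytype = keytype, multiVals = "first")
  }
  out
}

# Pass 1: treat every ID as an alias
mapped <- map_valid(unmatched, "ALIAS")

# Pass 2: for LOC IDs that are still unmapped, strip the prefix
# and try the remainder as an Entrez Gene ID
todo <- names(mapped)[is.na(mapped) & grepl("^LOC[0-9]+$", names(mapped))]
if (length(todo) > 0) {
  mapped[todo] <- map_valid(sub("^LOC", "", todo), "ENTREZID")
}

# Named vector: old ID -> current symbol, NA where nothing was found
mapped
```

Anything still NA after both passes is a candidate for the manual biomaRt/UniProt lookup, or for simply dropping and working with the intersection.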