illuminaHumanv4.db versus illuminaHumanv2.db beadchip annotation differences
chris86 ▴ 420
@chris86-8408
Last seen 4.4 years ago
UCL, United Kingdom

Hi

I have data from the illuminaHumanv4.db platform and I am trying to analyse it together with data from the illuminaHumanv2.db platform. Some of the genes I see in my DE genes from illuminaHumanv2.db are not present in the newer illuminaHumanv4.db annotation.

About 1000 gene IDs have no match in my newer v4 array data. How do I match them? Is it even possible? Thanks.

Here are some of them:

"C6orf21"       "LOC644952"     "FLJ20035"      "LOC26010"      "TRIM6"         "LOC129607"    
"DKFZp564K142"  "C15orf51"      "LOC731865"     "LOC641704"     "LOC727778"     "TAAR2"        
"DMRTC1"        "ACCN4"         "LOC139516"     "LOC644872"     "ZNF180"        "LOC392222"    
"UNQ846"        "WDR32"         "CCL7"          "ELSPBP1"       "KIAA1239"      "DMRT1"        
"NAT8L"         "LOC389844"     "LOC440669"    

limma microarray • 547 views
@james-w-macdonald-5106
Last seen 3 hours ago
United States

That's a tough thing to do, because that list is a mixture of ID types. For example, C6orf21 is an open reading frame ID, and can be mapped using the org.Hs.eg.db package:

>  select(org.Hs.eg.db,"C6orf21", "SYMBOL","ALIAS")
'select()' returned 1:1 mapping between keys and columns
    ALIAS SYMBOL
1 C6orf21 LY6G6F

But LOC644952 is an identifier for a 'gene of uncertain function', and like all such things, is simply LOC prepended to the Entrez Gene ID. And if you go to NCBI and search on that ID, it will tell you that they have decided it's not a thing after all. And FLJ20035 is a hypothetical protein ID, according to NCBI.
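As a small base R sketch of the LOC convention just described, you could flag the LOC-style IDs and pull out the number, which is the Entrez Gene ID. The vector `ids` is just a few entries from the question, and the names `is_loc` and `entrez` are made up for illustration.

```r
# A few of the unmatched IDs from the question
ids <- c("C6orf21", "LOC644952", "FLJ20035", "LOC26010")

# LOC-style IDs are "LOC" followed only by digits
is_loc <- grepl("^LOC[0-9]+$", ids)

# Strip the prefix to get the Entrez Gene ID; other ID types stay NA
entrez <- ifelse(is_loc, sub("^LOC", "", ids), NA_character_)

data.frame(id = ids, is_loc = is_loc, entrez = entrez)
```

Whether a given Entrez Gene ID is still a live record is a separate question, which is why the number alone does not guarantee a match.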

It's hard to programmatically query on a set of mixed IDs, and given the age of the Illumina Human WG-6 V2 array, it's not surprising that some of the things it was designed to query are no longer thought to be real. You could iteratively query the vector of non-matching IDs: first using ALIAS, then stripping off the LOC and querying on the ENTREZID, then maybe trying biomaRt or the UniProt site to get the protein IDs mapped. Or you could just accept that the upside is sometimes not worth the extra effort, and use whatever genes are in the intersection. Trying to make comparisons using two different arrays (from, I assume, very different times) is already difficult enough without adding in this extra complication.
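A minimal sketch of the first two passes described above, assuming org.Hs.eg.db is installed. The vector `unmatched` and the helper `safe_map()` are made up for illustration, so substitute your own vector of non-matching v2 IDs. Note that `multiVals = "first"` arbitrarily keeps one symbol when an alias maps to several genes, so ambiguous hits deserve a manual look.

```r
library(org.Hs.eg.db)  # attaches AnnotationDbi, which provides keys() and mapIds()

# Hypothetical stand-in for the vector of v2 IDs with no match on v4
unmatched <- c("C6orf21", "LOC644952", "FLJ20035", "TRIM6")

# Hypothetical helper: map only the IDs that exist for this keytype,
# because mapIds() stops with an error when none of the keys are valid.
safe_map <- function(ids, keytype) {
  ok <- intersect(ids, keys(org.Hs.eg.db, keytype = keytype))
  if (length(ok) == 0) return(setNames(character(0), character(0)))
  mapIds(org.Hs.eg.db, keys = ok, column = "SYMBOL",
         keytype = keytype, multiVals = "first")
}

# Pass 1: treat each ID as an alias of a current gene symbol
by_alias <- safe_map(unmatched, "ALIAS")

# Pass 2: for what is left, strip "LOC" and try the number as an Entrez Gene ID
remaining <- setdiff(unmatched, names(by_alias))
loc_ids   <- grep("^LOC[0-9]+$", remaining, value = TRUE)
by_entrez <- safe_map(sub("^LOC", "", loc_ids), "ENTREZID")

# IDs that neither pass could place
leftover <- setdiff(remaining, paste0("LOC", names(by_entrez)))
```

Anything left in `leftover` would need a manual lookup (biomaRt, UniProt), or can simply be dropped in favour of the genes shared by both arrays.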

 
