Question: Inconsistent Annotations in illuminaHumanv3.db
3.2 years ago by
United Kingdom
Peter Humburg30 wrote:

I'm using the illuminaHumanv3.db package to obtain updated probe annotations as part of the analysis pipeline for a project I'm currently working on. While examining the results of this analysis I noticed an inconsistency in the annotations.

Consider the following:

library(illuminaHumanv3.db)
annot <- illuminaHumanv3fullReannotation()
dplyr::select(dplyr::filter(annot, SymbolReannotated == "HSPA1A"), IlluminaID:NuID, EntrezReannotated, SymbolReannotated)

This produces the following output:

IlluminaID    ArrayAddress  NuID                EntrezReannotated  SymbolReannotated
ILMN_1789074  6380717       oon0If5P1yz97_0vdA  3303               HSPA1A
ILMN_1660436  3850433       Tiuh76h0KH_ee.1ztM  3304               HSPA1A

As you can see, these two probes are annotated with the same gene symbol but different Entrez IDs. As far as I can tell, the gene symbol associated with Entrez 3304 is actually HSPA1B (see here). The Entrez IDs appear to be consistent with the provided probe locations (chr6:31785490:31785539:+ and chr6:31797684:31797733:+), suggesting that the symbol is incorrect for the second of these two probes.

I haven't checked systematically for other inconsistencies, but this seems a bit concerning to me. Or am I missing something obvious here?
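
A systematic check for this kind of inconsistency could look roughly like the sketch below. It flags every gene symbol that is attached to more than one Entrez ID. The helper name `find_ambiguous_symbols` is made up for illustration, and the example data frame is just the two probes quoted above typed in by hand. The sketch assumes one symbol and one Entrez ID per row, which may not hold for every probe in the full table.

```r
# Flag gene symbols that are attached to more than one Entrez ID.
# `annot` is any data frame with the columns SymbolReannotated and
# EntrezReannotated (hypothetical helper, not part of illuminaHumanv3.db).
find_ambiguous_symbols <- function(annot) {
  # Distinct symbol/Entrez pairs, ignoring probes without an annotation
  pairs <- unique(annot[, c("SymbolReannotated", "EntrezReannotated")])
  keep  <- !is.na(pairs$SymbolReannotated) & !is.na(pairs$EntrezReannotated)
  pairs <- pairs[keep, ]

  # A symbol appearing in more than one distinct pair maps to several Entrez IDs
  n_ids     <- table(pairs$SymbolReannotated)
  ambiguous <- names(n_ids)[n_ids > 1]

  annot[annot$SymbolReannotated %in% ambiguous, ]
}

# The two probes from the output above, entered by hand for illustration
example <- data.frame(
  IlluminaID        = c("ILMN_1789074", "ILMN_1660436"),
  EntrezReannotated = c("3303", "3304"),
  SymbolReannotated = c("HSPA1A", "HSPA1A"),
  stringsAsFactors  = FALSE
)
flagged <- find_ambiguous_symbols(example)
```

Running `find_ambiguous_symbols(annot)` on the full table should then list every probe whose symbol is shared between several Entrez IDs. Some of those hits may turn out to be legitimate, so each one would still need a manual look.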

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.36.1         XVector_0.8.0             illuminaHumanv3.db_1.26.0        RSQLite_1.0.0            
 [6] DBI_0.3.1                 AnnotationDbi_1.30.1      GenomeInfoDb_1.4.1        IRanges_2.2.5             S4Vectors_0.6.2          
[11] Biobase_2.28.0            BiocGenerics_0.14.0      

loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 tools_3.2.1 


modified 3.1 years ago • written 3.2 years ago by Peter Humburg30

I agree with your findings. The code below returns the same result, with different Ensembl IDs. What project or technology are you working with? Illumina microarrays?

annot[grep("HSPA1A$", annot$SymbolReannotated),]
written 3.2 years ago by andrew.j.skelton73310

Yes, I'm working with Illumina expression arrays for this project (specifically Illumina HumanHT12v3 arrays, hence the use of the annotation package). I relied on the annotations to link probes to genes (and to remove probes that are likely unreliable). Clearly the inconsistencies suggest that this may be a bad idea. Unfortunately it is unclear to me what caused this inconsistency and, consequently, which parts of the annotations may still be usable. I have now used the probe coordinates provided by illuminaHumanv3.db and mapped those to gene symbols via TxDb.Hsapiens.UCSC.hg19.knownGene and [...]. Assuming that the probe coordinates are reliable, that is fine for my immediate needs, but it doesn't really resolve the issue that the illuminaHumanv3.db annotations just aren't right.
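
As a rough sketch of the first step of that coordinate-based workaround, the probe location strings can be split into separate columns in base R. The helper name `parse_probe_coords` is hypothetical, and the sketch assumes every string holds exactly one location in the `chr:start:end:strand` form quoted in the question; entries with several locations would need extra handling.

```r
# Split "chr:start:end:strand" strings into a data frame
# (hypothetical helper, assumes one location per string).
parse_probe_coords <- function(coords) {
  parts <- strsplit(coords, ":", fixed = TRUE)

  # Stop early on anything that does not have exactly four fields
  ok <- lengths(parts) == 4L
  if (!all(ok)) {
    stop("unexpected coordinate format: ", paste(coords[!ok], collapse = ", "))
  }

  data.frame(
    chr    = vapply(parts, `[`, character(1), 1L),
    start  = as.integer(vapply(parts, `[`, character(1), 2L)),
    end    = as.integer(vapply(parts, `[`, character(1), 3L)),
    strand = vapply(parts, `[`, character(1), 4L),
    stringsAsFactors = FALSE
  )
}

# The two probe locations quoted in the question
probe_loc <- parse_probe_coords(c("chr6:31785490:31785539:+",
                                  "chr6:31797684:31797733:+"))
```

A data frame in this shape could then be turned into genomic ranges, for example with `GenomicRanges::makeGRangesFromDataFrame()`, and overlapped with the gene models from the TxDb package. That part is left out here because it depends on the installed Bioconductor packages.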

written 3.1 years ago by Peter Humburg30

Have you thought about using lumiHumanAll.db instead? This code chunk should annotate your microarray, assuming that the rownames of your object are probe IDs. 

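library(lumi)             # probeID2nuID()
library(annotate)         # getSYMBOL(), lookUp()
library(lumiHumanAll.db)
probe_list <- rownames(normalised_data)
nuIDs      <- probeID2nuID(probe_list)[, "nuID"]
symbol     <- getSYMBOL(nuIDs, "lumiHumanAll.db")
name       <- unlist(lookUp(nuIDs, "lumiHumanAll.db", "GENENAME"))
# The original call was cut off after ID=nuIDs; the remaining columns are an assumption
anno_df    <- data.frame(ID = nuIDs, Symbol = symbol, Name = name)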
written 3.1 years ago by andrew.j.skelton73310

That is certainly an option for getting the gene symbols, thanks for reminding me. It is also easier than going the TxDb route. I have used lumiHumanAll.db in the past but moved to the illuminaHuman packages because they provide information about overlapping SNPs and a general probe quality assessment, both of which I have found quite useful.

written 3.1 years ago by Peter Humburg30

I agree, they provide a lot more useful information, but there are obviously some bugs that need working out!

written 3.1 years ago by andrew.j.skelton73310