Question: Inconsistent Annotations in illuminaHumanv3.db
3.2 years ago by
United Kingdom
Peter Humburg30 wrote:

I'm using the illuminaHumanv3.db package to obtain updated probe annotations as part of the analysis pipeline for a project I'm currently working on. While examining the results of this analysis, I noticed an inconsistency in the annotations.

Consider the following:

library(illuminaHumanv3.db)
annot <- illuminaHumanv3fullReannotation()
dplyr::select(dplyr::filter(annot, SymbolReannotated == "HSPA1A"), IlluminaID:NuID, EntrezReannotated, SymbolReannotated)


This produces the following output:

  IlluminaID ArrayAddress               NuID EntrezReannotated SymbolReannotated
ILMN_1789074      6380717 oon0If5P1yz97_0vdA              3303            HSPA1A
ILMN_1660436      3850433 Tiuh76h0KH_ee.1ztM              3304            HSPA1A

As you can see, these two probes are annotated with the same gene symbol but different Entrez IDs. As far as I can tell, the gene symbol associated with Entrez 3304 is actually HSPA1B (see here: http://www.ncbi.nlm.nih.gov/gene/?term=3304%5Buid%5D). The Entrez IDs appear to be consistent with the provided probe locations (chr6:31785490:31785539:+ and chr6:31797684:31797733:+), suggesting that the symbol is incorrect for the second of these two probes.

I haven't checked systematically for other inconsistencies, but this seems a bit concerning to me. Or am I missing something obvious here?
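A systematic check could look roughly like the sketch below. It is only an illustration: `find_ambiguous_symbols` is a hypothetical helper, not part of illuminaHumanv3.db, and it assumes the column names shown in the output above, with one Entrez ID per row. The toy data frame contains just the two probes from that output.

```r
# Hypothetical helper: list gene symbols that map to more than one Entrez ID.
# Assumes the columns SymbolReannotated and EntrezReannotated, as in the
# output of illuminaHumanv3fullReannotation() shown above.
find_ambiguous_symbols <- function(annot) {
  # Keep only rows where both the symbol and the Entrez ID are present
  ok <- !is.na(annot$SymbolReannotated) & !is.na(annot$EntrezReannotated)
  pairs <- unique(annot[ok, c("SymbolReannotated", "EntrezReannotated")])
  # Count distinct Entrez IDs per symbol and keep symbols with more than one
  n_ids <- table(pairs$SymbolReannotated)
  ambiguous <- names(n_ids)[n_ids > 1]
  pairs[pairs$SymbolReannotated %in% ambiguous, , drop = FALSE]
}

# Toy input built from the two probes in the output above
toy <- data.frame(
  IlluminaID        = c("ILMN_1789074", "ILMN_1660436"),
  EntrezReannotated = c("3303", "3304"),
  SymbolReannotated = c("HSPA1A", "HSPA1A"),
  stringsAsFactors  = FALSE
)
ambiguous_toy <- find_ambiguous_symbols(toy)
```

On the real table this would be called as `find_ambiguous_symbols(illuminaHumanv3fullReannotation())`. Rows with empty strings or several IDs in one field, if they occur, would need extra handling.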

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.36.1         XVector_0.8.0             illuminaHumanv3.db_1.26.0 org.Hs.eg.db_3.1.2        RSQLite_1.0.0
[6] DBI_0.3.1                 AnnotationDbi_1.30.1      GenomeInfoDb_1.4.1        IRanges_2.2.5             S4Vectors_0.6.2
[11] Biobase_2.28.0            BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 tools_3.2.1 

modified 3.1 years ago • written 3.2 years ago by Peter Humburg30

I agree with your findings. The code below returns the same result, with different Ensembl IDs. Which project or technology are you working with? Illumina microarrays?

annot[grep("HSPA1A$", annot$SymbolReannotated),]

Yes, I'm working with Illumina expression arrays for this project (specifically Illumina HumanHT12v3 arrays, hence the use of the annotation package). I relied on the annotations to link probes to genes (and to remove likely unreliable ones). Clearly the inconsistencies suggest that this may be a bad idea. Unfortunately, it is unclear to me what caused this inconsistency and, consequently, which parts of the annotations may still be usable. I have now used the probe coordinates provided by illuminaHumanv3.db and mapped those to gene symbols via TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db. Assuming that the probe coordinates are reliable, that is fine for my immediate needs, but it doesn't resolve the issue that the illuminaHumanv3.db annotations just aren't right.
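The coordinate-based mapping described above could be sketched roughly as follows. This is not the exact code used for the project. Both `parse_probe_location` and `probes_to_symbols` are hypothetical helpers, and the sketch assumes each location string holds a single `chr:start:end:strand` entry like the two quoted in the question (strings with several locations would need splitting first). The second function needs the Bioconductor packages it names and only touches them when called.

```r
# Hypothetical helper: split "chr:start:end:strand" strings into columns.
parse_probe_location <- function(loc) {
  parts <- strsplit(loc, ":", fixed = TRUE)
  data.frame(
    chr    = vapply(parts, `[`, character(1), 1),
    start  = as.integer(vapply(parts, `[`, character(1), 2)),
    end    = as.integer(vapply(parts, `[`, character(1), 3)),
    strand = vapply(parts, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: map probe locations to Entrez IDs and gene symbols.
# Requires GenomicRanges, GenomicFeatures, TxDb.Hsapiens.UCSC.hg19.knownGene
# and org.Hs.eg.db to be installed.
probes_to_symbols <- function(loc) {
  p <- parse_probe_location(loc)
  probes <- GenomicRanges::GRanges(p$chr, IRanges::IRanges(p$start, p$end),
                                   strand = p$strand)
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
  gene_ranges <- GenomicFeatures::genes(txdb)
  # Overlaps are strand-aware by default (ignore.strand = FALSE)
  hits <- IRanges::findOverlaps(probes, gene_ranges)
  entrez <- S4Vectors::mcols(gene_ranges)$gene_id[S4Vectors::subjectHits(hits)]
  if (length(entrez) == 0) {
    return(data.frame(location = character(0), entrez = character(0),
                      symbol = character(0), stringsAsFactors = FALSE))
  }
  # Translate Entrez IDs to gene symbols
  symbol <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys = entrez,
                                  column = "SYMBOL", keytype = "ENTREZID")
  data.frame(location = loc[S4Vectors::queryHits(hits)], entrez = entrez,
             symbol = unname(symbol), stringsAsFactors = FALSE)
}

# The two probe locations quoted in the question
locations <- c("chr6:31785490:31785539:+", "chr6:31797684:31797733:+")
parsed <- parse_probe_location(locations)
```

With the packages installed, `probes_to_symbols(locations)` would return one row per probe/gene overlap. Probes that overlap no gene are dropped, and probes overlapping several genes appear more than once.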


Have you thought about using lumiHumanAll.db instead? This code chunk should annotate your microarray, assuming that the rownames of your object are probe IDs.

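library(lumi)             # probeID2nuID()
library(annotate)         # getSYMBOL(), lookUp()
library(lumiHumanAll.db)  # nuID-based annotation

# Probe IDs are assumed to be the row names of the expression object
probe_list <- rownames(normalised_data)
# Convert probe IDs to nuIDs
nuIDs      <- probeID2nuID(probe_list)[, "nuID"]
# Look up gene symbols and gene names by nuID
symbol     <- getSYMBOL(nuIDs, "lumiHumanAll.db")
name       <- unlist(lookUp(nuIDs, "lumiHumanAll.db", "GENENAME"))
anno_df    <- data.frame(ID = nuIDs, probe_list, symbol, name)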

That is an option for getting the gene symbols, thanks for reminding me. It is certainly easier than going the TxDb route. I have used lumiHumanAll.db in the past but moved to the illuminaHuman packages because they provide information about overlapping SNPs and a general probe quality assessment, both of which I have found quite useful.