I'm using the illuminaHumanv3.db package to obtain updated probe annotations as part of the analysis pipeline for a project I'm currently working on. While examining the results of this analysis, I noticed an inconsistency in the annotations.
Consider the following:
library(illuminaHumanv3.db)
annot <- illuminaHumanv3fullReannotation()
dplyr::select(dplyr::filter(annot, SymbolReannotated == "HSPA1A"),
              IlluminaID:NuID, EntrezReannotated, SymbolReannotated)
This produces the following output:
IlluminaID | ArrayAddress | NuID | EntrezReannotated | SymbolReannotated |
ILMN_1789074 | 6380717 | oon0If5P1yz97_0vdA | 3303 | HSPA1A |
ILMN_1660436 | 3850433 | Tiuh76h0KH_ee.1ztM | 3304 | HSPA1A |
As you can see these two probes are annotated with the same gene symbol but different Entrez IDs. As far as I can tell the gene symbol associated with Entrez 3304 is actually HSPA1B (see here: http://www.ncbi.nlm.nih.gov/gene/?term=3304%5Buid%5D). The Entrez IDs appear to be consistent with the provided probe locations (chr6:31785490:31785539:+ and chr6:31797684:31797733:+), suggesting that the symbol is incorrect for the second of these two probes.
I haven't checked systematically for other inconsistencies, but this seems a bit concerning to me. Or am I missing something obvious here?
sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
 [1] Biostrings_2.36.1 XVector_0.8.0 illuminaHumanv3.db_1.26.0 org.Hs.eg.db_3.1.2 RSQLite_1.0.0
 [6] DBI_0.3.1 AnnotationDbi_1.30.1 GenomeInfoDb_1.4.1 IRanges_2.2.5 S4Vectors_0.6.2
[11] Biobase_2.28.0 BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 tools_3.2.1
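A systematic check for other mismatches of this kind could be sketched roughly as follows. This is only an illustration, not part of the package: the helper name find_symbol_mismatches and the lookup vector are hypothetical, and it assumes that EntrezReannotated holds a single Entrez ID per probe (NA or multi-valued entries are simply skipped). The toy data are the two probes shown above; the last part only runs if illuminaHumanv3.db and org.Hs.eg.db are installed.

```r
# Hypothetical helper: flag probes whose reannotated symbol differs from the
# symbol that an Entrez -> symbol lookup gives for the reannotated Entrez ID.
find_symbol_mismatches <- function(annot, entrez_to_symbol) {
  # annot: data frame with EntrezReannotated and SymbolReannotated columns
  # entrez_to_symbol: named character vector (names are Entrez IDs)
  entrez   <- as.character(annot$EntrezReannotated)
  expected <- unname(entrez_to_symbol[entrez])  # NA when the ID is not found
  observed <- as.character(annot$SymbolReannotated)
  mismatch <- !is.na(expected) & !is.na(observed) & expected != observed
  out <- annot[mismatch, , drop = FALSE]
  out$ExpectedSymbol <- expected[mismatch]
  out
}

# Toy input: the two probes from the output above.
toy <- data.frame(
  IlluminaID        = c("ILMN_1789074", "ILMN_1660436"),
  EntrezReannotated = c("3303", "3304"),
  SymbolReannotated = c("HSPA1A", "HSPA1A"),
  stringsAsFactors  = FALSE
)
hsp_map <- c("3303" = "HSPA1A", "3304" = "HSPA1B")
flagged <- find_symbol_mismatches(toy, hsp_map)
print(flagged)

# With the real annotation (only runs if both packages are installed).
if (requireNamespace("illuminaHumanv3.db", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  annot <- illuminaHumanv3.db::illuminaHumanv3fullReannotation()
  ids <- unique(na.omit(as.character(annot$EntrezReannotated)))
  entrez_map <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                                      keys = ids, column = "SYMBOL",
                                      keytype = "ENTREZID",
                                      multiVals = "first")
  print(head(find_symbol_mismatches(annot, entrez_map)))
}
```

Some reported mismatches may simply reflect symbols that were renamed between annotation releases, so the output would still need a manual look.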
I agree with your findings. The code below returns the same result, with different Ensembl IDs. What project or technology are you working with? Illumina microarrays?
Yes, I'm working with Illumina expression arrays for this project (specifically, these are Illumina HumanHT12v3 arrays, hence the use of the annotation package). I relied on the annotations to link probes to genes (and to remove likely unreliable probes). The inconsistencies clearly suggest that this may be a bad idea. Unfortunately, it is unclear to me what caused the inconsistency and, consequently, which parts of the annotations may still be usable. I have now used the probe coordinates provided by illuminaHumanv3.db and mapped those to gene symbols via TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db. Assuming that the probe coordinates are reliable, that is fine for my immediate needs, but it doesn't really resolve the issue that the illuminaHumanv3.db annotations just aren't right.
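For reference, the coordinate-based mapping described above might look something like the sketch below. It is not the exact code used here. The helper parse_probe_location is hypothetical, it assumes locations are single segments in the chr:start:end:strand form quoted earlier (anything else is dropped), and it assumes the coordinates are on hg19. The overlap step only runs if GenomicRanges, TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db are installed.

```r
# Hypothetical helper: split "chr:start:end:strand" strings into columns.
# Entries that are NA or do not match this simple form are dropped.
parse_probe_location <- function(loc) {
  loc <- as.character(loc)
  ok <- !is.na(loc) & grepl("^[^:,]+:[0-9]+:[0-9]+:[+-]$", loc)
  parts <- strsplit(loc[ok], ":", fixed = TRUE)
  data.frame(
    location = loc[ok],
    chr      = vapply(parts, `[`, character(1), 1),
    start    = as.integer(vapply(parts, `[`, character(1), 2)),
    end      = as.integer(vapply(parts, `[`, character(1), 3)),
    strand   = vapply(parts, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

# The two probe locations quoted in the original question.
coords <- parse_probe_location(c("chr6:31785490:31785539:+",
                                 "chr6:31797684:31797733:+"))
print(coords)

# Overlap the probes with known genes (only runs if the packages are installed).
if (requireNamespace("GenomicRanges", quietly = TRUE) &&
    requireNamespace("TxDb.Hsapiens.UCSC.hg19.knownGene", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  suppressPackageStartupMessages({
    library(GenomicRanges)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(org.Hs.eg.db)
  })
  probes <- GRanges(coords$chr, IRanges(coords$start, coords$end),
                    strand = coords$strand)
  gene_ranges <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)  # gene_id = Entrez ID
  hits <- findOverlaps(probes, gene_ranges)                # strand-aware
  if (length(hits) > 0) {
    entrez <- gene_ranges$gene_id[subjectHits(hits)]
    symbols <- mapIds(org.Hs.eg.db, keys = unique(entrez),
                      column = "SYMBOL", keytype = "ENTREZID")
    print(data.frame(location = coords$location[queryHits(hits)],
                     entrez   = entrez,
                     symbol   = unname(symbols[entrez]),
                     stringsAsFactors = FALSE))
  }
}
```

A probe that overlaps more than one gene gets one row per gene, so the result may need deduplicating before it is joined back to the expression data.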
Have you thought about using lumiHumanAll.db instead? This code chunk should annotate your microarray, assuming that the rownames of your object are probe IDs.
That certainly is an option for getting the gene symbols, thanks for reminding me. It is also easier than going the TxDb route. I have used lumiHumanAll.db in the past but moved to the illuminaHuman packages because they provide information about overlapping SNPs and a general probe quality assessment, both of which I have found quite useful.
I agree, they provide a lot more useful information, but there are obviously some bugs that need working out!