Question

Inconsistent Annotations in illuminaHumanv3.db

0

Peter Humburg ▴ 30

@peter-humburg-4272

Last seen 10.5 years ago

United Kingdom

I'm using the illuminaHumanv3.db package to obtain updated probe annotations as part of the analysis pipeline for a project I'm currently working on. While examining the results of this analysis, I noticed an inconsistency in the annotations.

Consider the following:

library(illuminaHumanv3.db)
annot <- illuminaHumanv3fullReannotation()
dplyr::select(dplyr::filter(annot, SymbolReannotated == "HSPA1A"),
              IlluminaID:NuID, EntrezReannotated, SymbolReannotated)

This produces the following output:

IlluminaID	ArrayAddress	NuID	EntrezReannotated	SymbolReannotated
ILMN_1789074	6380717	oon0If5P1yz97_0vdA	3303	HSPA1A
ILMN_1660436	3850433	Tiuh76h0KH_ee.1ztM	3304	HSPA1A

As you can see, these two probes are annotated with the same gene symbol but different Entrez IDs. As far as I can tell, the gene symbol associated with Entrez ID 3304 is actually HSPA1B (see here: http://www.ncbi.nlm.nih.gov/gene/?term=3304%5Buid%5D). The Entrez IDs appear to be consistent with the provided probe locations (chr6:31785490:31785539:+ and chr6:31797684:31797733:+), which suggests that it is the symbol, not the Entrez ID, that is incorrect for the second of these two probes.

I haven't checked systematically for other inconsistencies, but this seems a bit concerning to me. Or am I missing something obvious here?
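A systematic check could be sketched roughly as follows: list every gene symbol that is paired with more than one distinct Entrez ID. This is only a minimal base-R sketch. The `annot` data frame here is a toy stand-in for the output of illuminaHumanv3fullReannotation(), containing the two probes quoted above plus one made-up row (`ILMN_0000000` / `TOYGENE` are hypothetical), and the names `pairs`, `n_ids`, `ambiguous_symbols` and `inconsistent` are purely illustrative.

```r
# Toy stand-in for the data frame returned by illuminaHumanv3fullReannotation().
# Rows 1-2 are the probes quoted in the question; row 3 is a made-up example.
annot <- data.frame(
  IlluminaID        = c("ILMN_1789074", "ILMN_1660436", "ILMN_0000000"),
  EntrezReannotated = c("3303", "3304", "1"),
  SymbolReannotated = c("HSPA1A", "HSPA1A", "TOYGENE"),
  stringsAsFactors  = FALSE
)

# Keep one row per distinct (symbol, Entrez ID) pair, ignoring unannotated probes.
annotated <- !is.na(annot$SymbolReannotated) & !is.na(annot$EntrezReannotated)
pairs <- unique(annot[annotated, c("SymbolReannotated", "EntrezReannotated")])

# Count the distinct Entrez IDs per symbol and keep symbols with more than one.
n_ids <- table(pairs$SymbolReannotated)
ambiguous_symbols <- names(n_ids)[n_ids > 1]

# All probes whose symbol is paired with more than one Entrez ID.
inconsistent <- annot[annot$SymbolReannotated %in% ambiguous_symbols, ]
print(inconsistent)
```

Run against the real annotation table instead of the toy data, this would only flag candidates: each flagged symbol would still need to be checked by hand against its Entrez records.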

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.36.1         XVector_0.8.0             illuminaHumanv3.db_1.26.0 org.Hs.eg.db_3.1.2        RSQLite_1.0.0            
 [6] DBI_0.3.1                 AnnotationDbi_1.30.1      GenomeInfoDb_1.4.1        IRanges_2.2.5             S4Vectors_0.6.2          
[11] Biobase_2.28.0            BiocGenerics_0.14.0      

loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 tools_3.2.1

illuminahumanv3 annotation microarray • 2.5k views

ADD COMMENT • link 10.5 years ago Peter Humburg ▴ 30

0

I agree with your findings. The code below returns the same result, with different Ensembl IDs. What project or technology are you working with? Illumina microarrays?

annot[grep("HSPA1A$", annot$SymbolReannotated),]

ADD REPLY • link 10.5 years ago andrew.j.skelton73 ▴ 370

0

Yes, I'm working with Illumina expression arrays for this project (specifically Illumina HumanHT12v3 arrays, hence the use of the annotation package). I relied on the annotations to link probes to genes (and to remove likely unreliable probes). Clearly the inconsistencies suggest that this may be a bad idea. Unfortunately, it is unclear to me what caused this inconsistency and, consequently, which parts of the annotations may still be usable. I have now taken the probe coordinates provided by illuminaHumanv3.db and mapped those to gene symbols via TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db. Assuming that the probe coordinates are reliable, that is fine for my immediate needs, but it doesn't resolve the underlying issue that the illuminaHumanv3.db annotations just aren't right.
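The coordinate-based mapping described above might look something like the sketch below. It is not the code actually used in this thread. It assumes the probe coordinates are on hg19 (the build of the TxDb package named above), uses the two probe locations quoted in the question, and matches probes against exons grouped by gene; `probes`, `exons_by_gene` and `probe_genes` are illustrative names.

```r
library(GenomicRanges)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)

# Probe locations as quoted in the question (assumed to be hg19 coordinates).
probes <- GRanges(
  seqnames   = c("chr6", "chr6"),
  ranges     = IRanges(start = c(31785490, 31797684),
                       end   = c(31785539, 31797733)),
  strand     = c("+", "+"),
  IlluminaID = c("ILMN_1789074", "ILMN_1660436")
)

# Exons grouped by gene; the list names are Entrez gene IDs.
exons_by_gene <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")

# Strand-aware overlap between probes and the exons of each gene.
hits   <- findOverlaps(probes, exons_by_gene)
entrez <- names(exons_by_gene)[subjectHits(hits)]

# Translate Entrez IDs to gene symbols.
symbols <- mapIds(org.Hs.eg.db, keys = entrez,
                  column = "SYMBOL", keytype = "ENTREZID")

# One row per probe/gene overlap; probes without any overlap are dropped.
probe_genes <- data.frame(
  IlluminaID = probes$IlluminaID[queryHits(hits)],
  Entrez     = entrez,
  Symbol     = unname(symbols),
  stringsAsFactors = FALSE
)
print(probe_genes)
```

Grouping exons by gene rather than calling genes() is a deliberate choice in this sketch: by default genes() keeps only genes that map to a single sequence and strand, which can silently drop genes in regions with alternate haplotype sequences.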

ADD REPLY • link 10.5 years ago Peter Humburg ▴ 30

1

Have you thought about using lumiHumanAll.db instead? This code chunk should annotate your microarray, assuming that the rownames of your object are probe IDs.

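library(lumi)             # probeID2nuID()
library(annotate)         # getSYMBOL(), lookUp()
library(lumiHumanAll.db)

# Probe IDs are assumed to be the row names of the expression object
probe_list <- rownames(normalised_data)
nuIDs      <- probeID2nuID(probe_list)[, "nuID"]
symbol     <- getSYMBOL(nuIDs, "lumiHumanAll.db")
name       <- unlist(lookUp(nuIDs, "lumiHumanAll.db", "GENENAME"))
anno_df    <- data.frame(ID = nuIDs,
                         probe_list,
                         symbol,
                         name)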

ADD REPLY • link 10.5 years ago andrew.j.skelton73 ▴ 370

0

That is an option for getting the gene symbols, thanks for reminding me. It is certainly easier than going the TxDb route. I have used lumiHumanAll.db in the past but moved to the illuminaHuman packages because they provide information about overlapping SNPs and a general probe quality assessment, both of which I have found quite useful.

ADD REPLY • link 10.5 years ago Peter Humburg ▴ 30

0

I agree, they provide a lot more useful information, but there are obviously some bugs that need working out!

ADD REPLY • link 10.5 years ago andrew.j.skelton73 ▴ 370