inconsistency in illuminaHumanv4.db?
@perry-moerland-1109
Bioinformatics Laboratory, Academic Med…
Dear all, dear Mark,

I'm a grateful user of the illuminaHumanv4.db annotation package. One of my collaborators is interested in probes mapping to C1orf151 according to the reannotation provided by the package. However, the re-annotation for these probes seems inconsistent:

    > Illids = get("C1orf151",revmap(illuminaHumanv4SYMBOLREANNOTATED))
    > Illids
    [1] "ILMN_2064311" "ILMN_1657860" "ILMN_1789599" "ILMN_2405009"
    > indx = match(Illids,illuminaHumanv4fullReannotation()[,1])
    > tab = illuminaHumanv4fullReannotation()[indx,]
    > tab[,c(1,4,11:13,16)]
            IlluminaID ProbeQuality EntrezReannotated          GenomicLocation SymbolReannotated EnsemblReannotated
    4615  ILMN_2064311          Bad            440574 chr1:19954844:19954893:+          C1orf151    ENSG00000173436
    24195 ILMN_1657860      Perfect            440574 chr1:19954399:19954448:+          C1orf151    ENSG00000173436
    39363 ILMN_1789599      Perfect            440574 chr1:19984747:19984796:+          C1orf151    ENSG00000158747
    46631 ILMN_2405009      Perfect            440574 chr1:19984595:19984644:+          C1orf151    ENSG00000158747

As you can see, two probes map to ENSG00000173436 and the other two probes to ENSG00000158747. This is in agreement with their annotation on the Ensembl website. The reannotated Entrez Gene ID and the reannotated symbol, however, seem inconsistent with this. According to the Ensembl website and according to org.Hs.eg.db, the annotation of the two ENSG IDs is:

    > IDs = unlist(mget(tab$EnsemblReannotated,org.Hs.egENSEMBL2EG))
    > IDs
     ENSG00000173436  ENSG00000173436 ENSG000001587471 ENSG000001587472 ENSG000001587471 ENSG000001587472
            "440574"         "440574"           "4681"      "100532736"           "4681"      "100532736"
    > unlist(mget(IDs,org.Hs.egSYMBOL))
           440574        440574          4681     100532736          4681     100532736
         "MINOS1"      "MINOS1"        "NBL1" "MINOS1-NBL1"        "NBL1" "MINOS1-NBL1"

Note that C1orf151 is an alias for MINOS1, and that MINOS1 and NBL1 are neighboring genes on chromosome 1; MINOS1-NBL1 is the readthrough transcript.

How come illuminaHumanv4.db links all 4 probes to a single Entrez Gene ID (440574) and a single symbol (C1orf151)?
The more general question is probably how identifier conversion is performed for the re-annotation. I tried to find a description in the package documentation and in Barbosa-Morais et al. (2010), but without success.

best wishes,
Perry

---
Perry Moerland, PhD
Room J1B-215
Bioinformatics Laboratory, Department of Clinical Epidemiology, Biostatistics and Bioinformatics
Academic Medical Center, University of Amsterdam
Postbus 22660, 1100 DD Amsterdam, The Netherlands
tel: +31 20 5666945
p.d.moerland@amc.uva.nl, http://www.bioinformaticslaboratory.nl/

    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: i386-w64-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
    [4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] illuminaHumanv4.db_1.20.0 org.Hs.eg.db_2.10.1       RSQLite_0.11.4            DBI_0.2-7
    [5] AnnotationDbi_1.24.0      Biobase_2.22.0            BiocGenerics_0.8.0

    loaded via a namespace (and not attached):
    [1] AnnotationForge_1.4.0 IRanges_1.20.4        stats4_3.0.2
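The comparison made by hand in the question can be turned into a small systematic check. The following is only a minimal sketch in base R: it loads no annotation package, the values are hand-copied from the output quoted in the post, and `ens2eg` is a hypothetical stand-in for what `mget(..., org.Hs.egENSEMBL2EG)` returned there.

```r
# Values hand-copied from the post; nothing is queried from a package here.
tab <- data.frame(
  IlluminaID         = c("ILMN_2064311", "ILMN_1657860",
                         "ILMN_1789599", "ILMN_2405009"),
  EntrezReannotated  = c("440574", "440574", "440574", "440574"),
  EnsemblReannotated = c("ENSG00000173436", "ENSG00000173436",
                         "ENSG00000158747", "ENSG00000158747"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for org.Hs.egENSEMBL2EG:
# Ensembl gene ID -> Entrez Gene IDs, as reported in the post.
ens2eg <- list(
  ENSG00000173436 = "440574",
  ENSG00000158747 = c("4681", "100532736")
)

# A probe counts as consistent when its reannotated Entrez ID is among the
# Entrez IDs reached through its reannotated Ensembl gene ID.
tab$consistent <- unname(mapply(
  function(entrez, ensembl) entrez %in% ens2eg[[ensembl]],
  tab$EntrezReannotated, tab$EnsemblReannotated
))

inconsistent <- tab$IlluminaID[!tab$consistent]
print(inconsistent)
```

With the real packages loaded, the same logic could presumably be applied to the full table returned by `illuminaHumanv4fullReannotation()`, with the hand-written list replaced by the org.Hs.eg.db mapping.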
Mark Dunning ★ 1.1k
@mark-dunning-3319
Sheffield, Uk
Hi Perry,

Sorry for the delay in responding. I should explain that the annotation packages we provide are built upon the results of an in-house Perl script (described in Barbosa-Morais et al.), where we map probes to the genome and transcriptome separately and collate the results. As you point out, the resources used are not well documented, so it took a while to get the relevant information from the people who actually run the script. We hope to improve on this for future releases.

As for your query, we were essentially using an old version of UniGene for cross-referencing. When the Perl script was last run, in September 2011, we used UniGene v230, which had an entry Hs.466662 with the gene symbol C1orf151 and Entrez Gene ID 440574. These get associated with Illumina probes through the sequence cross-references in the UniGene entry. So, for example, the top BLAST hit in RefSeq for the first probe, ILMN_2064311, was the transcript NM_001204083, which is one of the cross-references in the UniGene record. The second of the four probes matches the same transcript, while the other two match another RefSeq transcript, NM_001204089, that is also among the cross-reference sequences in the same UniGene record.

In the current version of UniGene (v236), the gene symbol for that same record is now MINOS1. It still contains the same RefSeq transcript links, so, assuming those still came up as the top BLAST hits for these probes, we would still end up with all four having the same Entrez Gene ID.

The Ensembl gene IDs come directly from the BLAST search against the Ensembl transcripts.

Mark

On Fri, Nov 15, 2013 at 8:43 PM, P.D. Moerland <p.d.moerland@amc.uva.nl> wrote:
> Dear all, dear Mark,
>
> I'm a grateful user of the illuminaHumanv4.db annotation package. One of
> my collaborators is interested in probes mapping to C1orf151 according to
> the reannotation provided by the package.
> [...]
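To make the mechanism described in the answer concrete, here is a minimal base R sketch of how a shared UniGene record collapses all four probes onto one Entrez Gene ID. It is an illustration only, not the in-house Perl script: the table layout and the names `top_hit`, `xref` and `unigene` are assumptions, and the values are the ones mentioned in the answer.

```r
# Top RefSeq BLAST hit per probe, as stated in the answer.
top_hit <- c(
  ILMN_2064311 = "NM_001204083",
  ILMN_1657860 = "NM_001204083",
  ILMN_1789599 = "NM_001204089",
  ILMN_2405009 = "NM_001204089"
)

# Hypothetical layout of the UniGene v230 information: one row per record,
# plus a table of RefSeq cross-references pointing at that record.
unigene <- data.frame(
  cluster = "Hs.466662",
  symbol  = "C1orf151",
  entrez  = "440574",
  stringsAsFactors = FALSE
)
xref <- data.frame(
  refseq  = c("NM_001204083", "NM_001204089"),
  cluster = c("Hs.466662", "Hs.466662"),
  stringsAsFactors = FALSE
)

# Probe -> top RefSeq hit -> UniGene record -> symbol and Entrez Gene ID.
cluster_of <- xref$cluster[match(top_hit, xref$refseq)]
idx <- match(cluster_of, unigene$cluster)

annot <- data.frame(
  IlluminaID = names(top_hit),
  RefSeq     = unname(top_hit),
  UniGene    = cluster_of,
  Symbol     = unigene$symbol[idx],
  Entrez     = unigene$entrez[idx],
  stringsAsFactors = FALSE
)
print(annot)
```

Because both RefSeq transcripts are cross-referenced by the same record, every probe inherits the same symbol and Entrez Gene ID, whereas the Ensembl gene IDs come from a separate BLAST search and can therefore differ between the probes.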