illuminaHumanv4 mappings

Mark Cowley ▴ 910

@mark-cowley-2951

Last seen 11.5 years ago

Dear list,

I've read the illuminaHumanv4.db.pdf, and it's not clear to me how the mappings are built. From the short package description, I thought the RefSeq IDs from the Illumina array manifest would be used, but according to the PDF manual, I think it's ACCNUM. However, we're not told where the ACCNUM is derived from (from ?illuminaHumanv4ACCNUM: "For chip packages such as this, the ACCNUM mapping comes directly from the manufacturer.").

I raise the question because the illuminaHumanv4SYMBOL table contains no probes for the CTNND1 gene, whereas according to the manifest file there are 5 probes that should map to that gene.

From the manifest:

$ grep -w CTNND1 HumanHT-12_V4_0_R2_15002873_B.txt | cut -f3,6,5,14
#Search_Key   ILMN_Gene  RefSeq_ID       Symbol
XM_943087.1   CTNND1     XM_943087.1     ILMN_1651944
XM_937008.1   CTNND1     XM_937008.1     ILMN_1807510
XM_943098.1   CTNND1     NM_001085458.1  ILMN_1696806
XM_943098.1   CTNND1     XM_943098.1     ILMN_1663159
NM_001331.1   CTNND1     NM_001331.1     ILMN_2293511

From the illuminaHumanv4.db package:

> require(illuminaHumanv4.db)
> ids <- c("ILMN_1651944", "ILMN_1807510", "ILMN_1696806", "ILMN_1663159", "ILMN_2293511")
> unlist(mget(ids, illuminaHumanv4SYMBOL))
ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
          NA           NA           NA           NA           NA
> unlist(mget(ids, illuminaHumanv4REFSEQ))
ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
          NA           NA           NA           NA           NA
# why are there no RefSeq IDs for these probes?

> mget(ids, illuminaHumanv4ACCNUM)
$ILMN_1651944
[1] NA

$ILMN_1807510
[1] NA

$ILMN_1696806
 [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
 [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
[11] "NM_001085468" "NM_001085469" "NM_001331"    "NR_037646"

$ILMN_1663159
[1] NA

$ILMN_2293511
 [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
 [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
[11] "NM_001085468" "NM_001085469" "NM_001331"    "NR_037646"

# all of these RefSeq IDs correspond to Entrez Gene ID 1500, CTNND1 catenin (cadherin-associated protein), delta 1 [ Homo sapiens ]
# why do 3 probes not have an ACCNUM?

If I BLAST all 5 probes, the 3 probes with NA in the ACCNUM (see above) all align to NG_029078.1 (=CTNND1), but not to NM_001331 (=CTNND1), and the 2 probes with lots of ACCNUM IDs align to both NG_029078.1 and NM_001331, amongst many others.

mget(ids, illuminaHumanv4PROBESEQUENCE)
>ILMN_1651944 -> NG_029078.1
GAAGGACCCTCCCCCGCTTCATAGTTTATGAATGCGAGAGTTGGTAAGGG
>ILMN_1807510 -> NG_029078.1
CGGTCATTCTCTGCCATCCCTAGAAAGAATGTCCAATCCACTGCCTTTGT
>ILMN_1696806 -> NG_029078.1, NM_001331, many others
GACCATCCCAAAAAGGAAGTGCACCTTGGAGCCTGTGGAGCTCTCAAGAA
>ILMN_1663159 -> NG_029078.1
GCCTATTCTTTAGCCTCCATTCCTATCTGTATTGCATACTGTAACTCCAA
>ILMN_2293511 -> NG_029078.1, NM_001331, many others
ATCCAGACTTTGGGTCGTGATTTCCGCAAGAATGGCAATGGGGGACCTGG

I'd really love to get to the bottom of this, as the R annotation packages are very rich, but missing IDs make it hard to know whether they're better than the manufacturer's manifest files.

cheers,
Mark

-----------------------------------------------------
Mark Cowley, PhD
Pancreatic Cancer Program | Peter Wills Bioinformatics Centre
Garvan Institute of Medical Research, Sydney, Australia
-----------------------------------------------------

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/C/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] graphics  datasets  grDevices utils     grid      stats     methods
[8] base

other attached packages:
 [1] illuminaHumanv4.db_1.10.0 org.Hs.eg.db_2.5.0
 [3] RSQLite_0.9-4             DBI_0.2-5
 [5] AnnotationDbi_1.14.1      limma_3.8.3
 [7] mjcdev_1.0                Cairo_1.4-9
 [9] metaGSEA_1.0.2            pwbc_1.0.3
[11] lumidat_1.0.1             lumi_2.4.0
[13] nleqslv_1.8.6             updateR_1.0.4
[15] roxygen_0.1-3             digest_0.5.0
[17] codetools_0.2-8           haselst_0.1
[19] blat_0.1                  genomics_0.1
[21] mjcbase_0.1               GEOquery_2.19.2
[23] cor_0.1                   xtable_1.5-6
[25] rgl_0.92.798              qvalue_1.26.0
[27] igraph_0.5.5-2            graph_1.30.0
[29] XML_3.4-2                 SparseM_0.89
[31] Biobase_2.12.2            sos_1.3-1
[33] brew_1.0-6                gplots_2.8.0
[35] caTools_1.12              bitops_1.0-4.1
[37] gdata_2.8.1               gtools_2.6.2

loaded via a namespace (and not attached):
 [1] affy_1.30.0           affyio_1.20.0         annotate_1.30.0
 [4] hdrcde_2.15           KernSmooth_2.23-6     lattice_0.19-30
 [7] MASS_7.3-13           Matrix_0.999375-50    methylumi_1.8.0
[10] mgcv_1.7-6            nlme_3.1-101          preprocessCore_1.14.0
[13] RCurl_1.6-7           tcltk_2.13.1          tools_2.13.1

Cancer Homo sapiens • 1.6k views

updated 14.4 years ago by Mark Dunning ★ 1.1k • written 14.4 years ago by Mark Cowley ▴ 910

Mark Dunning ★ 1.1k

@mark-dunning-3319

Last seen 11 months ago

Sheffield, Uk

Hi Mark,

Thanks for pointing out this issue, as it does deserve more clarification.

The RefSeq IDs used for the package do not come directly from the Illumina manifest file. Rather, we have taken the probe sequences and re-mapped them to the genome and transcriptome. The RefSeq IDs that we assign during this re-mapping are the basis for a set of standard mappings provided by the AnnotationDbi infrastructure. However, as far as I know, probes that map to multiple Entrez Gene IDs are automatically filtered out. You can use the toggleProbes function to change the usual mapping so that it returns all values.

> allEGs = toggleProbes(illuminaHumanv4ENTREZID, "all")
> mget(ids, allEGs)
$ILMN_1651944
[1] NA

$ILMN_1807510
[1] NA

$ILMN_1696806
[1] "100528016" "1500"

$ILMN_1663159
[1] NA

$ILMN_2293511
[1] "100528016" "1500"

So two of the probes *do* have mappings, but they do not get mapped to gene symbols because there is not one unique Entrez Gene ID.

Aside from the usual Bioconductor mappings, we have added other information collected during our re-annotation to the package. Of most interest here are the Probe Quality score and the Coding Zone.

> unlist(mget(ids, illuminaHumanv4PROBEQUALITY))
ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
       "Bad"   "No match"    "Perfect"        "Bad"    "Perfect"
> unlist(mget(ids, illuminaHumanv4CODINGZONE))
ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
  "Intronic"           NA      "5pUTR"   "Intronic"      "5pUTR"

So one probe doesn't match any part of the genome, two map to introns, and the other two map uniquely to a genomic location, but at the 5' end of a gene.

We did do our own mapping to gene symbol (independent of the mapping done by Bioconductor), which would correctly assign these probes to CTNND1. However, these mappings are not currently part of the released packages. We plan to include them in the next release, though.
Best wishes,
Mark

On Thu, Sep 22, 2011 at 10:58 AM, Mark Cowley <m.cowley at garvan.org.au> wrote:
> Dear list,
> I've read the illuminaHumanv4.db.pdf, and it's not clear to me how the mappings are built. [...]
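
The reply attributes the missing symbols to probes with more than one Entrez Gene ID being dropped from the default view. A minimal base-R sketch of that rule follows, using only the Entrez Gene IDs printed in the thread. The helper `keep_unique` and the status labels are hypothetical names chosen for illustration; this is an assumed reading of the filter as described in the reply, not the AnnotationDbi implementation, and it needs no Bioconductor packages to run.

```r
# Entrez Gene IDs per probe, copied from the mget(ids, allEGs) output in the
# thread (the "all" view obtained with toggleProbes).
all_egs <- list(
  ILMN_1651944 = NA_character_,
  ILMN_1807510 = NA_character_,
  ILMN_1696806 = c("100528016", "1500"),
  ILMN_1663159 = NA_character_,
  ILMN_2293511 = c("100528016", "1500")
)

# Hypothetical helper (not part of AnnotationDbi): keep a probe's ID only when
# exactly one non-missing Entrez Gene ID is present; otherwise return NA.
keep_unique <- function(egs) {
  egs <- egs[!is.na(egs)]
  if (length(egs) == 1L) egs else NA_character_
}

# Assumed default view: multi-mapped probes collapse to NA.
default_view <- vapply(all_egs, keep_unique, character(1))

# Record why each probe has no single gene: never mapped, or mapped to
# several genes and therefore filtered.
n_ids <- vapply(all_egs, function(x) sum(!is.na(x)), integer(1))
status <- ifelse(n_ids == 0L, "unmapped",
                 ifelse(n_ids == 1L, "unique", "multi-mapped"))

summary_table <- data.frame(
  probe = names(all_egs),
  n_entrez = n_ids,
  status = status,
  default = default_view,
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(summary_table)
```

Separating "unmapped" from "multi-mapped" makes it visible that an NA in the default mapping can have two different causes, which is the distinction the reply draws between the three probes with no match to RefSeq transcripts and the two probes that hit both 100528016 and 1500.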

written 14.4 years ago by Mark Dunning ★ 1.1k
