illuminaHumanv4 mappings



Mark Cowley ▴ 910

@mark-cowley-2951


Hi Mark,

Thanks for the detailed email, and a big thanks for going to the effort of the probe re-mapping -- something that's been on my todo list for far too long.

Can you please elaborate (or point me to a doc) on your probe mapping process? Transcript-to-gene redundancy is the big issue here, which CTNND1 suffers from. What are your thoughts on a best-guess strategy when there's ambiguity? If CTNND1 probes map to 1500 and 100528016, my vote is generally to choose the oldest record, since

1500 = CTNND1 catenin (cadherin-associated protein), delta 1 [ Homo sapiens ]
100528016 = TMX2-CTNND1 TMX2-CTNND1 readthrough (non-protein coding) [ Homo sapiens ]

... however, some additional coding that checks sequence identity among clashes may help resolve conflicts.

toggleProbes will provide much of the raw data, but then there's lots of downstream work to redo which the AnnotationDbi pipeline (and you) have already done.

cheers,
Mark

On 27/09/2011, at 11:05 PM, Mark Dunning wrote:

> Hi Mark,
>
> Thanks for pointing out this issue, as it does deserve more clarification. The RefSeq IDs used for the package do not come directly from the Illumina manifest file. Rather, we have taken the probe sequences and done a re-mapping to the genome and transcriptome. The RefSeq IDs that we assign during this re-mapping are the basis for a set of standard mappings provided by the AnnotationDbi infrastructure.
>
> However, as far as I know, probes that map to multiple EntrezIDs are automatically filtered out. You can use the toggleProbes function to change the usual mapping to return all values.
>
> > allEGs = toggleProbes(illuminaHumanv4ENTREZID, "all")
> > mget(ids, allEGs)
> $ILMN_1651944
> [1] NA
>
> $ILMN_1807510
> [1] NA
>
> $ILMN_1696806
> [1] "100528016" "1500"
>
> $ILMN_1663159
> [1] NA
>
> $ILMN_2293511
> [1] "100528016" "1500"
>
> So two of the probes *do* have mappings, but they do not get mapped to gene symbols because there is not one unique EntrezID.
>
> Aside from the usual Bioconductor mappings, we have added other information collected during our re-annotation to the package. Of most interest here are the Probe Quality score and Coding Zone.
>
> > unlist(mget(ids, illuminaHumanv4PROBEQUALITY))
> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>        "Bad"   "No match"    "Perfect"        "Bad"    "Perfect"
>
> > unlist(mget(ids, illuminaHumanv4CODINGZONE))
> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>   "Intronic"           NA      "5pUTR"   "Intronic"      "5pUTR"
>
> So one probe doesn't match any part of the genome, two map to introns, and the other two uniquely map to a genomic location, but at the 5' end of a gene. We did do our own mapping to Gene Symbol (independent of the mapping done by Bioconductor), which would correctly assign these probes to CTNND1. However, these mappings are not currently part of the released packages. We plan to include them in the next release though.
>
> Best wishes,
>
> Mark
>
> On Thu, Sep 22, 2011 at 10:58 AM, Mark Cowley <m.cowley@garvan.org.au> wrote:
>> Dear list,
>> I've read the illuminaHumanv4.db.pdf, and it's not clear to me how the mappings are built. From the short package description, I thought the RefSeq IDs from the Illumina array manifest would be used, but according to the pdf manual, I think it's ACCNUM, but we're not told from where the ACCNUM is derived (from ?illuminaHumanv4ACCNUM: "For chip packages such as this, the ACCNUM mapping comes directly from the manufacturer.").
>>
>> I raise the question, since within the illuminaHumanv4SYMBOL table, there are no probes for the CTNND1 gene, whereas according to the manifest file, there are 5 probes that should map to that gene:
>>
>> from the manifest:
>> $ grep -w CTNND1 HumanHT-12_V4_0_R2_15002873_B.txt | cut -f3,6,5,14
>> #Search_Key  ILMN_Gene  RefSeq_ID       Symbol
>> XM_943087.1  CTNND1     XM_943087.1     ILMN_1651944
>> XM_937008.1  CTNND1     XM_937008.1     ILMN_1807510
>> XM_943098.1  CTNND1     NM_001085458.1  ILMN_1696806
>> XM_943098.1  CTNND1     XM_943098.1     ILMN_1663159
>> NM_001331.1  CTNND1     NM_001331.1     ILMN_2293511
>>
>> # from the illuminaHumanv4.db package
>> require(illuminaHumanv4.db)
>> > ids <- c("ILMN_1651944", "ILMN_1807510", "ILMN_1696806", "ILMN_1663159", "ILMN_2293511")
>> > unlist(mget(ids, illuminaHumanv4SYMBOL))
>> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>>           NA           NA           NA           NA           NA
>> > unlist(mget(ids, illuminaHumanv4REFSEQ))
>> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>>           NA           NA           NA           NA           NA
>> # why are there no REFSEQ IDs for these probes?
>>
>> > mget(ids, illuminaHumanv4ACCNUM)
>> $ILMN_1651944
>> [1] NA
>> $ILMN_1807510
>> [1] NA
>> $ILMN_1696806
>>  [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
>>  [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
>> [11] "NM_001085468" "NM_001085469" "NM_001331"    "NR_037646"
>> $ILMN_1663159
>> [1] NA
>> $ILMN_2293511
>>  [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
>>  [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
>> [11] "NM_001085468" "NM_001085469" "NM_001331"    "NR_037646"
>>
>> # all of these RefSeq IDs correspond to Entrez Gene ID 1500, CTNND1 catenin (cadherin-associated protein), delta 1 [ Homo sapiens ]
>> # why do 3 probes not have an ACCNUM?
>>
>>
>> If I BLAST all 5 probes, the 3 probes with NA in the ACCNUM (see above) all align to NG_029078.1 (=CTNND1), but not to NM_001331 (=CTNND1), and the 2 probes with lots of ACCNUM IDs align to both NG_029078.1 and NM_001331 amongst many others.
>> mget(ids, illuminaHumanv4PROBESEQUENCE)
>> > ILMN_1651944 -> NG_029078.1
>> GAAGGACCCTCCCCCGCTTCATAGTTTATGAATGCGAGAGTTGGTAAGGG
>> > ILMN_1807510 -> NG_029078.1
>> CGGTCATTCTCTGCCATCCCTAGAAAGAATGTCCAATCCACTGCCTTTGT
>> > ILMN_1696806 -> NG_029078.1, NM_001331, many others
>> GACCATCCCAAAAAGGAAGTGCACCTTGGAGCCTGTGGAGCTCTCAAGAA
>> > ILMN_1663159 -> NG_029078.1
>> GCCTATTCTTTAGCCTCCATTCCTATCTGTATTGCATACTGTAACTCCAA
>> > ILMN_2293511 -> NG_029078.1, NM_001331, many others
>> ATCCAGACTTTGGGTCGTGATTTCCGCAAGAATGGCAATGGGGGACCTGG
>>
>>
>>
>> I'd really love to get to the bottom of this, as the R annotation packages are very rich, but missing IDs make it hard to know whether they're better than the manufacturer's manifest files.
>>
>> cheers,
>> Mark
>> -----------------------------------------------------
>> Mark Cowley, PhD
>>
>> Pancreatic Cancer Program | Peter Wills Bioinformatics Centre
>> Garvan Institute of Medical Research, Sydney, Australia
>> -----------------------------------------------------
>>
>>
>> > sessionInfo()
>> R version 2.13.1 (2011-07-08)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_AU.UTF-8/en_AU.UTF-8/C/C/en_AU.UTF-8/en_AU.UTF-8
>>
>> attached base packages:
>> [1] graphics  datasets  grDevices utils     grid      stats     methods
>> [8] base
>>
>> other attached packages:
>>  [1] illuminaHumanv4.db_1.10.0 org.Hs.eg.db_2.5.0
>>  [3] RSQLite_0.9-4             DBI_0.2-5
>>  [5] AnnotationDbi_1.14.1      limma_3.8.3
>>  [7] mjcdev_1.0                Cairo_1.4-9
>>  [9] metaGSEA_1.0.2            pwbc_1.0.3
>> [11] lumidat_1.0.1             lumi_2.4.0
>> [13] nleqslv_1.8.6             updateR_1.0.4
>> [15] roxygen_0.1-3             digest_0.5.0
>> [17] codetools_0.2-8           haselst_0.1
>> [19] blat_0.1                  genomics_0.1
>> [21] mjcbase_0.1               GEOquery_2.19.2
>> [23] cor_0.1                   xtable_1.5-6
>> [25] rgl_0.92.798              qvalue_1.26.0
>> [27] igraph_0.5.5-2            graph_1.30.0
>> [29] XML_3.4-2                 SparseM_0.89
>> [31] Biobase_2.12.2            sos_1.3-1
>> [33] brew_1.0-6                gplots_2.8.0
>> [35] caTools_1.12              bitops_1.0-4.1
>> [37] gdata_2.8.1               gtools_2.6.2
>>
>> loaded via a namespace (and not attached):
>>  [1] affy_1.30.0           affyio_1.20.0         annotate_1.30.0
>>  [4] hdrcde_2.15           KernSmooth_2.23-6     lattice_0.19-30
>>  [7] MASS_7.3-13           Matrix_0.999375-50    methylumi_1.8.0
>> [10] mgcv_1.7-6            nlme_3.1-101          preprocessCore_1.14.0
>> [13] RCurl_1.6-7           tcltk_2.13.1          tools_2.13.1
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
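
For anyone wanting to try the "choose the oldest record" tie-break suggested in the reply at the top of this thread, here is a minimal base-R sketch. It is not part of illuminaHumanv4.db or AnnotationDbi. The helper name pick_oldest_entrez is hypothetical, the input is toy data shaped like the mget(ids, allEGs) output quoted above, and treating the numerically smallest Entrez Gene ID as the "oldest" record is an assumption (a heuristic, not something the database guarantees).

```r
# Toy data shaped like the list returned by mget(ids, allEGs) in the thread:
# each probe ID maps to a character vector of Entrez Gene IDs, or to NA.
probe_to_entrez <- list(
  ILMN_1651944 = NA_character_,
  ILMN_1696806 = c("100528016", "1500"),
  ILMN_2293511 = c("100528016", "1500")
)

# Hypothetical helper (not a Bioconductor function).
# ASSUMPTION: the numerically smallest Entrez Gene ID is used as a proxy
# for the oldest record; this is a heuristic only.
pick_oldest_entrez <- function(ids) {
  ids <- ids[!is.na(ids)]                      # drop missing mappings
  if (length(ids) == 0) return(NA_character_)  # probe has no mapping at all
  ids[which.min(as.numeric(ids))]              # keep the smallest ID
}

# One Entrez Gene ID (or NA) per probe, as a named character vector.
resolved <- vapply(probe_to_entrez, pick_oldest_entrez, character(1))
print(resolved)
```

With the real package, the same helper could in principle be applied to mget(ids, toggleProbes(illuminaHumanv4ENTREZID, "all")), the call shown earlier in the thread. A sequence-identity check among the clashing records, as also suggested above, would be a more careful way to resolve conflicts than ID order alone.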

Infrastructure Cancer Homo sapiens probe AnnotationDbi ASSIGN Infrastructure Cancer probe • 1.8k views

Posted 14.3 years ago by Mark Cowley ▴ 910
