Question: illuminaHumanv4 mappings
Mark Cowley wrote (8.2 years ago):
Hi Mark,

Thanks for the detailed email, and a big thanks for going to the effort of the probe-remapping -- something that's been on my todo list for far too long.

Can you please elaborate (or point me to a doc) on your probe mapping process? Transcript-to-gene redundancy is the big issue here, which CTNND1 suffers from. What are your thoughts on a best-guess strategy when there's ambiguity? If CTNND1 probes map to 1500 and 100528016, my vote is generally to choose the oldest record, since

1500 = CTNND1 catenin (cadherin-associated protein), delta 1 [ Homo sapiens ]
100528016 = TMX2-CTNND1 TMX2-CTNND1 readthrough (non-protein coding) [ Homo sapiens ]

... however, some additional coding that checks sequence identity among clashes may help resolve conflicts. toggleProbes will provide much of the raw data, but then there's lots of downstream work to redo which the AnnotationDbi pipeline (and you) have already done.

cheers,
Mark

On 27/09/2011, at 11:05 PM, Mark Dunning wrote:

> Hi Mark,
>
> Thanks for pointing out this issue, as it does deserve more
> clarification. The RefSeq IDs used for the package do not come
> directly from the Illumina manifest file. Rather we have taken the
> probe sequences and done a re-mapping to the genome and transcriptome.
> The RefSeq IDs that we assign during this re-mapping are the basis for
> a set of standard mappings provided by the AnnotationDbi
> infrastructure.
>
> However, as far as I know, probes that map to multiple EntrezIDs are
> automatically filtered out. You can use the toggleProbes function to
> change the usual mapping to return all values.
>
> > allEGs = toggleProbes(illuminaHumanv4ENTREZID, "all")
> > mget(ids, allEGs)
> $ILMN_1651944
> [1] NA
>
> $ILMN_1807510
> [1] NA
>
> $ILMN_1696806
> [1] "100528016" "1500"
>
> $ILMN_1663159
> [1] NA
>
> $ILMN_2293511
> [1] "100528016" "1500"
>
> So two of the probes *do* have mappings, but they do not get mapped to
> gene symbols because there is not one unique EntrezID.
>
> Aside from the usual Bioconductor mappings, we have added other
> information collected during our re-annotation to the package. Of most
> interest here are the Probe Quality score and Coding Zone.
>
> > unlist(mget(ids, illuminaHumanv4PROBEQUALITY))
> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>        "Bad"   "No match"    "Perfect"        "Bad"    "Perfect"
>
> > unlist(mget(ids, illuminaHumanv4CODINGZONE))
> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>   "Intronic"           NA      "5pUTR"   "Intronic"      "5pUTR"
>
> So one probe doesn't match any part of the genome, two map to
> introns and the other two uniquely map to a genomic location, but at
> the 5' end of a gene. We did do our own mapping to Gene Symbol
> (independent of the mapping done by Bioconductor), which would
> correctly assign these probes to CTNND1. However, these mappings are
> not currently part of the released packages. We plan to include them
> in the next release though.
>
> Best wishes,
>
> Mark
>
> On Thu, Sep 22, 2011 at 10:58 AM, Mark Cowley <m.cowley@garvan.org.au> wrote:
>> Dear list,
>> I've read the illuminaHumanv4.db.pdf, and it's not clear to me how the mappings are built. From the short package description, I thought the RefSeq IDs from the Illumina array manifest would be used, but according to the pdf manual, I think it's ACCNUM, but we're not told from where the ACCNUM is derived (from ?illuminaHumanv4ACCNUM: "For chip packages such as this, the ACCNUM mapping comes directly from the manufacturer.").
>>
>> I raise the question, since within the illuminaHumanv4SYMBOL table, there are no probes for the CTNND1 gene, whereas according to the manifest file, there are 5 probes that should map to that gene:
>>
>> from the manifest:
>> $ grep -w CTNND1 HumanHT-12_V4_0_R2_15002873_B.txt | cut -f3,6,5,14
>> #Search_Key  ILMN_Gene  RefSeq_ID       Symbol
>> XM_943087.1  CTNND1     XM_943087.1     ILMN_1651944
>> XM_937008.1  CTNND1     XM_937008.1     ILMN_1807510
>> XM_943098.1  CTNND1     NM_001085458.1  ILMN_1696806
>> XM_943098.1  CTNND1     XM_943098.1     ILMN_1663159
>> NM_001331.1  CTNND1     NM_001331.1     ILMN_2293511
>>
>> # from the illuminaHumanv4.db package
>> require(illuminaHumanv4.db)
>> > ids <- c("ILMN_1651944", "ILMN_1807510", "ILMN_1696806", "ILMN_1663159", "ILMN_2293511")
>> > unlist(mget(ids, illuminaHumanv4SYMBOL))
>> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>>           NA           NA           NA           NA           NA
>> > unlist(mget(ids, illuminaHumanv4REFSEQ))
>> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>>           NA           NA           NA           NA           NA
>> # why are there no REFSEQID's for these probes?
>>
>> > mget(ids, illuminaHumanv4ACCNUM)
>> $ILMN_1651944
>> [1] NA
>> $ILMN_1807510
>> [1] NA
>> $ILMN_1696806
>>  [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
>>  [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
>> [11] "NM_001085468" "NM_001085469" "NM_001331"    "NR_037646"
>> $ILMN_1663159
>> [1] NA
>> $ILMN_2293511
>>  [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
>>  [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
>> [11] "NM_001085468" "NM_001085469" "NM_001331"    "NR_037646"
>>
>> # all of these RefSeq ID's correspond to Entrez Gene ID 1500, CTNND1 catenin (cadherin-associated protein), delta 1 [ Homo sapiens ]
>> # why do 3 probes not have an ACCNUM?
>>
>>
>> If I BLAST all 5 probes, the 3 probes with NA in the ACCNUM (see above) all align to NG_029078.1 (=CTNND1), but not to NM_001331 (=CTNND1), and the 2 probes with lots of ACCNUM IDs align to both NG_029078.1 and NM_001331 amongst many others.
>> mget(ids, illuminaHumanv4PROBESEQUENCE)
>> > ILMN_1651944 -> NG_029078.1
>> GAAGGACCCTCCCCCGCTTCATAGTTTATGAATGCGAGAGTTGGTAAGGG
>> > ILMN_1807510 -> NG_029078.1
>> CGGTCATTCTCTGCCATCCCTAGAAAGAATGTCCAATCCACTGCCTTTGT
>> > ILMN_1696806 -> NG_029078.1, NM_001331, many others
>> GACCATCCCAAAAAGGAAGTGCACCTTGGAGCCTGTGGAGCTCTCAAGAA
>> > ILMN_1663159 -> NG_029078.1
>> GCCTATTCTTTAGCCTCCATTCCTATCTGTATTGCATACTGTAACTCCAA
>> > ILMN_2293511 -> NG_029078.1, NM_001331, many others
>> ATCCAGACTTTGGGTCGTGATTTCCGCAAGAATGGCAATGGGGGACCTGG
>>
>>
>>
>> I'd really love to get to the bottom of this, as the R annotation packages are very rich, but missing IDs make it hard to know whether they're better than the manufacturer's manifest files.
>>
>> cheers,
>> Mark
>> -----------------------------------------------------
>> Mark Cowley, PhD
>>
>> Pancreatic Cancer Program | Peter Wills Bioinformatics Centre
>> Garvan Institute of Medical Research, Sydney, Australia
>> -----------------------------------------------------
>>
>>
>> > sessionInfo()
>> R version 2.13.1 (2011-07-08)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_AU.UTF-8/en_AU.UTF-8/C/C/en_AU.UTF-8/en_AU.UTF-8
>>
>> attached base packages:
>> [1] graphics  datasets  grDevices utils     grid      stats     methods
>> [8] base
>>
>> other attached packages:
>>  [1] illuminaHumanv4.db_1.10.0 org.Hs.eg.db_2.5.0
>>  [3] RSQLite_0.9-4             DBI_0.2-5
>>  [5] AnnotationDbi_1.14.1      limma_3.8.3
>>  [7] mjcdev_1.0                Cairo_1.4-9
>>  [9] metaGSEA_1.0.2            pwbc_1.0.3
>> [11] lumidat_1.0.1             lumi_2.4.0
>> [13] nleqslv_1.8.6             updateR_1.0.4
>> [15] roxygen_0.1-3             digest_0.5.0
>> [17] codetools_0.2-8           haselst_0.1
>> [19] blat_0.1                  genomics_0.1
>> [21] mjcbase_0.1               GEOquery_2.19.2
>> [23] cor_0.1                   xtable_1.5-6
>> [25] rgl_0.92.798              qvalue_1.26.0
>> [27] igraph_0.5.5-2            graph_1.30.0
>> [29] XML_3.4-2                 SparseM_0.89
>> [31] Biobase_2.12.2            sos_1.3-1
>> [33] brew_1.0-6                gplots_2.8.0
>> [35] caTools_1.12              bitops_1.0-4.1
>> [37] gdata_2.8.1               gtools_2.6.2
>>
>> loaded via a namespace (and not attached):
>>  [1] affy_1.30.0           affyio_1.20.0         annotate_1.30.0
>>  [4] hdrcde_2.15           KernSmooth_2.23-6     lattice_0.19-30
>>  [7] MASS_7.3-13           Matrix_0.999375-50    methylumi_1.8.0
>> [10] mgcv_1.7-6            nlme_3.1-101          preprocessCore_1.14.0
>> [13] RCurl_1.6-7           tcltk_2.13.1          tools_2.13.1
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>

[[alternative HTML version deleted]]
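
The "choose the oldest record" idea from the reply at the top of this thread can be sketched in base R. This is only an illustration under stated assumptions: the input is the mget(ids, allEGs) output quoted in the thread, typed in by hand so the snippet runs without the annotation package; pick_oldest_entrez is a hypothetical helper, not a function from illuminaHumanv4.db or AnnotationDbi; and it treats the numerically smallest Entrez Gene ID as a proxy for the oldest record, which is an assumption and not something the thread confirms.

```r
# Toy input copied by hand from the mget(ids, allEGs) output quoted in
# this thread; no annotation package is needed to run this sketch.
all_egs <- list(
  ILMN_1651944 = NA_character_,
  ILMN_1807510 = NA_character_,
  ILMN_1696806 = c("100528016", "1500"),
  ILMN_1663159 = NA_character_,
  ILMN_2293511 = c("100528016", "1500")
)

# Hypothetical helper (not part of illuminaHumanv4.db or AnnotationDbi).
# ASSUMPTION: the numerically smallest Entrez Gene ID stands in for
# "the oldest record"; the thread does not define how age is measured.
pick_oldest_entrez <- function(eg_list) {
  vapply(eg_list, function(egs) {
    egs <- egs[!is.na(egs)]                 # drop missing mappings
    if (length(egs) == 0L) return(NA_character_)
    egs[which.min(as.numeric(egs))]         # keep the smallest ID
  }, character(1))
}

resolved <- pick_oldest_entrez(all_egs)
print(resolved)
```

With the package loaded, the same helper could be applied to the result of mget(ids, toggleProbes(illuminaHumanv4ENTREZID, "all")), which is the call shown in Mark Dunning's reply. The sketch does not check sequence identity among the clashing records, which the reply suggests as a further step.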