Mark Cowley
Hi Mark,
Thanks for the detailed email, and a big thanks for going to the
effort of the probe-remapping -- something that's been on my todo list
for far too long.
Can you please elaborate (or point me to a doc) on your probe mapping
process? Transcript-to-gene redundancy is the big issue here, which
CTNND1 suffers from.
What are your thoughts on a best-guess strategy when there's
ambiguity? If CTNND1 probes map to 1500 and 100528016, my vote is
generally to choose the oldest record, since
1500 = CTNND1 catenin (cadherin-associated protein), delta 1 [ Homo
sapiens ]
100528016 = TMX2-CTNND1 TMX2-CTNND1 readthrough (non-protein coding) [
Homo sapiens ]
... however, some additional coding that checks sequence identity
among clashes may help resolve conflicts. toggleProbes will provide
much of the raw data, but then there's lots of downstream work to
redo, which the AnnotationDbi pipeline (and you) have already done.
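For what it's worth, the "choose the oldest record" idea could be sketched roughly as follows. This is only an illustration in base R: pick_oldest_entrez is a hypothetical helper, not part of illuminaHumanv4.db or AnnotationDbi, and it assumes that a numerically smaller Entrez Gene ID means an older record. The input list is copied by hand from the toggleProbes output quoted below.

```r
# Entrez Gene mappings copied from the mget(ids, allEGs) output in this thread.
all_egs <- list(
  ILMN_1651944 = NA_character_,
  ILMN_1807510 = NA_character_,
  ILMN_1696806 = c("100528016", "1500"),
  ILMN_1663159 = NA_character_,
  ILMN_2293511 = c("100528016", "1500")
)

# Hypothetical helper: keep one Entrez ID per probe by taking the numerically
# smallest ID. ASSUMPTION: smaller IDs were assigned earlier, so the smallest
# ID stands in for "the oldest record".
pick_oldest_entrez <- function(ids) {
  ids <- ids[!is.na(ids)]
  if (length(ids) == 0L) return(NA_character_)  # probe has no mapping at all
  ids[which.min(as.numeric(ids))]
}

# Named character vector: one best-guess Entrez ID (or NA) per probe.
best_guess <- vapply(all_egs, pick_oldest_entrez, character(1))
print(best_guess)
```

With these five probes, the two ambiguous ones resolve to 1500 (CTNND1) and the three unmapped probes stay NA. A sequence-identity check among the clashing records would be a safer tie-breaker than ID order alone.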
cheers,
Mark
On 27/09/2011, at 11:05 PM, Mark Dunning wrote:
> Hi Mark,
>
> Thanks for pointing out this issue, as it does deserve more
> clarification. The RefSeq IDs used for the package do not come
> directly from the Illumina manifest file. Rather, we have taken the
> probe sequences and done a re-mapping to the genome and transcriptome.
> The RefSeq IDs that we assign during this re-mapping are the basis for
> a set of standard mappings provided by the AnnotationDbi
> infrastructure.
>
> However, as far as I know, probes that map to multiple EntrezIDs are
> automatically filtered out. You can use the toggleProbes function to
> change the usual mapping to return all values.
>
>> allEGs = toggleProbes(illuminaHumanv4ENTREZID, "all")
>
>> mget(ids, allEGs)
> $ILMN_1651944
> [1] NA
>
> $ILMN_1807510
> [1] NA
>
> $ILMN_1696806
> [1] "100528016" "1500"
>
> $ILMN_1663159
> [1] NA
>
> $ILMN_2293511
> [1] "100528016" "1500"
>
> So two of the probes *do* have mappings, but they do not get mapped to
> gene symbols because there is not one unique EntrezID.
>
> Aside from the usual Bioconductor mappings, we have added other
> information collected during our re-annotation to the package. Of most
> interest here is the Probe Quality score and Coding Zone.
>
>> unlist(mget(ids, illuminaHumanv4PROBEQUALITY))
> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
> "Bad" "No match" "Perfect" "Bad" "Perfect"
>
>> unlist(mget(ids, illuminaHumanv4CODINGZONE))
> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
> "Intronic" NA "5pUTR" "Intronic" "5pUTR"
>
> So one probe doesn't match any part of the genome, two map to
> introns, and the other two map uniquely to a genomic location, but at
> the 5' end of a gene. We did do our own mapping to Gene Symbol
> (independent of the mapping done by Bioconductor), which would
> correctly assign these probes to CTNND1. However, these mappings are
> not currently part of the released packages. We plan to include them
> in the next release though.
>
> Best wishes,
>
> Mark
>
> On Thu, Sep 22, 2011 at 10:58 AM, Mark Cowley <m.cowley@garvan.org.au> wrote:
>> Dear list,
>> I've read the illuminaHumanv4.db.pdf, and it's not clear to me how
>> the mappings are built. From the short package description, I thought
>> the RefSeq IDs from the Illumina array manifest would be used, but
>> according to the pdf manual, I think it's ACCNUM, but we're not told
>> from where the ACCNUM is derived (from ?illuminaHumanv4ACCNUM: "For
>> chip packages such as this, the ACCNUM mapping comes directly from the
>> manufacturer.").
>>
>> I raise the question, since within the illuminaHumanv4SYMBOL table,
>> there are no probes for the CTNND1 gene, whereas according to the
>> manifest file, there are 5 probes that should map to that gene:
>>
>> from the manifest:
>> $ grep -w CTNND1 HumanHT-12_V4_0_R2_15002873_B.txt | cut -f3,6,5,14
>> #Search_Key ILMN_Gene RefSeq_ID Symbol
>> XM_943087.1 CTNND1 XM_943087.1 ILMN_1651944
>> XM_937008.1 CTNND1 XM_937008.1 ILMN_1807510
>> XM_943098.1 CTNND1 NM_001085458.1 ILMN_1696806
>> XM_943098.1 CTNND1 XM_943098.1 ILMN_1663159
>> NM_001331.1 CTNND1 NM_001331.1 ILMN_2293511
>>
>> # from the illuminaHumanv4.db package
>> require(illuminaHumanv4.db)
>>> ids <- c("ILMN_1651944", "ILMN_1807510", "ILMN_1696806", "ILMN_1663159", "ILMN_2293511")
>>> unlist(mget(ids, illuminaHumanv4SYMBOL))
>> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>> NA NA NA NA NA
>>> unlist(mget(ids, illuminaHumanv4REFSEQ))
>> ILMN_1651944 ILMN_1807510 ILMN_1696806 ILMN_1663159 ILMN_2293511
>> NA NA NA NA NA
>> # why are there no REFSEQID's for these probes?
>>
>>> mget(ids, illuminaHumanv4ACCNUM)
>> $ILMN_1651944
>> [1] NA
>> $ILMN_1807510
>> [1] NA
>> $ILMN_1696806
>> [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
>> [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
>> [11] "NM_001085468" "NM_001085469" "NM_001331" "NR_037646"
>> $ILMN_1663159
>> [1] NA
>> $ILMN_2293511
>> [1] "NM_001085458" "NM_001085459" "NM_001085460" "NM_001085461" "NM_001085462"
>> [6] "NM_001085463" "NM_001085464" "NM_001085465" "NM_001085466" "NM_001085467"
>> [11] "NM_001085468" "NM_001085469" "NM_001331" "NR_037646"
>>
>> # all of these RefSeq IDs correspond to Entrez Gene ID 1500,
>> # CTNND1 catenin (cadherin-associated protein), delta 1 [ Homo sapiens ]
>> # why do 3 probes not have an ACCNUM?
>>
>>
>> If I BLAST all 5 probes, the 3 probes with NA in the ACCNUM (see
>> above) all align to NG_029078.1 (=CTNND1), but not to NM_001331
>> (=CTNND1), and the 2 probes with lots of ACCNUM IDs align to both
>> NG_029078.1 and NM_001331 amongst many others.
>> mget(ids, illuminaHumanv4PROBESEQUENCE)
>>> ILMN_1651944 -> NG_029078.1
>> GAAGGACCCTCCCCCGCTTCATAGTTTATGAATGCGAGAGTTGGTAAGGG
>>> ILMN_1807510 -> NG_029078.1
>> CGGTCATTCTCTGCCATCCCTAGAAAGAATGTCCAATCCACTGCCTTTGT
>>> ILMN_1696806 -> NG_029078.1, NM_001331, many others
>> GACCATCCCAAAAAGGAAGTGCACCTTGGAGCCTGTGGAGCTCTCAAGAA
>>> ILMN_1663159 -> NG_029078.1
>> GCCTATTCTTTAGCCTCCATTCCTATCTGTATTGCATACTGTAACTCCAA
>>> ILMN_2293511 -> NG_029078.1, NM_001331, many others
>> ATCCAGACTTTGGGTCGTGATTTCCGCAAGAATGGCAATGGGGGACCTGG
>>
>>
>>
>> I'd really love to get to the bottom of this, as the R annotation
>> packages are very rich, but missing IDs make it hard to know whether
>> they're better than the manufacturer's manifest files.
>>
>> cheers,
>> Mark
>> -----------------------------------------------------
>> Mark Cowley, PhD
>>
>> Pancreatic Cancer Program | Peter Wills Bioinformatics Centre
>> Garvan Institute of Medical Research, Sydney, Australia
>> -----------------------------------------------------
>>
>>
>>> sessionInfo()
>> R version 2.13.1 (2011-07-08)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_AU.UTF-8/en_AU.UTF-8/C/C/en_AU.UTF-8/en_AU.UTF-8
>>
>> attached base packages:
>> [1] graphics datasets grDevices utils grid stats methods
>> [8] base
>>
>> other attached packages:
>> [1] illuminaHumanv4.db_1.10.0 org.Hs.eg.db_2.5.0
>> [3] RSQLite_0.9-4 DBI_0.2-5
>> [5] AnnotationDbi_1.14.1 limma_3.8.3
>> [7] mjcdev_1.0 Cairo_1.4-9
>> [9] metaGSEA_1.0.2 pwbc_1.0.3
>> [11] lumidat_1.0.1 lumi_2.4.0
>> [13] nleqslv_1.8.6 updateR_1.0.4
>> [15] roxygen_0.1-3 digest_0.5.0
>> [17] codetools_0.2-8 haselst_0.1
>> [19] blat_0.1 genomics_0.1
>> [21] mjcbase_0.1 GEOquery_2.19.2
>> [23] cor_0.1 xtable_1.5-6
>> [25] rgl_0.92.798 qvalue_1.26.0
>> [27] igraph_0.5.5-2 graph_1.30.0
>> [29] XML_3.4-2 SparseM_0.89
>> [31] Biobase_2.12.2 sos_1.3-1
>> [33] brew_1.0-6 gplots_2.8.0
>> [35] caTools_1.12 bitops_1.0-4.1
>> [37] gdata_2.8.1 gtools_2.6.2
>>
>> loaded via a namespace (and not attached):
>> [1] affy_1.30.0 affyio_1.20.0 annotate_1.30.0
>> [4] hdrcde_2.15 KernSmooth_2.23-6 lattice_0.19-30
>> [7] MASS_7.3-13 Matrix_0.999375-50 methylumi_1.8.0
>> [10] mgcv_1.7-6 nlme_3.1-101 preprocessCore_1.14.0
>> [13] RCurl_1.6-7 tcltk_2.13.1 tools_2.13.1
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>