understanding multiple matches between probesets and entrezgene (biomaRt)
Juliet Hannah ▴ 360
@juliet-hannah-4531
Last seen 5.0 years ago
United States
All,

I understand the concept of multiple probesets corresponding to one identifier. But what is the meaning of a probeset corresponding to multiple identifiers? And below, given that 220547_s_at has a match, why should another row be returned with NA?

Did I happen to choose a few probesets where the gene definition is changing, or am I misunderstanding something else, such as the biomaRt syntax?

Thanks,

Juliet

    library("biomaRt")
    probeSets <- c("219666_at", "220547_s_at", "218034_at")
    ensembl = useMart("ensembl")
    ensembl = useDataset("hsapiens_gene_ensembl", mart = ensembl)

    getBM(attributes = c("affy_hg_u133a", "entrezgene"),
          filters = "affy_hg_u133a", values = probeSets, mart = ensembl)

      affy_hg_u133a entrezgene
    1   220547_s_at      54537
    2     218034_at      51024
    3   220547_s_at         NA
    4     219666_at      64231
    5   220547_s_at     414241
    6   220547_s_at     439965

    > sessionInfo()
    R version 2.15.0 (2012-03-30)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] biomaRt_2.12.0      BiocInstaller_1.4.6

    loaded via a namespace (and not attached):
    [1] RCurl_1.91-1 tools_2.15.0 XML_3.9-4
@james-w-macdonald-5106
Last seen 5 hours ago
United States
Hi Juliet,

On 6/13/2012 11:01 AM, Juliet Hannah wrote:
> I understand the concept of multiple probesets corresponding to one
> identifier. But what is the meaning of a probeset corresponding to
> multiple identifiers? And below, given that 220547_s_at has a match,
> why should another row be returned with NA.
>
> Did I happen to choose a few probesets where the gene definition is
> changing, or am I misunderstanding something else, such as the
> biomart syntax.

I'm not sure about the NA being returned. That probably has something to do with how the Biomart database is set up.

As for the multiple genes per probeset, this has to do with the fact that a 25-mer isn't really long enough to distinguish between genes with relatively high homology. This is supposed to be reflected in the probeset ID, although things have changed quite a bit since UniGene build 133.

The probeset you are showing below has a _s_at identifier, which indicates that it cross-hybridizes to multiple members of a related gene family (in this case the FAM35 gene family). There are other identifiers like the _x_at which indicates cross-hybridization to unrelated genes.

http://www.affymetrix.com/support/help/faqs/hgu133/index.jsp

Best,

Jim

[quoted code, output and sessionInfo() snipped; identical to the original post]

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
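For anyone who wants to spot such probesets before querying Biomart, here is a minimal base R sketch that labels IDs by their suffix, following the description in the answer above. `classify_probeset` and its label strings are hypothetical names made up for illustration; they are not part of biomaRt or any annotation package, and the suffix only hints at the design intent, so it is no substitute for checking the actual mappings.

```r
# Probeset IDs from the original question.
probeSets <- c("219666_at", "220547_s_at", "218034_at")

# Hypothetical helper (an assumption for illustration, not a biomaRt function):
# label each ID by its suffix, using the meanings given in the answer above.
classify_probeset <- function(ids) {
  ifelse(grepl("_s_at$", ids), "related gene family (_s_at)",
  ifelse(grepl("_x_at$", ids), "possible cross-hybridization (_x_at)",
         "no _s_ or _x_ suffix"))
}

# Named character vector: probeset ID -> suffix label.
suffix_class <- setNames(classify_probeset(probeSets), probeSets)
print(suffix_class)
```

Of the three IDs in the question, only 220547_s_at gets the `_s_at` label, which matches the probeset that came back with several Entrez IDs.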
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi,

On Wed, Jun 13, 2012 at 11:01 AM, Juliet Hannah <juliet.hannah at gmail.com> wrote:
> I understand the concept of multiple probesets corresponding to one
> identifier. But what is the meaning of a probeset corresponding to
> multiple identifiers? And below, given that 220547_s_at has a match,
> why should another row be returned with NA.

[snip]

Given the output you posted, the multiple Entrez IDs returned for the same probeset are:

http://www.ncbi.nlm.nih.gov/gene?term=414241
http://www.ncbi.nlm.nih.gov/gene?term=54537
http://www.ncbi.nlm.nih.gov/gene?term=439965

They're all within the same family, and there is at least one pseudogene. Their "Gene description" fields all mention "family with sequence similarity 35". Given that information, I guess we can take a guess as to why this might be happening.

You might consider looking into the CDFs the "brainarray" people are publishing to perhaps avoid these probes altogether:

http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp

Not sure about the NA part of your question ...

HTH,
-steve

[quoted code, output and sessionInfo() snipped; identical to the original post]

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
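If you would rather flag the ambiguous probesets in the result than switch CDFs, something along these lines might help. It is only a sketch in base R: the data frame is typed in by hand from the output in the question (it is not a fresh query), and the names `res`, `mapped`, `ids_by_probeset`, `n_ids` and `ambiguous` are made up for illustration. Dropping the NA row just sets it aside; it does not explain why Biomart returns it.

```r
# Result of getBM() as printed in the question, typed in by hand.
res <- data.frame(
  affy_hg_u133a = c("220547_s_at", "218034_at", "220547_s_at",
                    "219666_at", "220547_s_at", "220547_s_at"),
  entrezgene    = c(54537, 51024, NA, 64231, 414241, 439965),
  stringsAsFactors = FALSE
)

# Set aside rows that carry no Entrez Gene ID.
mapped <- res[!is.na(res$entrezgene), ]

# Collect the Entrez IDs reported for each probeset.
ids_by_probeset <- split(mapped$entrezgene, mapped$affy_hg_u133a)

# Count distinct IDs per probeset; more than one means an ambiguous mapping.
n_ids <- vapply(ids_by_probeset, function(x) length(unique(x)), integer(1))
ambiguous <- names(n_ids)[n_ids > 1]

print(n_ids)
print(ambiguous)
```

With the six rows from the question, only 220547_s_at ends up in `ambiguous`, with three distinct Entrez IDs. Whether to drop such probesets or keep all of their mappings depends on the downstream analysis.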