Question: matching Entrez-IDs to Affy probesets using biomaRt
Naomi Altman wrote (5.7 years ago):
After my premature posting yesterday, I am a bit hesitant to ask, but I am puzzled by what I am getting from biomaRt. (To avoid clutter, I added the sessionInfo at the end of the message.)

I used ReadAffy() to read in a rat dataset and called it CELdata.

    > CELdata
    AffyBatch object
    size of arrays=834x834 features (19 kb)
    cdf=Rat230_2 (31099 affyids)
    number of samples=8
    number of genes=31099
    annotation=rat2302
    notes=

    > features=featureNames(CELdata)
    > length(features)
    [1] 31099
    > sum(is.na(features))
    [1] 0

I use features to query biomaRt for the Entrez-ids. I got back only 18882 probesets (but actually fewer, because some probesets are matched to 2 Entrez-ids). On the other hand, some of the Affy-ids that were returned did not match anything, so I am not sure why they were returned.

    > matchFeature=getBM(attributes=c('affy_rat230_2','entrezgene'), filters ='affy_rat230_2', values = features, mart = ensembl)
    > dim(matchFeature)
    [1] 18882 2
    > sum(!is.na(matchFeature$affy_rat230_2))
    [1] 18882
    > sum(!is.na(matchFeature$entrezgene))
    [1] 17814

I then use the non-missing Entrez-ids to query biomaRt for the Affy-ids. I got back only 18249 Entrez-ids (presumably because some Entrez-ids are matched to 2 probesets). Nothing is missing.

    > matchEntrez=getBM(attributes=c('affy_rat230_2','entrezgene'), filters ='entrezgene', values = matchFeature[!is.na(matchFeature[,2]),2], mart = ensembl)
    > dim(matchEntrez)
    [1] 18249 2
    > sum(!is.na(matchEntrez[,1]))
    [1] 18249
    > sum(!is.na(matchEntrez[,2]))
    [1] 18249

I am pretty sure that the discrepancies in the counts have to do with how getBM is handling multiple matches.

    > length(unique(matchFeature[,1]))
    [1] 16851
    > length(unique(matchEntrez[,1]))
    [1] 16143
    > length(unique(matchFeature[,2]))
    [1] 13738
    > length(unique(matchEntrez[,2]))
    [1] 13737
    > length(unique(matchFeature[!is.na(matchFeature[,2]),1]))
    [1] 16142

In any case, I seem to be missing about 13000 probesets. Surely there cannot be that many probesets on the array with no Entrez-id?

Thanks for any help you can provide.

Naomi Altman

    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] parallel stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] rat2302cdf_2.13.0 hgu95av2cdf_2.13.0 AnnotationDbi_1.24.0 biomaRt_2.18.0
    [5] edgeR_3.4.2 limma_3.18.13 affy_1.40.0 Biobase_2.22.0
    [9] BiocGenerics_0.8.0

    loaded via a namespace (and not attached):
    [1] affyio_1.30.0 BiocInstaller_1.12.0 DBI_0.2-7 IRanges_1.20.7
    [5] preprocessCore_1.24.0 RCurl_1.95-4.1 RSQLite_0.11.4 stats4_3.0.2
    [9] tools_3.0.2 XML_3.98-1.1 zlibbioc_1.8.0
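
The counts in the question can be broken down with a few base-R set operations. From the numbers posted, 31099 - 16851 = 14248 probesets were never returned by the first query, and 16851 - 16142 = 709 were returned with only NA as Entrez-id. The sketch below shows one way to separate those cases. It uses a small invented table in place of the real getBM() result (the probeset names and Entrez values are made up for illustration), so it runs without a BioMart connection.

```r
# Toy stand-ins (hypothetical values) for `features` and the getBM() result.
features <- c("p1_at", "p2_at", "p3_at", "p4_at", "p5_at")
matchFeature <- data.frame(
  affy_rat230_2 = c("p1_at", "p1_at", "p2_at", "p3_at"),
  entrezgene    = c(101L, 102L, 103L, NA),
  stringsAsFactors = FALSE
)

# Probesets that the query did not return at all.
not_returned <- setdiff(features, matchFeature$affy_rat230_2)

# Probesets that were returned, but never with a non-missing Entrez-id.
has_entrez <- tapply(!is.na(matchFeature$entrezgene),
                     matchFeature$affy_rat230_2, any)
returned_no_entrez <- names(has_entrez)[!has_entrez]

# Among rows with an Entrez-id, count distinct Entrez-ids per probeset.
mapped <- matchFeature[!is.na(matchFeature$entrezgene), ]
n_entrez <- tapply(mapped$entrezgene, mapped$affy_rat230_2,
                   function(x) length(unique(x)))
multi_mapped    <- names(n_entrez)[n_entrez > 1]
uniquely_mapped <- names(n_entrez)[n_entrez == 1]

# The four groups together account for every probeset in `features`.
group_sizes <- c(not_returned       = length(not_returned),
                 returned_no_entrez = length(returned_no_entrez),
                 multi_mapped       = length(multi_mapped),
                 uniquely_mapped    = length(uniquely_mapped))
print(group_sizes)
```

Applied to the real `features` and `matchFeature` objects, the same steps would show which probesets fall into each group, which is probably more informative than comparing row counts, since one probeset can occupy several rows.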
Answer: matching Entrez-IDs to Affy probesets using biomaRt
Marc Carlson (United States) wrote (5.7 years ago):
Hi Naomi,

I don't have an answer for what biomaRt is doing here (although I bet that they will have some kind of explanation). But if you just need to do some quick annotation, there is also a Bioconductor package for that platform that you can use, called 'rat2302.db':

    library(rat2302.db)
    length(keys(rat2302.db, keytype='PROBEID'))

This shows that it has 31099 probeset ids. Then, to annotate some probes, you could do it like this:

    probes <- head(keys(rat2302.db, keytype='PROBEID'))
    select(rat2302.db, keys=probes, columns=c('SYMBOL','GENENAME'), keytype='PROBEID')

And just in case you are currently only acclimated to biomaRt, you can learn more about how to use this package here:

http://www.bioconductor.org/packages/devel/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf

Marc

On 03/13/2014 02:01 PM, Naomi Altman wrote:
> [...]
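
Since the original question is about Entrez-ids rather than symbols, the same select() call can presumably be pointed at the 'ENTREZID' column. The following is a minimal sketch, not a tested recipe for this dataset: `summarise_mapping` is a hypothetical helper introduced here, the toy table contains invented values, and the real query only runs if rat2302.db happens to be installed.

```r
# Count probesets with and without an Entrez-id in a select()-style table.
# `tbl` is assumed to have columns PROBEID and ENTREZID, one row per mapping.
summarise_mapping <- function(tbl) {
  all_ids    <- unique(tbl$PROBEID)
  mapped_ids <- unique(tbl$PROBEID[!is.na(tbl$ENTREZID)])
  c(total          = length(all_ids),
    with_entrez    = length(mapped_ids),
    without_entrez = length(all_ids) - length(mapped_ids))
}

# Toy table (invented values) in the shape select() returns:
# p1_at maps to two Entrez-ids, p3_at maps to none.
toy <- data.frame(
  PROBEID  = c("p1_at", "p1_at", "p2_at", "p3_at"),
  ENTREZID = c("101", "102", "103", NA),
  stringsAsFactors = FALSE
)
toy_summary <- summarise_mapping(toy)
print(toy_summary)

# With the real annotation package, if it is installed:
if (requireNamespace("rat2302.db", quietly = TRUE)) {
  library(rat2302.db)
  probes <- keys(rat2302.db, keytype = "PROBEID")
  entrez <- select(rat2302.db, keys = probes,
                   columns = "ENTREZID", keytype = "PROBEID")
  print(summarise_mapping(entrez))
}
```

Counting unique probeset ids rather than rows keeps the summary unaffected by probesets that map to several Entrez-ids, which is the situation that made the biomaRt row counts hard to interpret.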