matching Entrez-IDs to Affy probesets using biomaRt
Naomi Altman ★ 6.0k
@naomi-altman-380
United States
After my premature posting yesterday, I am a bit hesitant to ask, but I am puzzled by what I am getting from biomaRt. (To avoid clutter, I added the sessionInfo at the end of the message.)

I used ReadAffy() to read in a rat dataset and called it CELdata.

> CELdata
AffyBatch object
size of arrays=834x834 features (19 kb)
cdf=Rat230_2 (31099 affyids)
number of samples=8
number of genes=31099
annotation=rat2302
notes=

> features=featureNames(CELdata)
> length(features)
[1] 31099
> sum(is.na(features))
[1] 0

I used features to query biomaRt for the Entrez-ids. I got back only 18882 probesets (but actually fewer, because some probesets are matched to 2 Entrez-ids). On the other hand, some of the Affy-ids that were returned did not match anything, so I am not sure why they were returned.

> matchFeature=getBM(attributes=c('affy_rat230_2','entrezgene'), filters='affy_rat230_2', values=features, mart=ensembl)
> dim(matchFeature)
[1] 18882 2
> sum(!is.na(matchFeature$affy_rat230_2))
[1] 18882
> sum(!is.na(matchFeature$entrezgene))
[1] 17814

I then used the non-missing Entrez-ids to query biomaRt for the Affy-ids. I got back only 18249 Entrez-ids (presumably because some Entrez-ids are matched to 2 probesets). Nothing is missing.

> matchEntrez=getBM(attributes=c('affy_rat230_2','entrezgene'), filters='entrezgene', values=matchFeature[!is.na(matchFeature[,2]),2], mart=ensembl)
> dim(matchEntrez)
[1] 18249 2
> sum(!is.na(matchEntrez[,1]))
[1] 18249
> sum(!is.na(matchEntrez[,2]))
[1] 18249

I am pretty sure that the discrepancies in the counts have to do with how getBM is handling multiple matches.

> length(unique(matchFeature[,1]))
[1] 16851
> length(unique(matchEntrez[,1]))
[1] 16143
> length(unique(matchFeature[,2]))
[1] 13738
> length(unique(matchEntrez[,2]))
[1] 13737
> length(unique(matchFeature[!is.na(matchFeature[,2]),1]))
[1] 16142

In any case, I seem to be missing about 13000 probesets. Surely there cannot be that many probesets on the array with no Entrez-id?
Thanks for any help you can provide.

Naomi Altman

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] rat2302cdf_2.13.0 hgu95av2cdf_2.13.0 AnnotationDbi_1.24.0 biomaRt_2.18.0
[5] edgeR_3.4.2 limma_3.18.13 affy_1.40.0 Biobase_2.22.0
[9] BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] affyio_1.30.0 BiocInstaller_1.12.0 DBI_0.2-7 IRanges_1.20.7
[5] preprocessCore_1.24.0 RCurl_1.95-4.1 RSQLite_0.11.4 stats4_3.0.2
[9] tools_3.0.2 XML_3.98-1.1 zlibbioc_1.8.0
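One way to untangle counts like the ones above is to separate four things: rows returned, distinct probesets returned, probesets returned with only a missing Entrez-id, and probesets that were queried but never returned. The following is a minimal sketch on a toy data frame shaped like the getBM() result. The probeset names and Entrez-ids are invented for illustration, and nothing here queries Ensembl.

```r
# Toy stand-in for the query vector and the getBM() result.
# All IDs below are invented placeholders, not real Rat230_2 probesets.
features <- c("p1_at", "p2_at", "p3_at", "p4_at", "p5_at")
matchFeature <- data.frame(
  affy_rat230_2 = c("p1_at", "p1_at", "p2_at", "p3_at"),
  entrezgene    = c(101L, 102L, 103L, NA),
  stringsAsFactors = FALSE
)

# Rows versus distinct probesets: these differ when a probeset
# maps to more than one Entrez-id.
n_rows   <- nrow(matchFeature)
n_probes <- length(unique(matchFeature$affy_rat230_2))

# Number of non-missing Entrez-ids per probeset.
has_id    <- !is.na(matchFeature$entrezgene)
per_probe <- table(matchFeature$affy_rat230_2[has_id])

# Probesets with more than one Entrez-id.
multi <- names(per_probe)[per_probe > 1]

# Probesets that came back, but only with a missing Entrez-id.
no_entrez <- setdiff(unique(matchFeature$affy_rat230_2), names(per_probe))

# Probesets that were queried but are absent from the result.
not_returned <- setdiff(features, matchFeature$affy_rat230_2)
```

Applied to the real objects, `not_returned` would list the probesets behind the gap between 31099 and 16851, which can then be inspected directly.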
Marc Carlson ★ 7.2k
@marc-carlson-2264
United States
Hi Naomi,

I don't have an answer for what biomaRt is doing here (although I bet that they will have some kind of explanation). But if you just need to do some quick annotation, there is also a Bioconductor package for that platform that you can use, called 'rat2302.db'.

library(rat2302.db)
length(keys(rat2302.db, keytype='PROBEID'))

This shows that it has 31099 probeset ids. Then to annotate some probes you could do it like this:

probes <- head(keys(rat2302.db, keytype='PROBEID'))
select(rat2302.db, keys=probes, columns=c('SYMBOL','GENENAME'), keytype='PROBEID')

And just in case you are currently only acclimated to biomaRt, you can learn more about how to use this package here:

http://www.bioconductor.org/packages/devel/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf

Marc

On 03/13/2014 02:01 PM, Naomi Altman wrote:
> After my premature posting yesterday, I am bit hesitant to ask, but I am
> puzzled by what I am getting from biomaRt.
> [...]
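The same select() call can be pointed at Entrez-ids to check how many probesets the annotation package itself leaves unmapped. This is only a sketch under assumptions: it assumes rat2302.db exposes an 'ENTREZID' column (check with columns(rat2302.db)), and summarise_map() is a hypothetical helper written for this example, not part of any package. The query is skipped if rat2302.db is not installed.

```r
# Hypothetical helper: summarise a probe-to-Entrez table with the
# PROBEID and ENTREZID columns that select() is assumed to return.
summarise_map <- function(map) {
  has_id <- !is.na(map$ENTREZID)
  c(probes        = length(unique(map$PROBEID)),
    probes_mapped = length(unique(map$PROBEID[has_id])),
    entrez_ids    = length(unique(map$ENTREZID[has_id])))
}

# Only run the real query when the annotation package is available.
if (requireNamespace("rat2302.db", quietly = TRUE)) {
  library(rat2302.db)
  probes <- AnnotationDbi::keys(rat2302.db, keytype = "PROBEID")
  # Probesets with several Entrez-ids come back as several rows.
  map <- AnnotationDbi::select(rat2302.db, keys = probes,
                               columns = "ENTREZID", keytype = "PROBEID")
  print(summarise_map(map))
}
```

The difference between `probes` and `probes_mapped` gives the number of probesets with no Entrez-id in this package, which can be compared with the roughly 13000 probesets missing from the biomaRt result.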