matching Entrez-IDs to Affy probesets using biomaRt

Naomi Altman ★ 6.0k

After my premature posting yesterday, I am a bit hesitant to ask, but I am puzzled by what I am getting from biomaRt. (To avoid clutter, I added the sessionInfo at the end of the message.)

I used ReadAffy() to read in a rat dataset and called it CELdata.

    > CELdata
    AffyBatch object
    size of arrays=834x834 features (19 kb)
    cdf=Rat230_2 (31099 affyids)
    number of samples=8
    number of genes=31099
    annotation=rat2302
    notes=
    > features=featureNames(CELdata)
    > length(features)
    [1] 31099
    > sum(is.na(features))
    [1] 0

I use features to query biomaRt for the Entrez-ids. I got back only 18882 rows (and actually fewer probesets, because some probesets are matched to 2 Entrez-ids). On the other hand, some of the Affy-ids that were returned did not match anything, so I am not sure why they were returned.

    > matchFeature=getBM(attributes=c('affy_rat230_2','entrezgene'), filters ='affy_rat230_2', values = features, mart = ensembl)
    > dim(matchFeature)
    [1] 18882 2
    > sum(!is.na(matchFeature$affy_rat230_2))
    [1] 18882
    > sum(!is.na(matchFeature$entrezgene))
    [1] 17814

I then use the non-missing Entrez-ids to query biomaRt for the Affy-ids. I got back only 18249 rows (presumably because some Entrez-ids are matched to 2 probesets). Nothing is missing.

    > matchEntrez=getBM(attributes=c('affy_rat230_2','entrezgene'), filters ='entrezgene', values = matchFeature[!is.na(matchFeature[,2]),2], mart = ensembl)
    > dim(matchEntrez)
    [1] 18249 2
    > sum(!is.na(matchEntrez[,1]))
    [1] 18249
    > sum(!is.na(matchEntrez[,2]))
    [1] 18249

I am pretty sure that the discrepancies in the counts have to do with how getBM is handling multiple matches.

    > length(unique(matchFeature[,1]))
    [1] 16851
    > length(unique(matchEntrez[,1]))
    [1] 16143
    > length(unique(matchFeature[,2]))
    [1] 13738
    > length(unique(matchEntrez[,2]))
    [1] 13737
    > length(unique(matchFeature[!is.na(matchFeature[,2]),1]))
    [1] 16142

In any case, I seem to be missing about 13000 probesets. Surely there cannot be that many probesets on the array with no Entrez-id?
Thanks for any help you can provide.

Naomi Altman

    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] parallel stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] rat2302cdf_2.13.0 hgu95av2cdf_2.13.0 AnnotationDbi_1.24.0 biomaRt_2.18.0
    [5] edgeR_3.4.2 limma_3.18.13 affy_1.40.0 Biobase_2.22.0
    [9] BiocGenerics_0.8.0

    loaded via a namespace (and not attached):
    [1] affyio_1.30.0 BiocInstaller_1.12.0 DBI_0.2-7 IRanges_1.20.7
    [5] preprocessCore_1.24.0 RCurl_1.95-4.1 RSQLite_0.11.4 stats4_3.0.2
    [9] tools_3.0.2 XML_3.98-1.1 zlibbioc_1.8.0
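The counts in the question mix three different things: rows returned, distinct probesets, and distinct Entrez IDs. The following is a minimal base-R sketch of how those can be separated. It uses a made-up five-probeset table in place of a real getBM() result, so the IDs and numbers are purely illustrative; only the column names are taken from the post. It also shows that unique() treats NA as a value of its own, which would be consistent with the 13738 versus 13737 difference reported above.

```r
# Toy stand-in for a getBM() result: one row per (probeset, Entrez ID) pair.
# All IDs here are invented for illustration; they are not real probesets.
features <- c("p1_at", "p2_at", "p3_at", "p4_at", "p5_at")
matchFeature <- data.frame(
  affy_rat230_2 = c("p1_at", "p1_at", "p2_at", "p3_at"),
  entrezgene    = c(101L, 102L, 101L, NA),
  stringsAsFactors = FALSE
)

# Rows are probeset/Entrez pairs, so nrow() overcounts multi-mapped probesets.
n_rows <- nrow(matchFeature)
n_probes_returned <- length(unique(matchFeature$affy_rat230_2))

# Restrict to rows that actually carry an Entrez ID before counting.
has_id <- !is.na(matchFeature$entrezgene)
n_probes_with_id <- length(unique(matchFeature$affy_rat230_2[has_id]))
n_unique_entrez  <- length(unique(matchFeature$entrezgene[has_id]))

# unique() keeps NA as a distinct value, inflating the count by one.
n_unique_entrez_with_na <- length(unique(matchFeature$entrezgene))

# Probesets that were queried but never came back at all.
not_returned <- setdiff(features, matchFeature$affy_rat230_2)

# Probesets mapped to more than one Entrez ID.
ids_per_probe <- table(matchFeature$affy_rat230_2[has_id])
multi_mapped  <- names(ids_per_probe)[ids_per_probe > 1]

print(c(rows = n_rows, probes_returned = n_probes_returned,
        probes_with_id = n_probes_with_id, unique_entrez = n_unique_entrez))
print(not_returned)
print(multi_mapped)
```

Applied to the real objects, setdiff(features, matchFeature$affy_rat230_2) would list the probesets that the query did not return, which is the group the question is really about.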

Marc Carlson ★ 7.2k

Hi Naomi,

I don't have an answer for what biomaRt is doing here (although I bet that they will have some kind of explanation). But if you just need to do some quick annotation, there is also a Bioconductor package for that platform that you can use, called 'rat2302.db'.

    library(rat2302.db)
    length(keys(rat2302.db, keytype='PROBEID'))

This shows that it has 31099 probeset IDs. Then to annotate some probes you could do it like this:

    probes <- head(keys(rat2302.db, keytype='PROBEID'))
    select(rat2302.db, keys=probes, columns=c('SYMBOL','GENENAME'), keytype='PROBEID')

And just in case you are currently only acclimated to biomaRt, you can learn more about how to use this package here:

http://www.bioconductor.org/packages/devel/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf

Marc

On 03/13/2014 02:01 PM, Naomi Altman wrote:
> After my premature posting yesterday, I am a bit hesitant to ask, but I am
> puzzled by what I am getting from biomaRt. [...]
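Since the original question was about Entrez IDs rather than symbols, the same select() call can be pointed at an Entrez column. The sketch below assumes that rat2302.db is installed and that it offers an 'ENTREZID' column; that column name does not appear in the post and should be checked with columns(rat2302.db). The requireNamespace() guard is only there so the snippet does nothing harmful when the package is absent.

```r
# Sketch only: assumes the Bioconductor package rat2302.db is installed and
# exposes an 'ENTREZID' column (an assumption; verify with columns(rat2302.db)).
map <- NULL
if (requireNamespace("rat2302.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(rat2302.db))

  # All probeset IDs known to the annotation package.
  probes <- keys(rat2302.db, keytype = "PROBEID")

  # One row per probeset/Entrez pair; probesets without an Entrez ID get NA.
  map <- AnnotationDbi::select(rat2302.db, keys = probes,
                               columns = "ENTREZID", keytype = "PROBEID")

  # Count distinct probesets with and without an Entrez ID.
  annotated <- unique(map$PROBEID[!is.na(map$ENTREZID)])
  cat("probesets on the chip:      ", length(probes), "\n")
  cat("probesets with an Entrez ID:", length(annotated), "\n")
  cat("probesets without one:      ", length(setdiff(probes, annotated)), "\n")
} else {
  message("rat2302.db is not installed; skipping the lookup.")
}
```

Comparing these three numbers with the biomaRt counts would show whether the missing probesets are genuinely unannotated or simply absent from the Ensembl mapping.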
