BiomaRt Ensembl RefSeq query error


Georg Otto ▴ 120

@georg-otto-6333

Last seen 5.4 years ago

United Kingdom

Dear Bioconductors,

I am trying to query 14005 Ensembl gene IDs for their RefSeq annotations using this code (I can send the gene IDs upon request):

    library(biomaRt)
    ensembl <- useMart("ensembl", dataset = 'mmusculus_gene_ensembl')

    getBM(attributes = c("ensembl_gene_id", "refseq_mrna"),
          filter = "ensembl_gene_id",
          ensembl.ids,
          mart = ensembl, uniqueRows = TRUE)

If I query the full gene set, many RefSeq IDs are missing (NA), for example for the gene ENSMUSG00000000567 (Sox9), whereas if I query a subset, say ensembl.ids[1:12000], all the RefSeq IDs are there. It does not seem to matter which subset I use, but the subset has to be smaller than about 12000 genes.

Any idea what is going on?

Best wishes,
Georg


ADD COMMENT • link 10.2 years ago Georg Otto ▴ 120
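One way to quantify the symptom described in the question is to compare the requested IDs with what the query returned. The following is a minimal base R sketch, assuming the result is a data frame with `ensembl_gene_id` and `refseq_mrna` columns as in the call above. The helper name `summarise_missing` and the toy IDs are hypothetical and only serve as illustration.

```r
# Hypothetical helper: report which requested IDs are absent from a result,
# and which are present but carry no RefSeq annotation (NA or empty string).
summarise_missing <- function(requested, result) {
  requested <- unique(as.character(requested))
  returned <- unique(as.character(result$ensembl_gene_id))
  has_refseq <- !is.na(result$refseq_mrna) & nzchar(result$refseq_mrna)
  annotated <- unique(as.character(result$ensembl_gene_id[has_refseq]))
  list(
    absent = setdiff(requested, returned),
    unannotated = setdiff(intersect(requested, returned), annotated)
  )
}

# Toy data for illustration only (not real query output).
toy_ids <- c("id_a", "id_b", "id_c")
toy_result <- data.frame(
  ensembl_gene_id = c("id_a", "id_b"),
  refseq_mrna = c("tx_a", NA),
  stringsAsFactors = FALSE
)
missing_report <- summarise_missing(toy_ids, toy_result)
```

Running the same helper on the full query and on a subset query would show whether the missing genes are dropped entirely or returned without a RefSeq ID.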


Georg Otto ▴ 120


As an amendment to my previous post, here is the sessionInfo():

    R version 3.0.1 (2013-05-16)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
     [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] biomaRt_2.18.0

    loaded via a namespace (and not attached):
     [1] annotate_1.40.0      AnnotationDbi_1.24.0 Biobase_2.22.0
     [4] BiocGenerics_0.8.0   compiler_3.0.1       DBI_0.2-7
     [7] DESeq_1.14.0         genefilter_1.44.0    geneplotter_1.40.0
    [10] grid_3.0.1           IRanges_1.20.6       lattice_0.20-24
    [13] parallel_3.0.1       RColorBrewer_1.0-5   RCurl_1.95-4.1
    [16] RSQLite_0.11.4       splines_3.0.1        stats4_3.0.1
    [19] survival_2.37-4      tools_3.0.1          XML_3.98-1.1
    [22] xtable_1.7-1

ADD COMMENT • link 10.2 years ago Georg Otto ▴ 120


Georg,

Using your code and calling for only "ENSMUSG00000000567" does not result in NA for me, as you can see:

    library(biomaRt)
    ensembl <- useMart("ensembl", dataset = 'mmusculus_gene_ensembl')
    getBM(attributes = c("ensembl_gene_id", "refseq_mrna"),
          filter = "ensembl_gene_id",
          "ENSMUSG00000000567", mart = ensembl, uniqueRows = TRUE)

         ensembl_gene_id refseq_mrna
    1 ENSMUSG00000000567   NM_011448

You are running R 3.0.1 just like me, but your biomaRt is 2.18 (I'm running 2.16, see below). biomaRt 2.18 is part of BioC 2.13, which is meant for R 3.0.2, as noted here: http://www.bioconductor.org/install/

That is the most likely cause.

Wade

    sessionInfo()
    R version 3.0.1 (2013-05-16)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] biomaRt_2.16.0

    loaded via a namespace (and not attached):
    [1] RCurl_1.95-4.1 XML_3.98-1.1

-----Original Message-----
From: Georg Otto [mailto:georg.otto@imm.ox.ac.uk]
Sent: Tuesday, January 21, 2014 6:49 AM
To: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] BiomaRt Ensembl RefSeq query error

ADD REPLY • link 10.2 years ago Davis, Wade ▴ 350
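The version mismatch raised in the reply above can be checked from within R. This is a small sketch using only base R version utilities; the helper name `r_at_least` is hypothetical, and the required version (3.0.2 for BioC 2.13) is taken from the reply rather than looked up programmatically.

```r
# Hypothetical helper: TRUE if the R version in `current` is at least `min_r`.
r_at_least <- function(min_r, current = getRversion()) {
  current >= package_version(min_r)
}

# Per the reply above, BioC 2.13 (which ships biomaRt 2.18) targets R 3.0.2.
r_ok_for_bioc_2_13 <- r_at_least("3.0.2")

# The reporter's setup (R 3.0.1) would fail this check.
reporter_ok <- r_at_least("3.0.2", current = package_version("3.0.1"))
```

The installed package version itself can be read with `packageVersion("biomaRt")` and compared in the same way.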


Georg Otto ▴ 120


Dear Wade,

thank you very much for your response. I reverted my installation to biomaRt_2.16.0, but the problem persists. I have added some example code below that supports the conclusion that the problem is caused by the RefSeq query and by the number of genes queried, not by specific genes. It concerns not only the Sox9 gene but a lot of other genes too. Unfortunately I cannot send attachments here, so I will gladly send the file with the Ensembl gene IDs upon request.

Best wishes,
Georg

    library(biomaRt)
    ensembl <- useMart("ensembl", dataset = 'mmusculus_gene_ensembl')

    ensembl.id <- read.table(file = "ensembl-id.txt")

    ## the sox9 gene is at position 68
    which(ensembl.id[,1] == "ENSMUSG00000000567")
    ## [1] 68

    ## query the first 1000 genes
    ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                       "mgi_symbol", "description"),
                        filter = "ensembl_gene_id",
                        ensembl.id[1:1000,1],
                        mart = ensembl, uniqueRows = TRUE)

    ## sox9 is there
    ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
    ##        ensembl_gene_id refseq_mrna mgi_symbol
    ## 141 ENSMUSG00000000567   NM_011448       Sox9
    ##                                                     description
    ## 141 SRY-box containing gene 9 [Source:MGI Symbol;Acc:MGI:98371]

    ## query all the genes
    ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                       "mgi_symbol", "description"),
                        filter = "ensembl_gene_id",
                        ensembl.id,
                        mart = ensembl, uniqueRows = TRUE)

    ## sox9 is missing
    ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
    ## [1] ensembl_gene_id refseq_mrna     mgi_symbol      description
    ## <0 rows> (or 0-length row.names)

    ## genes 1:12262, sox9 is included
    ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                       "mgi_symbol", "description"),
                        filter = "ensembl_gene_id",
                        ensembl.id[1:12262,],
                        mart = ensembl, uniqueRows = TRUE)

    ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
    ##        ensembl_gene_id refseq_mrna mgi_symbol
    ## 141 ENSMUSG00000000567   NM_011448       Sox9
    ##                                                     description
    ## 141 SRY-box containing gene 9 [Source:MGI Symbol;Acc:MGI:98371]

    ## genes 1:12263, sox9 is not included
    ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                       "mgi_symbol", "description"),
                        filter = "ensembl_gene_id",
                        ensembl.id[1:12263,],
                        mart = ensembl, uniqueRows = TRUE)

    ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
    ## [1] ensembl_gene_id refseq_mrna     mgi_symbol      description
    ## <0 rows> (or 0-length row.names)

    ## but sox9 is included when refseq is omitted
    ensembl.df <- getBM(attributes = c("ensembl_gene_id",
                                       # "refseq_mrna",
                                       "mgi_symbol", "description"),
                        filter = "ensembl_gene_id",
                        ensembl.id[1:12263,],
                        mart = ensembl, uniqueRows = TRUE)

    ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
    ##       ensembl_gene_id mgi_symbol
    ## 68 ENSMUSG00000000567       Sox9
    ##                                                    description
    ## 68 SRY-box containing gene 9 [Source:MGI Symbol;Acc:MGI:98371]

    ## the problem is not due to ensembl id #12263, because here sox9 is present
    ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                       "mgi_symbol", "description"),
                        filter = "ensembl_gene_id",
                        ensembl.id[c(1:382,1400:nrow(ensembl.id)),],
                        mart = ensembl, uniqueRows = TRUE)

    ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
    ##        ensembl_gene_id refseq_mrna mgi_symbol
    ## 141 ENSMUSG00000000567   NM_011448       Sox9
    ##                                                     description
    ## 141 SRY-box containing gene 9 [Source:MGI Symbol;Acc:MGI:98371]

    ## but one more gene, and sox9 is missing
    ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                       "mgi_symbol", "description"),
                        filter = "ensembl_gene_id",
                        ensembl.id[c(1:383,1400:nrow(ensembl.id)),],
                        mart = ensembl, uniqueRows = TRUE)

    ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
    ## [1] ensembl_gene_id refseq_mrna     mgi_symbol      description
    ## <0 rows> (or 0-length row.names)

    sessionInfo()
    ## R version 3.0.1 (2013-05-16)
    ## Platform: x86_64-unknown-linux-gnu (64-bit)
    ##
    ## locale:
    ##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
    ##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
    ##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
    ##  [7] LC_PAPER=C                 LC_NAME=C
    ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
    ## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
    ##
    ## attached base packages:
    ## [1] stats     graphics  grDevices utils     datasets  methods   base
    ##
    ## other attached packages:
    ## [1] biomaRt_2.16.0
    ##
    ## loaded via a namespace (and not attached):
    ## [1] compiler_3.0.1 RCurl_1.95-4.1 tools_3.0.1    XML_3.98-1.1

ADD COMMENT • link 10.2 years ago Georg Otto ▴ 120
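Since the tests above indicate that the failure depends on the number of IDs in a single query rather than on particular genes, one possible workaround (not confirmed anywhere in this thread) is to query in smaller batches and combine the results. The sketch below is an assumption-laden illustration: `query_in_chunks` and `fake_query` are hypothetical names, the batch size of 1000 is arbitrary, and the query function is passed in as an argument so the example runs without a network connection.

```r
# Hypothetical helper: run `query_fun` on batches of at most `chunk_size` IDs
# and row-bind the resulting data frames.
query_in_chunks <- function(ids, query_fun, chunk_size = 1000) {
  ids <- unique(as.character(ids))
  if (length(ids) == 0) {
    return(query_fun(ids))
  }
  # Assign each ID a batch number: 1, 1, ..., 2, 2, ...
  batch <- ceiling(seq_along(ids) / chunk_size)
  pieces <- lapply(split(ids, batch), query_fun)
  combined <- do.call(rbind, pieces)
  rownames(combined) <- NULL
  unique(combined)
}

# Stand-in for a real query so the sketch runs offline (toy IDs, not real data).
fake_query <- function(chunk) {
  data.frame(ensembl_gene_id = chunk,
             refseq_mrna = paste0("tx_", chunk),
             stringsAsFactors = FALSE)
}

chunked <- query_in_chunks(paste0("id_", 1:25), fake_query, chunk_size = 10)
```

With biomaRt, the stand-in could presumably be replaced by a function wrapping the call from the posts above, for example `function(chunk) getBM(attributes = c("ensembl_gene_id", "refseq_mrna"), filters = "ensembl_gene_id", values = chunk, mart = ensembl)`. Whether batching actually avoids the missing RefSeq IDs would still need to be verified against the server.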
