Question: Problem with filtering genes in biomaRt
5.3 years ago, Guest User wrote:
Hullo

I am getting some inconsistent results when I try to filter genes by chromosome while retrieving them from "hsapiens_gene_ensembl".

If I specify a list of chromosomes to retrieve from, I get a different number of genes for some of the chromosomes than if I retrieve all the genes for all chromosomes, or if I retrieve the genes from each chromosome individually.

This is my full code for the problem.

Thanks in advance

    #!/usr/bin/Rscript --vanilla
    require(biomaRt)

    .getGenes <- function(chrs=list())
    {
      # use biomart to get ranges for feature
      hg19 <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
      if(length(chrs)>0)
      {
        # filtered query: only the requested chromosomes
        getBM(attributes=c("chromosome_name",
                           "start_position",
                           "end_position",
                           "strand",
                           "ensembl_gene_id",
                           "external_gene_id",
                           "gene_biotype"),
              filter="chromosome_name",
              values=chrs,
              mart=hg19)
      }else{
        # unfiltered query: every gene in the dataset
        getBM(attributes=c("chromosome_name",
                           "start_position",
                           "end_position",
                           "strand",
                           "ensembl_gene_id",
                           "external_gene_id",
                           "gene_biotype"),
              mart=hg19)
      }
    }

    unfiltered<-.getGenes()

    chrs<-c(unlist(seq(1,22)),"X","Y","MT")
    filtered<-.getGenes(chrs)

    # gene count per chromosome when each chromosome is queried on its own
    chronly<-lapply(chrs,
                    function(x){
                      length(.getGenes(x)$ensembl_gene_id)
                    })
    names(chronly)<-chrs

    allgenes<-table(unfiltered$chromosome_name)
    # genes present in the unfiltered result but absent from the filtered one
    missinggenes<-table(unfiltered$chromosome_name[which(
                          !(unfiltered$ensembl_gene_id
                            %in%
                            filtered$ensembl_gene_id))])

    unlist(lapply(chrs,
                  function(x){
                    totalmissing<-ifelse(is.na(missinggenes[x]),0,missinggenes[x])
                    paste("Missing genes on chr",x,totalmissing,"of",
                          "individually",chronly[[x]],"all",allgenes[x])
                  }))

    sessionInfo()

-- output of sessionInfo():

    Loading required package: biomaRt
    Loading required package: methods
     [1] "Missing genes on chr 1 0 of individually 5321 all 5321"
     [2] "Missing genes on chr 2 0 of individually 3990 all 3990"
     [3] "Missing genes on chr 3 0 of individually 3043 all 3043"
     [4] "Missing genes on chr 4 0 of individually 2521 all 2521"
     [5] "Missing genes on chr 5 0 of individually 2856 all 2856"
     [6] "Missing genes on chr 6 0 of individually 2906 all 2906"
     [7] "Missing genes on chr 7 0 of individually 2818 all 2818"
     [8] "Missing genes on chr 8 607 of individually 2385 all 2385"
     [9] "Missing genes on chr 9 1892 of individually 2303 all 2303"
    [10] "Missing genes on chr 10 0 of individually 2216 all 2216"
    [11] "Missing genes on chr 11 0 of individually 3190 all 3190"
    [12] "Missing genes on chr 12 0 of individually 2819 all 2819"
    [13] "Missing genes on chr 13 0 of individually 1217 all 1217"
    [14] "Missing genes on chr 14 0 of individually 2237 all 2237"
    [15] "Missing genes on chr 15 0 of individually 2076 all 2076"
    [16] "Missing genes on chr 16 0 of individually 2360 all 2360"
    [17] "Missing genes on chr 17 0 of individually 2901 all 2901"
    [18] "Missing genes on chr 18 0 of individually 1113 all 1113"
    [19] "Missing genes on chr 19 0 of individually 2917 all 2917"
    [20] "Missing genes on chr 20 0 of individually 1322 all 1322"
    [21] "Missing genes on chr 21 0 of individually 720 all 720"
    [22] "Missing genes on chr 22 0 of individually 1208 all 1208"
    [23] "Missing genes on chr X 2028 of individually 2414 all 2414"
    [24] "Missing genes on chr Y 506 of individually 506 all 506"
    [25] "Missing genes on chr MT 37 of individually 37 all 37"
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-apple-darwin10.8.0 (64-bit)

    locale:
    [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

    attached base packages:
    [1] methods stats graphics grDevices utils datasets base

    other attached packages:
    [1] biomaRt_2.18.0

    loaded via a namespace (and not attached):
    [1] RCurl_1.95-4.1 XML_3.95-0.2

--
Sent via the guest posting facility at bioconductor.org.
Answer: Problem with filtering genes in biomaRt
5.3 years ago, Thomas Maurel (United Kingdom) wrote:
Dear Pieta,

As you have noticed in your queries, there is a danger of getting truncated results with biomart when querying a large organism such as human without any filters. To be sure that you get all the genes back, I would advise you to filter on each chromosome individually.

Hope this helps,
Thomas

--
Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
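The per-chromosome approach suggested in this answer can be sketched roughly as follows. This is a minimal sketch, not code from the thread: `fetch_by_chromosome` and `query_one` are hypothetical helper names, the attribute list is a subset of the one in the question, and the biomaRt call is only defined here, not executed, because it needs the package and a live server.

```r
# Hypothetical helper: run one query per chromosome and stack the results.
# `fetch_one` is any function that takes a single chromosome name and
# returns a data.frame (for example a wrapper around biomaRt::getBM).
fetch_by_chromosome <- function(chrs, fetch_one) {
  per_chr <- lapply(as.character(chrs), fetch_one)
  do.call(rbind, per_chr)
}

# Assumed wrapper around the getBM() call from the question. It requires
# biomaRt and network access, so it is defined but not run in this sketch.
query_one <- function(chr, mart) {
  biomaRt::getBM(attributes = c("chromosome_name", "start_position",
                                "end_position", "strand",
                                "ensembl_gene_id", "gene_biotype"),
                 filters = "chromosome_name",
                 values = chr,
                 mart = mart)
}

# Usage against a live server (not run):
# mart  <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# genes <- fetch_by_chromosome(c(1:22, "X", "Y", "MT"),
#                              function(chr) query_one(chr, mart))
```

Passing the query in as a function keeps the loop independent of the server, so the combining step can be checked with a stub before any real request is made.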
Hi Thomas,

A "big" query like this has never been a problem before. Can you shed some light on what could be causing this? Were there some changes at Ensembl?

Best regards,
Steffen

written 5.3 years ago by Steffen Durinck
Dear Steffen,

We have had similar issues caused by server load in the past. This query is pointing to biomart.org, and since we have no control over that server we can't check.

Hope this helps.

Regards,
Thomas

written 5.3 years ago by Thomas Maurel
Hi,

In addition to Thomas' advice, you might find it helpful to use some of the offline resources Bioconductor makes available for these purposes.

For instance, it looks like you want to retrieve all of the genes (and their genomic coordinates) for human. The GenomicFeatures package can compile this information for you from different online resources (e.g. UCSC or BioMart) and will construct an object that you can query offline.

There is already a "transcript database" package constructed for you using UCSC gene models:

http://bioconductor.org/packages/release/data/annotation/html/TxDb.Hsapiens.UCSC.hg19.knownGene.html

But you could also build a custom transcript db package using Ensembl transcripts alone. Reading through the documentation (vignettes) available here should get you started:

http://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech

written 5.3 years ago by Steve Lianoglou
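As a rough illustration of the offline route Steve describes, the sketch below counts genes per chromosome from the prebuilt UCSC transcript database package. The function name `count_genes_offline` is hypothetical, and the code assumes the annotation package is installed from Bioconductor; if it is not, the function returns NULL rather than failing. Note that UCSC sequence names carry a "chr" prefix (for example "chr1"), so the counts are not directly comparable with Ensembl chromosome names or gene models.

```r
# Hypothetical helper: count genes per chromosome from an offline TxDb
# annotation package instead of querying a BioMart server.
count_genes_offline <- function(pkg = "TxDb.Hsapiens.UCSC.hg19.knownGene") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package ", pkg, " is not installed; skipping.")
    return(NULL)
  }
  # Attaching the TxDb package also attaches GenomicFeatures and its
  # dependencies, which provide genes() and seqnames().
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  txdb <- get(pkg)              # the TxDb object is named after its package
  gr <- genes(txdb)             # one genomic range per gene
  table(as.character(seqnames(gr)))
}

# Usage (needs the annotation package installed):
# counts <- count_genes_offline()
# counts[["chr8"]]
```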
Answer: Problem with filtering genes in biomaRt
5.3 years ago, Thomas Maurel (United Kingdom) wrote:
Hi Pieta, I've just noticed that you are querying against biomart.org, could you please try the same test but with your host pointing to ensembl.org: mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org", path="/biomart/martservice") Please let me know if this changes anything. Cheers, Thomas On 7 Feb 2014, at 10:49, Pieta Schofield <p.schofield@dundee.ac.uk> wrote: > Hullo Thomas > > Thank you for the reply, the problem was not simply truncating when not using any filter, it was when the filter list included more than around 5 terms, but yes I have no problem when I do the chromosomes individually and seeing I have access to parallel cores and it makes sense to do downstream processing in parallel I have amended my code so each core now just pulls the chromosome annotations it is going to work on. > > thanks again > > Pieta > On 7 Feb 14, at 10:16, Thomas Maurel <maurel@ebi.ac.uk> wrote: > >> Dear Pieta, >> >> As you have noticed in your queries, there is a danger of getting truncated results with biomart when querying a big organisms such as human without any filters. To be sure that you get all the genes back I would advise you to filter on each chromosomes individually. >> >> Hope this helps, >> Thomas >> On 4 Feb 2014, at 17:45, "Pieta Schofield [guest]" <guest@bioconductor.org> wrote: >> >>> >>> Hullo >>> >>> I am getting some inconsistent results with my attempts to filter genes by chromosome when retrieving them form "hsapiens_gene_ensembl" >>> >>> if I specify a list of chromosomes to retrieve from I get a different number for some of the chromosomes than if i retrieve all the genes for all chromosomes or if I retrieve the genes from the chromosomes individually. 
--
Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
written 5.3 years ago by Thomas Maurel
Answer: Problem with filtering genes in biomaRt
5.3 years ago, Thomas Maurel (United Kingdom) wrote:
Hi Pieta,

I am happy to hear that.

The BioMart central portal at www.biomart.org is the default server used by biomaRt and holds BioMart databases from different projects. The Ensembl website (www.ensembl.org) holds the Ensembl mart databases that we generate in the Ensembl project. When you point your host to:

    mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org", path="/biomart/martservice")

you use the www.ensembl.org server instead of the biomart.org server.

Hope this helps,
Thomas

On 7 Feb 2014, at 16:35, Pieta Schofield <p.schofield@dundee.ac.uk> wrote:

> Hullo Thomas
>
> That seems to have worked, no missing genes
>
> Loading required package: biomaRt
> Loading required package: methods
> [1] "Missing genes on chr 1 0 of individually 5363 all 5363"
> [2] "Missing genes on chr 2 0 of individually 4047 all 4047"
> [3] "Missing genes on chr 3 0 of individually 3101 all 3101"
> [4] "Missing genes on chr 4 0 of individually 2563 all 2563"
> [5] "Missing genes on chr 5 0 of individually 2859 all 2859"
> [6] "Missing genes on chr 6 0 of individually 2905 all 2905"
> [7] "Missing genes on chr 7 0 of individually 2876 all 2876"
> [8] "Missing genes on chr 8 0 of individually 2386 all 2386"
> [9] "Missing genes on chr 9 0 of individually 2323 all 2323"
> [10] "Missing genes on chr 10 0 of individually 2260 all 2260"
> [11] "Missing genes on chr 11 0 of individually 3208 all 3208"
> [12] "Missing genes on chr 12 0 of individually 2818 all 2818"
> [13] "Missing genes on chr 13 0 of individually 1217 all 1217"
> [14] "Missing genes on chr 14 0 of individually 2244 all 2244"
> [15] "Missing genes on chr 15 0 of individually 2080 all 2080"
> [16] "Missing genes on chr 16 0 of individually 2343 all 2343"
> [17] "Missing genes on chr 17 0 of individually 2903 all 2903"
> [18] "Missing genes on chr 18 0 of individually 1127 all 1127"
> [19] "Missing genes on chr 19 0 of individually 2910 all 2910"
> [20] "Missing genes on chr 20 0 of individually 1317 all 1317"
> [21] "Missing genes on chr 21 0 of individually 736 all 736"
> [22] "Missing genes on chr 22 0 of individually 1263 all 1263"
> [23] "Missing genes on chr X 0 of individually 2392 all 2392"
> [24] "Missing genes on chr Y 0 of individually 495 all 495"
> [25] "Missing genes on chr MT 0 of individually 37 all 37"
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] methods stats graphics grDevices utils datasets base
>
> other attached packages:
> [1] biomaRt_2.18.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.95-4.1 XML_3.95-0.2
>
> Thanks, what is the difference?
>
> Pieta
>
> On 7 Feb 14, at 16:18, Thomas Maurel <maurel@ebi.ac.uk> wrote:
>
>> Hi Pieta,
>>
>> I've just noticed that you are querying against biomart.org. Could you please try the same test with your host pointing to ensembl.org:
>>
>>     mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org", path="/biomart/martservice")
>>
>> Please let me know if this changes anything.
>> Cheers,
>> Thomas
>>
>> On 7 Feb 2014, at 10:49, Pieta Schofield <p.schofield@dundee.ac.uk> wrote:
>>
>>> Hullo Thomas
>>>
>>> Thank you for the reply. The problem was not simply truncation when no filter is used; it occurred when the filter list included more than around 5 terms. But yes, I have no problem when I query the chromosomes individually. Since I have access to parallel cores and it makes sense to do the downstream processing in parallel, I have amended my code so that each core now pulls only the chromosome annotations it is going to work on.
>>>
>>> thanks again
>>>
>>> Pieta
>>>
>>> On 7 Feb 14, at 10:16, Thomas Maurel <maurel@ebi.ac.uk> wrote:
>>>
>>>> Dear Pieta,
>>>>
>>>> As you have noticed in your queries, there is a danger of getting truncated results with BioMart when querying a large organism such as human without any filters. To be sure that you get all the genes back, I would advise you to filter on each chromosome individually.
>>>>
>>>> Hope this helps,
>>>> Thomas
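For anyone following the thread, here is a rough sketch in R of the two ideas discussed above: querying one chromosome at a time, and counting which genes a filtered query dropped. The helper names `count_missing` and `get_genes_by_chr`, and the toy data frames, are made up for illustration and are not part of biomaRt. The `useMart()` call is the one quoted in the thread; the live query sits inside `if (FALSE)` because it needs network access, and the host and attribute names may differ in newer biomaRt or Ensembl releases.

```r
# Count, per chromosome, how many genes in `unfiltered` are absent from
# `filtered`. Both are data frames with the columns chromosome_name and
# ensembl_gene_id, as requested from getBM() in the thread.
# (Hypothetical helper, not a biomaRt function.)
count_missing <- function(unfiltered, filtered, chrs) {
  absent <- !(unfiltered$ensembl_gene_id %in% filtered$ensembl_gene_id)
  # factor(..., levels = chrs) keeps a zero count for chromosomes
  # with nothing missing, so no is.na() check is needed afterwards.
  missing <- table(factor(unfiltered$chromosome_name[absent], levels = chrs))
  total <- table(factor(unfiltered$chromosome_name, levels = chrs))
  data.frame(chr = chrs,
             missing = as.integer(missing),
             total = as.integer(total),
             stringsAsFactors = FALSE)
}

# Fetch one chromosome at a time and stack the results.
# `fetch` is any function(chr) that returns a data frame.
# (Hypothetical helper, not a biomaRt function.)
get_genes_by_chr <- function(chrs, fetch) {
  do.call(rbind, lapply(chrs, fetch))
}

# Toy data standing in for real query results (invented values).
unfiltered <- data.frame(chromosome_name = c("1", "1", "X", "MT"),
                         ensembl_gene_id = c("g1", "g2", "g3", "g4"),
                         stringsAsFactors = FALSE)
filtered <- unfiltered[unfiltered$chromosome_name == "1", ]
result <- count_missing(unfiltered, filtered, c("1", "X", "MT"))
print(result)

# Against the real service (needs network access, so it is not run here).
if (FALSE) {
  library(biomaRt)
  mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                  dataset = "hsapiens_gene_ensembl",
                  host = "www.ensembl.org",
                  path = "/biomart/martservice")
  fetch <- function(chr) {
    res <- getBM(attributes = c("chromosome_name", "ensembl_gene_id"),
                 filters = "chromosome_name", values = chr, mart = mart)
    # Keep the column type consistent between numeric and X/Y/MT results.
    res$chromosome_name <- as.character(res$chromosome_name)
    res
  }
  chrs <- c(1:22, "X", "Y", "MT")
  genes <- get_genes_by_chr(chrs, fetch)
}
```

Passing the fetch function as an argument keeps the per-chromosome loop independent of the server, so the same loop could be handed to a parallel apply if, as Pieta describes, each core handles its own chromosomes.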
written 5.3 years ago by Thomas Maurel