biomaRt- incorrect number of transcripts

steffen@stat.Berkeley.EDU ▴ 600

@steffenstatberkeleyedu-2907

Last seen 11.3 years ago

Dear Robert,

Would it be possible to check whether there are duplicates in the result you obtain via the web? By default biomaRt retrieves only unique results, whereas results queried over the web are sometimes duplicated. To remove these, you need to check the "unique only" checkbox when exporting your results to a file. Can you let me know if that explains the difference in the number of transcripts you notice?

Cheers,
Steffen

> Dear mailing list,
>
> I have recently observed a discrepancy in the genome annotation obtained
> via the R package biomaRt.
> I wanted to download all Ensembl transcripts for the entire mouse
> genome (chromosomes 1:19, X, Y and MT only).
>
> When I set the filter based on chromosome names, I retrieved ~36000
> transcripts; please see the code below.
> However, using the web service www.biomart.org, I received ~48000
> transcripts for the same genome version and chromosomes.
>
> Comparing these two data frames shows that the discrepancies
> in the number of transcripts occur only for some chromosomes (3:9 and X).
> If I specify only two chromosome names (2 and 3), then the number of
> downloaded transcripts is correct for both of them.
> If I do not set any filter in the getBM function and do the filtering
> manually in R, the number of transcripts is correct.
>
> Session info is attached.
>
> Best Regards
> Robert
>
> --
> Robert Ivanek
> Postdoctoral Fellow Schuebeler Group
> Friedrich Miescher Institute
> Maulbeerstrasse 66
> 4058 Basel / Switzerland
> Office phone: +41 61 697 6100
>
> R> library("biomaRt")
> R> ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
> R> chroms <- c(1:19,"X","Y","MT")
> R> table(getBM(attributes = c("ensembl_transcript_id",
>      "chromosome_name", "strand", "transcript_start", "transcript_end"),
>      filters = "chromosome_name", values = chroms,
>      mart = ensembl)$chromosome_name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 1080 1454
>    5    6    7    8    9   MT    X    Y
>  845 1209 1487 1129 1031   41 2072   17
>
> R> ens.web <- read.delim("../../../mart_export.txt",stringsAsFactors=F)
> R> ens.web <- ens.web[ens.web$Chromosome.Name %in% chroms,]
> R> table(ens.web$Chromosome.Name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>    5    6    7    8    9   MT    X    Y
> 2822 2524 3919 2021 2163   41 3297   17
>
> R> table(getBM(attributes = c("ensembl_transcript_id",
>      "chromosome_name", "strand", "transcript_start", "transcript_end"),
>      filters = "chromosome_name", values = c("2","3","MT"),
>      mart = ensembl)$chromosome_name)
>
>    2    3   MT
> 5232 2179   41
>
> R> ens.r <- getBM(attributes = c("ensembl_transcript_id",
>      "chromosome_name", "strand", "transcript_start", "transcript_end"),
>      mart = ensembl)
> R> ens.r <- ens.r[ens.r$chromosome_name %in% chroms,]
> R> table(ens.r$chromosome_name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>    5    6    7    8    9   MT    X    Y
> 2822 2524 3919 2021 2163   41 3297   17
>
> R> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] biomaRt_2.2.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.3-0 tools_2.10.0 XML_2.6-0
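For readers following the numbers, the per-chromosome tables quoted in the original message can be compared directly. The sketch below only re-enters the counts printed in the thread (it does not query BioMart) and lists the chromosomes whose counts differ; the variable names `filtered`, `web`, `delta` and `differing` are illustrative, not part of biomaRt.

```r
# Chromosome labels in the order R's table() printed them in the thread.
chroms <- c(1, 10:19, 2:9, "MT", "X", "Y")

# Counts from getBM() with the chromosome_name filter set to all chromosomes.
filtered <- c(2507, 1869, 4364, 1501, 1630, 1624, 1404, 1522, 1865, 985, 1245,
              5232, 1080, 1454, 845, 1209, 1487, 1129, 1031, 41, 2072, 17)

# Counts from the web export (identical to the unfiltered getBM() result).
web <- c(2507, 1869, 4364, 1501, 1630, 1624, 1404, 1522, 1865, 985, 1245,
         5232, 2179, 3997, 2822, 2524, 3919, 2021, 2163, 41, 3297, 17)

names(filtered) <- names(web) <- chroms

# Per-chromosome difference and the chromosomes where the two sources disagree.
delta <- web - filtered
differing <- names(delta)[delta != 0]

print(differing)
print(c(filtered = sum(filtered), web = sum(web)))
```

The totals come to 36113 and 48728, matching the "~36000" and "~48000" figures in the message, and the differing chromosomes are 3 to 9 and X.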

biomaRt biomaRt • 1.7k views

ADD COMMENT • link updated 16.1 years ago by Rhoda Kinsella ▴ 660 • written 16.1 years ago by steffen@stat.Berkeley.EDU ▴ 600

Ivanek, Robert ▴ 100

@ivanek-robert-3765

Last seen 11.3 years ago

Dear Steffen,

I originally used only unique transcripts, so I do not think this is the reason for the different number of transcripts. As you can see in the original mail, the biomaRt package returns an incorrect number of transcripts only for some chromosomes, and only when the filter is set to all mouse chromosomes. If you set the filter to, for example, just three chromosomes, the number of transcripts is correct. The number of transcripts is also correct if you do not set any filter.

Regards,
Robert

-----Original Message-----
From: Steffen@stat.Berkeley.EDU [mailto:Steffen@stat.Berkeley.EDU]
Sent: Tuesday, November 10, 2009 8:26 PM
To: Ivanek, Robert
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] biomaRt- incorrect number of transcripts

> Dear Robert,
>
> Would it be possible to check if there are duplicates in the result you
> obtain via the web? By default biomaRt will retrieve only unique results,
> sometimes when you query over the web results are duplicated. To remove
> these you need to check the unique only checkbox when exporting your
> results to a file. Can you let me know if that explains the difference in
> number of transcripts you notice?
>
> Cheers,
> Steffen
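One way to tell the two explanations apart is to compare transcript IDs rather than row counts: if duplicated rows were the whole story, both result sets would contain the same set of distinct IDs. The sketch below uses small made-up data frames in place of `ens.r` and `ens.web`; the IDs are invented, and the web column name `Ensembl.Transcript.ID` is an assumption (the thread only shows `Chromosome.Name`).

```r
# Hypothetical stand-in for the filtered getBM() result (IDs are invented).
ens.r <- data.frame(
  ensembl_transcript_id = c("T1", "T2", "T3"),
  chromosome_name       = c("2", "3", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the web export; the column names are assumed.
ens.web <- data.frame(
  Ensembl.Transcript.ID = c("T1", "T2", "T3", "T4", "T4"),
  Chromosome.Name       = c("2", "3", "3", "3", "3"),
  stringsAsFactors = FALSE
)

# IDs present in the web export but absent from the filtered query.
missing_ids <- setdiff(ens.web$Ensembl.Transcript.ID,
                       ens.r$ensembl_transcript_id)

# Number of distinct IDs on each side. Equal numbers with an empty
# missing_ids would point to duplicated rows; unequal numbers would
# point to transcripts that are genuinely absent from one result.
n_unique_web <- length(unique(ens.web$Ensembl.Transcript.ID))
n_unique_r   <- length(unique(ens.r$ensembl_transcript_id))

print(missing_ids)
print(c(web = n_unique_web, r = n_unique_r))
```

In this toy case `T4` appears twice in the web export and never in the filtered result, so the check reports both a duplicated row and a missing transcript.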

ADD COMMENT • link 16.1 years ago Ivanek, Robert ▴ 100

Rhoda Kinsella ▴ 660

@rhoda-kinsella-3200

Last seen 11.3 years ago

Dear Robert,

I have looked into this query and it seems that you did not retrieve unique results from the BioMart interface. I ran your query with the webExample.pl script provided in the biomart-perl directory, once with uniqueRows selected and once without. Without uniqueRows I get ~48000 rows; with uniqueRows I get ~36000 rows. I have attached the XML for the query I performed with uniqueRows selected (uniqueRows = "1").

    <Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1" count="" datasetConfigVersion="0.7">
      <Dataset name="mmusculus_gene_ensembl" interface="default">
        <Filter name="chromosome_name" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT"/>
        <Attribute name="ensembl_transcript_id"/>
        <Attribute name="chromosome_name"/>
        <Attribute name="strand"/>
        <Attribute name="transcript_start"/>
        <Attribute name="transcript_end"/>
      </Dataset>
    </Query>

I hope this resolves the issue for you, but please do not hesitate to contact me if you need further clarification.

Kind regards,
Rhoda

On 10 Nov 2009, at 19:25, Steffen at stat.berkeley.edu wrote:

> Dear Robert,
>
> Would it be possible to check if there are duplicates in the result you
> obtain via the web? By default biomaRt will retrieve only unique results,
> sometimes when you query over the web results are duplicated. To remove
> these you need to check the unique only checkbox when exporting your
> results to a file. Can you let me know if that explains the difference in
> number of transcripts you notice?
>
> Cheers,
> Steffen

Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SD, UK.

ADD COMMENT • link 16.1 years ago Rhoda Kinsella ▴ 660
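The effect of uniqueRows described in the reply above can be approximated locally on an already downloaded export. This is a minimal sketch on invented rows, assuming the export was read into a data frame; it uses base R `duplicated()` rather than anything from biomaRt or biomart-perl. Recent biomaRt releases also accept a `uniqueRows` argument in `getBM()`; whether version 2.2.0 did is not confirmed in this thread.

```r
# Hypothetical rows standing in for an export made without unique rows;
# the transcript IDs and strands are invented for illustration.
export <- data.frame(
  ensembl_transcript_id = c("T1", "T2", "T2", "T3", "T3", "T3"),
  chromosome_name       = c("2", "3", "3", "4", "4", "4"),
  strand                = c(1, -1, -1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Flag rows that repeat an earlier row across all columns.
is_dup <- duplicated(export)

# Keep one copy of each row, as a unique-rows export would.
export_unique <- export[!is_dup, ]

# Per-chromosome counts before and after, in the style of the thread's tables.
counts_raw    <- table(export$chromosome_name)
counts_unique <- table(export_unique$chromosome_name)

print(counts_raw)
print(counts_unique)
```

If removing duplicated rows from the web export brings each per-chromosome count down to the filtered `getBM()` figure, the uniqueRows explanation holds; if the counts stay higher, the extra rows are distinct transcripts.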
