steffen@stat.Berkeley.EDU
Dear Robert,
Would it be possible to check whether there are duplicates in the results
you obtain via the web? By default biomaRt retrieves only unique results,
whereas results from a web query are sometimes duplicated. To remove the
duplicates, you need to tick the "unique only" checkbox when exporting your
results to a file. Can you let me know whether that explains the difference
in the number of transcripts you noticed?
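
A minimal sketch of such a duplicate check in R, using a small toy data frame in place of the real web export. The column name Ensembl.Transcript.ID and the transcript IDs are placeholders (assumptions, not taken from an actual export); Chromosome.Name is the column name used in the code quoted below.

```r
# Toy stand-in for the web export. In practice this would be the result of
# read.delim() on the exported file. IDs and the ID column name are assumed.
ens.web <- data.frame(
  Ensembl.Transcript.ID = c("T1", "T2", "T2", "T3", "T3", "T3"),
  Chromosome.Name       = c("2",  "3",  "3",  "4",  "4",  "4"),
  stringsAsFactors = FALSE
)

# Count rows that are exact repeats of an earlier row
n.dup <- sum(duplicated(ens.web))

# Keep one copy of each row
ens.web.unique <- unique(ens.web)

# Per-chromosome counts before and after removing duplicates
counts.raw    <- table(ens.web$Chromosome.Name)
counts.unique <- table(ens.web.unique$Chromosome.Name)
print(counts.raw)
print(counts.unique)
```

If n.dup is zero for the real export, duplicated rows would not account for the difference in counts.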
Cheers,
Steffen
> Dear mailing list,
>
> I have recently observed a discrepancy in the genome annotation obtained
> via the R package biomaRt.
> I wanted to download all Ensembl transcripts from the entire mouse
> genome (chromosomes 1:19, X, Y and MT only).
>
> When I set the filter based on chromosome names, I retrieved ~36000
> transcripts; please see the code below.
> However, using the web service www.biomart.org, I received ~48000
> transcripts for the same genome version and chromosomes.
>
> Comparing these two data frames shows that the discrepancies
> in the number of transcripts occur only for some chromosomes (3:9 and X).
> If I specify only two chromosome names (2 and 3), then the number of
> downloaded transcripts is correct for both of them.
> If I do not set any filter in the getBM function and do the filtering
> manually in R, the number of transcripts is correct.
>
> Session info is attached.
>
> Best Regards
> Robert
>
> --
> Robert Ivanek
> Postdoctoral Fellow Schuebeler Group
> Friedrich Miescher Institute
> Maulbeerstrasse 66
> 4058 Basel / Switzerland
> Office phone: +41 61 697 6100
>
>
> R> library("biomaRt")
> R> ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
> R> chroms <- c(1:19,"X","Y","MT")
> R> table(getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
>                             "strand", "transcript_start", "transcript_end"),
>              filters = "chromosome_name", values = chroms,
>              mart = ensembl)$chromosome_name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 1080 1454
>    5    6    7    8    9   MT    X    Y
>  845 1209 1487 1129 1031   41 2072   17
>
> R> ens.web <- read.delim("../../../mart_export.txt", stringsAsFactors = FALSE)
> R> ens.web <- ens.web[ens.web$Chromosome.Name %in% chroms,]
> R> table(ens.web$Chromosome.Name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>    5    6    7    8    9   MT    X    Y
> 2822 2524 3919 2021 2163   41 3297   17
>
> R> table(getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
>                             "strand", "transcript_start", "transcript_end"),
>              filters = "chromosome_name", values = c("2","3","MT"),
>              mart = ensembl)$chromosome_name)
>
>    2    3   MT
> 5232 2179   41
>
>
> R> ens.r <- getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
>                                "strand", "transcript_start", "transcript_end"),
>                 mart = ensembl)
> R> ens.r <- ens.r[ens.r$chromosome_name %in% chroms,]
> R> table(ens.r$chromosome_name)
>
>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>    5    6    7    8    9   MT    X    Y
> 2822 2524 3919 2021 2163   41 3297   17
>
>
>
> R> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=C
>  [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
>
> other attached packages:
> [1] biomaRt_2.2.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.3-0 tools_2.10.0 XML_2.6-0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>