Ivanek, Robert
Dear mailing list,
I have recently observed a discrepancy in the genome annotation obtained
via the R package biomaRt.
I wanted to download all Ensembl transcripts for the entire mouse
genome (chromosomes 1 to 19, X, Y and MT only).
When I set the filter on chromosome names, I retrieved ~36000
transcripts; please see the code below.
However, using the web service at www.biomart.org, I received ~48000
transcripts for the same genome version and chromosomes.
Comparing the two data frames shows that the discrepancy in the
number of transcripts occurs only for some chromosomes (3 to 9 and X).
If I specify only two chromosome names (2 and 3), then the number of
downloaded transcripts is correct for both of them.
If I do not set any filter in the getBM function and do the filtering
manually in R, the number of transcripts is correct.
Session info is included at the end.
Best Regards
Robert
--
Robert Ivanek
Postdoctoral Fellow Schuebeler Group
Friedrich Miescher Institute
Maulbeerstrasse 66
4058 Basel / Switzerland
Office phone: +41 61 697 6100
R> library("biomaRt")
R> ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
R> chroms <- c(1:19,"X","Y","MT")
R> table(getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
+                            "strand", "transcript_start", "transcript_end"),
+              filters = "chromosome_name", values = chroms,
+              mart = ensembl)$chromosome_name)
   1   10   11   12   13   14   15   16   17   18   19    2    3    4
2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 1080 1454
   5    6    7    8    9   MT    X    Y
 845 1209 1487 1129 1031   41 2072   17
R> ens.web <- read.delim("../../../mart_export.txt", stringsAsFactors = FALSE)
R> ens.web <- ens.web[ens.web$Chromosome.Name %in% chroms,]
R> table(ens.web$Chromosome.Name)
   1   10   11   12   13   14   15   16   17   18   19    2    3    4
2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
   5    6    7    8    9   MT    X    Y
2822 2524 3919 2021 2163   41 3297   17
R> table(getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
+                            "strand", "transcript_start", "transcript_end"),
+              filters = "chromosome_name", values = c("2", "3", "MT"),
+              mart = ensembl)$chromosome_name)
   2    3   MT
5232 2179   41
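Since the query restricted to chromosomes 2, 3 and MT returns the expected counts, one possible workaround might be to query one chromosome at a time and stack the results. This is only a sketch and I have not verified it against the server: `fetch_by_chromosome`, `fetch_one` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` is an offline stand-in for the real `getBM` call so that the example runs without a network connection. It does not address the underlying cause of the discrepancy.

```r
# Hypothetical helper: run one query per chromosome and stack the results.
# `fetch_one` is any function that takes a single chromosome name and
# returns a data frame, e.g. a thin wrapper around getBM() such as
#   function(chrom) getBM(attributes = c("ensembl_transcript_id",
#                                        "chromosome_name"),
#                         filters = "chromosome_name", values = chrom,
#                         mart = ensembl)
fetch_by_chromosome <- function(chroms, fetch_one) {
  pieces <- lapply(as.character(chroms), fetch_one)
  do.call(rbind, pieces)
}

# Offline stand-in for getBM(): returns two made-up rows per chromosome.
fake_fetch <- function(chrom) {
  data.frame(ensembl_transcript_id = paste0("T_", chrom, "_", 1:2),
             chromosome_name = chrom,
             stringsAsFactors = FALSE)
}

res <- fetch_by_chromosome(c(2, 3, "MT"), fake_fetch)
print(table(res$chromosome_name))
```

The per-chromosome tables from such a loop could then be compared with the web export in the same way as the tables in this post.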
R> ens.r <- getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
+                               "strand", "transcript_start", "transcript_end"),
+                 mart = ensembl)
R> ens.r <- ens.r[ens.r$chromosome_name %in% chroms,]
R> table(ens.r$chromosome_name)
   1   10   11   12   13   14   15   16   17   18   19    2    3    4
2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
   5    6    7    8    9   MT    X    Y
2822 2524 3919 2021 2163   41 3297   17
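To make the difference between the tables easier to see, the counts can be subtracted per chromosome. The following is a minimal sketch that uses only the numbers printed in this post; the vector names `filtered` (getBM with the chromosome filter) and `web` (www.biomart.org export, identical to the unfiltered getBM result) are labels chosen here for illustration.

```r
# Per-chromosome transcript counts copied from the tables in this post.
chrom <- c("1", 10:19, 2:9, "MT", "X", "Y")

# getBM() with filters = "chromosome_name" and all 22 chromosomes
filtered <- c(2507, 1869, 4364, 1501, 1630, 1624, 1404, 1522, 1865, 985, 1245,
              5232, 1080, 1454, 845, 1209, 1487, 1129, 1031, 41, 2072, 17)

# web export (same counts as the unfiltered getBM() result)
web <- c(2507, 1869, 4364, 1501, 1630, 1624, 1404, 1522, 1865, 985, 1245,
         5232, 2179, 3997, 2822, 2524, 3919, 2021, 2163, 41, 3297, 17)

names(filtered) <- names(web) <- chrom

# Transcripts missing from the filtered query, per chromosome
missing <- web - filtered
affected <- missing[missing > 0]

print(affected)      # chromosomes with fewer transcripts in the filtered query
print(sum(missing))  # total number of missing transcripts
```

With these numbers the affected chromosomes are 3 to 9 and X, and the filtered query is short by 12615 transcripts in total (36113 versus 48728).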
R> sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-unknown-linux-gnu
locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=C
 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.2.0
loaded via a namespace (and not attached):
[1] RCurl_1.3-0 tools_2.10.0 XML_2.6-0