Dear list,
I am trying to retrieve the 5' flanking sequences and the 5' UTR for several
genes.
Doing this via biomart.org and via biomaRt yields different results.
For example, I want to retrieve the 5' flanking sequence (3000 bases) plus
the 5' UTR for the gene with the Entrez ID 23704.
My R code:
library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
myseq <- getSequence(id = c(23704), type = "entrezgene",
                     seqType = "coding_gene_flank", upstream = 3154,
                     mart = ensembl)
The length of the 5' UTR for this gene is exactly 154, so this query
should yield 3000 upstream bases plus the 5' UTR.
But doing this via biomart.org, I get the following:
http://www.ensembl.org/Homo_sapiens/Gene/Export?db=core;g=ENSG00000152049;output=fasta;r=2:223916862-223920353;strand=feature;t=ENST00000281830;time=1253696359.47541;st=utr5;genomic=5_flanking;_format=HTML
The length of both sequences is 3154, but if you BLAST them, they do not
align perfectly.
What am I missing?
Can it be related to the fact that biomaRt is using the dataset
hsapiens_gene_ensembl, version NCBI35,
and biomart.org is using the Homo sapiens genes, GRCh37?
Thanks a lot,
Tefina
> sessionInfo()
R version 2.9.1 (2009-06-26)
i386-pc-mingw32
locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
Kingdom.1252;LC_MONETARY=English_United
Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.12.8 IRanges_1.2.3 biomaRt_2.0.0
loaded via a namespace (and not attached):
[1] Biobase_2.4.1 RCurl_0.98-1 tools_2.9.1 XML_2.5-3
On Wed, Sep 23, 2009 at 5:22 AM, Tefina Paloma <tefina.paloma@gmail.com> wrote:
> The length of both sequences is 3154, but if you BLAST them, they do not
> align perfectly.
>
Do you mean that they are not the same sequence or that they align to the
genome with a gap? This UTR covers two exons, so your sequence should align
with a gap.
Sean
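
A toy sketch of the point about the gap, in base R. All coordinates below are invented for illustration (they are not the real coordinates of ENSG00000152049), and it assumes the flank is taken from contiguous genomic sequence upstream of the coding start. When a 5' UTR is split across two exons, the spliced UTR is shorter than the genomic span it covers, so a window of 3154 genomic bases ending at the coding start is not the same as 3000 flanking bases plus the spliced UTR.

```r
# Hypothetical forward-strand gene; every coordinate here is made up.
utr_exons <- data.frame(
  start = c(10000, 10500),  # genomic start of each 5' UTR segment
  end   = c(10099, 10553)   # genomic end of each 5' UTR segment
)
cds_start <- 10554  # first coding base, directly after the second UTR segment

# Length of the spliced 5' UTR (exonic bases only): 100 + 54
spliced_utr_length <- sum(utr_exons$end - utr_exons$start + 1)

# Genomic distance from the first UTR base to the coding start (includes the intron)
genomic_utr_span <- cds_start - min(utr_exons$start)

# Intronic bases that a contiguous genomic window would pick up
intron_bases <- genomic_utr_span - spliced_utr_length

# Start of a contiguous window of (3000 + spliced UTR) bases ending at the coding start
window_start <- cds_start - (3000 + spliced_utr_length)

# Start of a true 3000-base flank upstream of the first UTR base
flank_start <- min(utr_exons$start) - 3000

cat("spliced UTR length:", spliced_utr_length, "\n")
cat("genomic UTR span:  ", genomic_utr_span, "\n")
cat("window starts", window_start - flank_start, "bases downstream of the true flank start\n")
```

In this toy case the contiguous window loses as many true flanking bases as there are intronic bases inside the UTR span, which would show up as a gap when the two 3154-base sequences are aligned.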
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
Tefina Paloma <tefina.paloma at ...> writes:
> Can it be related to the fact that biomaRt is using the dataset
> hsapiens_gene_ensembl, version NCBI35,
> and biomart.org is using the Homo sapiens genes, GRCh37?
>
Just a small correction:
the dataset hsapiens_gene_ensembl in biomaRt is version NCBI36 (not 35).
Sean Davis <seandavi at ...> writes:
> Do you mean that they are not the same sequence or that they align to the
> genome with a gap? This UTR covers two exons, so your sequence should align
> with a gap.
>
> Sean
>
As far as I understand, the sequences should align to each other perfectly.
Is this right?
Best,
Tefina
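
One quick way to check whether two equal-length sequences are identical, and where they diverge if not, is a position-by-position comparison. This is a minimal base R sketch, assuming both results are available as plain character strings; `mismatch_positions` is a hypothetical helper name and the two short sequences are toy stand-ins, not the real query results.

```r
# Return the 1-based positions at which two equal-length sequences differ.
# Comparison is case-insensitive.
mismatch_positions <- function(seq_a, seq_b) {
  stopifnot(nchar(seq_a) == nchar(seq_b))
  a <- strsplit(toupper(seq_a), "")[[1]]
  b <- strsplit(toupper(seq_b), "")[[1]]
  which(a != b)
}

# Toy sequences standing in for the biomaRt result and the web export
from_biomart <- "ACGTACGTAC"
from_web     <- "ACGTTCGTAA"

print(mismatch_positions(from_biomart, from_web))
```

If the two real sequences agree over one stretch and then disagree almost everywhere from some position onward, that pattern would be more consistent with an offset (for example an intron included in one of them) than with scattered differences between assemblies.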