biomaRt vs biomart.org
Tefina Paloma ▴ 220
@tefina-paloma-3676
Last seen 9.6 years ago
Dear list,

I am trying to retrieve 5' flanking sequences and the 5' UTR for several genes. Doing this via biomart.org and via biomaRt yields different results.

For example, I want to retrieve the 5' flanking sequence (3000 bases) plus the 5' UTR for the gene with Entrez ID 23704.

My R code:

library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
myseq <- getSequence(id = c(23704), type = "entrezgene", seqType = "coding_gene_flank", upstream = 3154, mart = ensembl)

The 5' UTR of this gene is exactly 154 bases long, so this query should yield 3000 upstream bases plus the 5' UTR.

But doing this via biomart.org, I get the following:

http://www.ensembl.org/Homo_sapiens/Gene/Export?db=core;g=ENSG00000152049;output=fasta;r=2:223916862-223920353;strand=feature;t=ENST00000281830;time=1253696359.47541;st=utr5;genomic=5_flanking;_format=HTML

Both sequences are 3154 bases long, but if you BLAST them against each other, they do not align perfectly.

What am I missing? Could it be related to the fact that biomaRt is using the dataset hsapiens_gene_ensembl, version NCBI35, while biomart.org is using the Homo sapiens genes dataset, GRCh37?

Thanks a lot,
Tefina

> sessionInfo()
R version 2.9.1 (2009-06-26)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biostrings_2.12.8 IRanges_1.2.3 biomaRt_2.0.0

loaded via a namespace (and not attached):
[1] Biobase_2.4.1 RCurl_0.98-1 tools_2.9.1 XML_2.5-3
Homo sapiens biomaRt • 1.6k views
@sean-davis-490
Last seen 12 weeks ago
United States
On Wed, Sep 23, 2009 at 5:22 AM, Tefina Paloma <tefina.paloma@gmail.com> wrote:

> The length of both sequences is 3154, but if you blast them, they do not
> align perfectly.

Do you mean that they are not the same sequence or that they align to the genome with a gap? This UTR covers two exons, so your sequence should align with a gap.

Sean
Tefina Paloma ▴ 220
@tefina-paloma-3676
Last seen 9.6 years ago
Tefina Paloma <tefina.paloma at ...> writes:

> Can it be related to the fact that biomaRt is using the dataset
> hsapiens_gene_ensembl, version NCBI35
> and biomaRt.org is using the Homo Sapiens genes, GRCh37?

Just a little correction: the dataset hsapiens_gene_ensembl in biomaRt is version NCBI36 (not 35).
Tefina Paloma ▴ 220
@tefina-paloma-3676
Last seen 9.6 years ago
Sean Davis <seandavi at ...> writes:

> Do you mean that they are not the same sequence or that they align to the
> genome with a gap? This UTR covers two exons, so your sequence should align
> with a gap.
>
> Sean

If I use the sequence from biomart.org as the query and the sequence from biomaRt as the subject, the alignment looks like this:

>lcl|2591
Length=3154

 Score = 4798 bits (2598), Expect = 0.0
 Identities = 2598/2598 (100%), Gaps = 0/2598 (0%)
 Strand=Plus/Plus

Query  534   GCAACCTGAAGCCCTTGGGAGCAACAGCGTACTCCTAACAATGACAACTAAACACAGGCA  593
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1     GCAACCTGAAGCCCTTGGGAGCAACAGCGTACTCCTAACAATGACAACTAAACACAGGCA  60

Query  594   CTGAGCATGTGCATTTGGCCAGACATGGTGCTTTCTTTGCATCATTTCATTGAActattt  653
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  61    CTGAGCATGTGCATTTGGCCAGACATGGTGCTTTCTTTGCATCATTTCATTGAACTATTT  120

Query  654   tattctgttctgttctattctattctattctattctattctattctatttattTAGAGAT  713
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  121   TATTCTGTTCTGTTCTATTCTATTCTATTCTATTCTATTCTATTCTATTTATTTAGAGAT  180

Query  714   CTCGCTCTGTCACCCAGGCTGGAGTGTAGTGGCATGTTCAGACCTCATTGCAGCCTTGAA  773
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  181   CTCGCTCTGTCACCCAGGCTGGAGTGTAGTGGCATGTTCAGACCTCATTGCAGCCTTGAA  240

Query  774   CTCCTGGTCTCGAGTGATCCTCCCACCCCAGCCTCCCAAGTAGCTGGGACTACAGGCACT  833
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  241   CTCCTGGTCTCGAGTGATCCTCCCACCCCAGCCTCCCAAGTAGCTGGGACTACAGGCACT  300

Query  834   CGCCACCAGGCCTAGTTAATTTTTGTAtttttttGTAGAGATGGGGTCTCACTGTGTTGC  893
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  301   CGCCACCAGGCCTAGTTAATTTTTGTATTTTTTTGTAGAGATGGGGTCTCACTGTGTTGC  360

Query  894   CCACGCTGGTCTCAAACACCTGGGTTCAAGTGATTCATCCACCTCAGCCTCTTCAAGCAT  953
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  361   CCACGCTGGTCTCAAACACCTGGGTTCAAGTGATTCATCCACCTCAGCCTCTTCAAGCAT  420

Query  954   TGGGATTACTGAACTAAGACACTGCAGTTGGCCTCGTTTAACTCTAGTAGAAATATCCAT  1013
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  421   TGGGATTACTGAACTAAGACACTGCAGTTGGCCTCGTTTAACTCTAGTAGAAATATCCAT  480

Query  1014  GCAGGAAGTATGTGGGAATCGGGGCAGCAGGGACTCCAAGCAGGCACCCCAGAATTTCTT  1073
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  481   GCAGGAAGTATGTGGGAATCGGGGCAGCAGGGACTCCAAGCAGGCACCCCAGAATTTCTT  540

Query  1074  CTGGGCTGTTCCTTCCCTGACTCCTGCAATTAGTCCTGCTTTTCCTTTGGCTCTGACTTG  1133
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  541   CTGGGCTGTTCCTTCCCTGACTCCTGCAATTAGTCCTGCTTTTCCTTTGGCTCTGACTTG  600

Query  1134  CTTCGTCCTTTGGAATTCATTCTCGATGTTTCCCCACACTCATCTCTTTTCTTGGTTGTA  1193
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  601   CTTCGTCCTTTGGAATTCATTCTCGATGTTTCCCCACACTCATCTCTTTTCTTGGTTGTA  660

Query  1194  TTCCCTTGGGACTGTTGGCTCAGGTTTGGGGATTTATTATGTTTAAAACTTCAGCCTCTG  1253
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  661   TTCCCTTGGGACTGTTGGCTCAGGTTTGGGGATTTATTATGTTTAAAACTTCAGCCTCTG  720

Query  1254  TTTGGCTTCCTGGCACCAGGCTTTGTACTTCCTGCTCCTTGAATCTGGTAACTCCTATCC  1313
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  721   TTTGGCTTCCTGGCACCAGGCTTTGTACTTCCTGCTCCTTGAATCTGGTAACTCCTATCC  780

Query  1314  CCACCTCCTTTCTGCCTACTCAAAGCTTCCAGTCTTTGGTGTTGGACAATCCCTGGATGA  1373
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  781   CCACCTCCTTTCTGCCTACTCAAAGCTTCCAGTCTTTGGTGTTGGACAATCCCTGGATGA  840

Query  1374  TGACCAATCTCGTATGTCCTAAGGTATACAATAAAAAATACCAGGGTCAACAATCAACAG  1433
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  841   TGACCAATCTCGTATGTCCTAAGGTATACAATAAAAAATACCAGGGTCAACAATCAACAG  900

Query  1434  GCATCTCTTTCTTGGGCCCATCTTGTTCTAGTGTCCCAGACATTCCAGTGTAGGCTTAGA  1493
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  901   GCATCTCTTTCTTGGGCCCATCTTGTTCTAGTGTCCCAGACATTCCAGTGTAGGCTTAGA  960

Query  1494  TATAGATGGAAGTGTTCTAGTGTTTATGATGGACACCTGTTGAAAAGACCAAGTCTACCA  1553
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  961   TATAGATGGAAGTGTTCTAGTGTTTATGATGGACACCTGTTGAAAAGACCAAGTCTACCA  1020

Query  1554  TGGCTGAGGTAGCTATGGAGGGTTTTACGTATTAACACAATGGTGAGGGTATCTTTACTG  1613
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1021  TGGCTGAGGTAGCTATGGAGGGTTTTACGTATTAACACAATGGTGAGGGTATCTTTACTG  1080

Query  1614  GTGTGAGCACAGTTCCACTGTATGGATGATCGTGATGCTGGAGTGGTCGATGGTTGGTAC  1673
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1081  GTGTGAGCACAGTTCCACTGTATGGATGATCGTGATGCTGGAGTGGTCGATGGTTGGTAC  1140

Query  1674  CTCCAGTGCCAGCTGGGGATTTATGGATGAACACAGGTGAGTAGTCAAGTGGGAAAAATG  1733
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1141  CTCCAGTGCCAGCTGGGGATTTATGGATGAACACAGGTGAGTAGTCAAGTGGGAAAAATG  1200

Query  1734  GCAGCATTCAGTTCATCTTCCTATTCTTCCTCCAGGTGTCTTCTTAGAATCAGGATCAGG  1793
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1201  GCAGCATTCAGTTCATCTTCCTATTCTTCCTCCAGGTGTCTTCTTAGAATCAGGATCAGG  1260

Query  1794  TGCAAACCCAGGGGGGTTCCTGTAGCAGCAGTGAAAATTCCAGTGCCTAAGCTATATATG  1853
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1261  TGCAAACCCAGGGGGGTTCCTGTAGCAGCAGTGAAAATTCCAGTGCCTAAGCTATATATG  1320

Query  1854  TTCAAGCAGGTCAGGTGGATGTCGCATGCGTCAGTTTGACTACAGCAGAACCATGAGAGA  1913
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1321  TTCAAGCAGGTCAGGTGGATGTCGCATGCGTCAGTTTGACTACAGCAGAACCATGAGAGA  1380

Query  1914  TGTTTCCTTTAGAGTTGGCCCACAAGACAGTCTGGCTGCAATCCACAGGCCACAGACAAC  1973
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1381  TGTTTCCTTTAGAGTTGGCCCACAAGACAGTCTGGCTGCAATCCACAGGCCACAGACAAC  1440

Query  1974  TGGAGGGAGTGGATCTCTCCCAGTTTCCTTCCACTTAGCATGAAAGCCTCAGAATAAGCA  2033
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1441  TGGAGGGAGTGGATCTCTCCCAGTTTCCTTCCACTTAGCATGAAAGCCTCAGAATAAGCA  1500

Query  2034  GCCCAGGGAGCAGAGAGACTGACATTAAAGCCTGCAATTCCTCTTCCAATTTTGATCACA  2093
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1501  GCCCAGGGAGCAGAGAGACTGACATTAAAGCCTGCAATTCCTCTTCCAATTTTGATCACA  1560

Query  2094  GCAGCCATTTAAACACAGGGTCTACCGAGGTTTAAAAAACTTGAACTGTGCTTAGTTGCA  2153
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1561  GCAGCCATTTAAACACAGGGTCTACCGAGGTTTAAAAAACTTGAACTGTGCTTAGTTGCA  1620

Query  2154  CTCTGAAATAGTCCTGCTCCTCCCCTGACCTACGAGAGACAGCAAAGAGACGTGTCAATA  2213
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1621  CTCTGAAATAGTCCTGCTCCTCCCCTGACCTACGAGAGACAGCAAAGAGACGTGTCAATA  1680

Query  2214  GCCTCCGCATGAGGCTTCAGAGGAGCAGCTGTGTATGGCAGGACGGAACAAAACCTGCCC  2273
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1681  GCCTCCGCATGAGGCTTCAGAGGAGCAGCTGTGTATGGCAGGACGGAACAAAACCTGCCC  1740

Query  2274  ATAGTATCTTTTACGACAACATGTTTCCACTTAATGCAGACCACTGAAAAGAATGTGGGA  2333
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1741  ATAGTATCTTTTACGACAACATGTTTCCACTTAATGCAGACCACTGAAAAGAATGTGGGA  1800

Query  2334  GCTTTTaaaaaaaaaTTATTATAAACATAGGTTTGTGACCTTGATGTGGAAGGCAGCTAG  2393
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1801  GCTTTTAAAAAAAAATTATTATAAACATAGGTTTGTGACCTTGATGTGGAAGGCAGCTAG  1860

Query  2394  AATCTCTGCTTTTAGAGGGCTAAGCAACACCAGGCAGCCTTCAATCTTAGAAGGGTTAAG  2453
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1861  AATCTCTGCTTTTAGAGGGCTAAGCAACACCAGGCAGCCTTCAATCTTAGAAGGGTTAAG  1920

Query  2454  CTGAAAGGGTCTCAAAAGGTCACGTGGTTTATATAATCCTACCTGCAGAAGAcccccccc  2513
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1921  CTGAAAGGGTCTCAAAAGGTCACGTGGTTTATATAATCCTACCTGCAGAAGACCCCCCCC  1980

Query  2514  cccGCCAGGCACAACGATTTTACAGACGAGGAATGTGAGGTGCGGAGAGGTTAAGGAAGG  2573
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1981  CCCGCCAGGCACAACGATTTTACAGACGAGGAATGTGAGGTGCGGAGAGGTTAAGGAAGG  2040

Query  2574  ATTTATCTTATTTGCATAAGGAGTGGAAGAACTGAAACCGAAGCCCCAGTTCCTTGACTG  2633
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2041  ATTTATCTTATTTGCATAAGGAGTGGAAGAACTGAAACCGAAGCCCCAGTTCCTTGACTG  2100

Query  2634  TAAATCCCGCACTTGCTTCCAACTGTCTTTCATCCAGATTATGGGATTCAGCTGCCTCTG  2693
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2101  TAAATCCCGCACTTGCTTCCAACTGTCTTTCATCCAGATTATGGGATTCAGCTGCCTCTG  2160

Query  2694  AAAACCTGTAGCCCAATAATGGTTATTCCCCAGGAGCCGCGCGAAGCATGAGCTAATTTT  2753
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2161  AAAACCTGTAGCCCAATAATGGTTATTCCCCAGGAGCCGCGCGAAGCATGAGCTAATTTT  2220

Query  2754  CAGTGAGCGCGGACTTTGGGGTAACGGTTCCAGCACAGCACATCCCTTTCTCCTCTTTTC  2813
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2221  CAGTGAGCGCGGACTTTGGGGTAACGGTTCCAGCACAGCACATCCCTTTCTCCTCTTTTC  2280

Query  2814  ACTCATCGTCACCGCTACCTGAAAACCCTGGCCGGGTGCTGGGGCTTGAGGAGCAGTTCC  2873
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2281  ACTCATCGTCACCGCTACCTGAAAACCCTGGCCGGGTGCTGGGGCTTGAGGAGCAGTTCC  2340

Query  2874  CACTTCCCAGTCTTTTTCACTTTTCACAGCTGCAAAGTTCAGGGAGTTGAACTGCAGTGC  2933
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2341  CACTTCCCAGTCTTTTTCACTTTTCACAGCTGCAAAGTTCAGGGAGTTGAACTGCAGTGC  2400

Query  2934  TTTCAGTTCACTGCTCACTCTGCCACGATCAATCTCTGTTGTAAATTTTCCTCCCAGAGC  2993
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2401  TTTCAGTTCACTGCTCACTCTGCCACGATCAATCTCTGTTGTAAATTTTCCTCCCAGAGC  2460

Query  2994  ACGTGACGATGCACTTCTTGACTATATATCCCAACTGCAGCAGCGGAGTTGTCAGAGCGC  3053
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2461  ACGTGACGATGCACTTCTTGACTATATATCCCAACTGCAGCAGCGGAGTTGTCAGAGCGC  2520

Query  3054  AGAGCCGGACAGAGCAGAAGAACCCTCTTGGACTGGACGATTTGGGAATTCAAAACTTGG  3113
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2521  AGAGCCGGACAGAGCAGAAGAACCCTCTTGGACTGGACGATTTGGGAATTCAAAACTTGG  2580

Query  3114  GACAAACTGTCAGCCTTG  3131
             ||||||||||||||||||
Sbjct  2581  GACAAACTGTCAGCCTTG  2598
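
One possible reading of these coordinates, consistent with Sean's remark that the UTR covers two exons but not confirmed anywhere in this thread: the web export may return the 3000-base flank followed by the spliced 5' UTR, while coding_gene_flank may return 3154 contiguous genomic bases upstream of the coding start, intron included. The base R sketch below only does the arithmetic on the numbers in the report. The variable names are made up for illustration, and the intron interpretation is an assumption.

```r
# Numbers copied from the question and the BLAST report above.
total_len <- 3154            # length of both sequences
utr_len   <- 154             # 5' UTR length stated in the question
q_start <- 534; q_end <- 3131  # Query = sequence from the web export
s_start <- 1;   s_end <- 2598  # Sbjct = sequence from biomaRt getSequence()

aligned_len  <- q_end - q_start + 1   # bases covered by the alignment
offset       <- q_start - s_start     # shift between the two sequences
query_tail   <- total_len - q_end     # query bases after the aligned block
subject_tail <- total_len - s_end     # subject bases after the aligned block

# ASSUMPTION: the subject tail is an intron inside the 5' UTR followed by the
# part of the UTR that lies in the second exon; the query tail is that
# second-exon part only.
implied_intron    <- subject_tail - query_tail
utr_in_first_exon <- utr_len - query_tail
implied_flank     <- q_end - utr_in_first_exon

print(c(aligned = aligned_len, offset = offset,
        query_tail = query_tail, subject_tail = subject_tail,
        implied_intron = implied_intron,
        utr_in_first_exon = utr_in_first_exon,
        implied_flank = implied_flank))
```

Under this assumption the implied intron length equals the observed offset, and the implied flank comes out at the 3000 bases requested, so the numbers are at least internally consistent. The last 23 bases of the two sequences would then also match each other, but a stretch that short may simply not be reported by BLAST.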
Tefina Paloma ▴ 220
@tefina-paloma-3676
Last seen 9.6 years ago
Sean Davis <seandavi at ...> writes:

> Do you mean that they are not the same sequence or that they align to the
> genome with a gap? This UTR covers two exons, so your sequence should align
> with a gap.
>
> Sean

As far as I understand, the sequences should align to each other perfectly. Is this right?

Best,
Tefina