Question

Requesting flank sequences (BioMart::Exception)

0

Entering edit mode

pbarros • 0

@pbarros-10778

Last seen 4.0 years ago

Hello,

I was trying to obtain the promoter regions of a set of A. thaliana genes (model plant) using function getBM() from biomaRt but I didn't manage to make it work. First I tried to use getSequences() but apparently it only works with ENSEMBL_MART_ENSEMBL and not plants_mart. I found somewhere on this forum that getBM() could work, but the example on that topic was not what I wanted...

Basically I would like to use "gene_flank" and "upstream_flank" as attributes but I always get this error:

"Error in getBM(attributes = attrs, filters = "ensembl_gene_id", values = c("AT1G64000", :
Query ERROR: caught BioMart::Exception::Usage: Requests for flank sequence must be accompanied by an upstream_flank or downstream_flank request"

this is a sample of the code:

AtMart= useMart("plants_mart", host="plants.ensembl.org", dataset= "athaliana_eg_gene")
attributes = listAttributes(AtMart)
filter = listFilters(AtMart)
attrs <- c("ensembl_gene_id",  "ensembl_transcript_id", "gene_flank", "upstream_flank")
p = getBM(attributes = attrs, filters = "ensembl_gene_id", values = c("AT1G64000", "AT3G46520"),  mart = AtMart )

I think that I need to state somewhere the size of the flanking region, but I don't know where... I tried to add it in the values field, but that is only related with filtering.
Does anyone know how to do this or why this error is occurring?

Thanks in advance
Pedro

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3    Biobase_2.30.0          GenomicRanges_1.22.4   
 [5] GenomeInfoDb_1.6.3      IRanges_2.4.8           S4Vectors_0.8.11        BiocGenerics_0.16.1    
 [9] biomaRt_2.26.1          BiocInstaller_1.20.3   

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             GenomicAlignments_1.6.3    zlibbioc_1.16.0           
 [4] BiocParallel_1.4.3         tools_3.2.5                SummarizedExperiment_1.0.2
 [7] DBI_0.4-1                  lambda.r_1.1.7             futile.logger_1.4.1       
[10] rtracklayer_1.30.4         futile.options_1.0.0       bitops_1.0-6              
[13] RCurl_1.95-4.8             RSQLite_1.0.0              Biostrings_2.38.4         
[16] Rsamtools_1.22.0           XML_3.98-1.4

biomart plants_mart • 1.9k views

ADD COMMENT • link updated 3.2 years ago by lessismore ▴ 20 • written 7.9 years ago by pbarros • 0

2

Entering edit mode

Mike Smith ★ 6.5k

@mike-smith

Last seen 1 hour ago

EMBL Heidelberg

It looks like you've got a solution to this, but here's an example of how you would obtain it using biomaRt itself.

AtMart <- useMart("plants_mart", host="plants.ensembl.org", dataset= "athaliana_eg_gene")
p <- getBM(attributes = c("ensembl_gene_id", "gene_flank"), 
          filters = c("ensembl_gene_id", "upstream_flank"),
          values = list(c("AT1G64000", "AT3G46520"), 20),  
          mart = AtMart,
          checkFilters = FALSE)

This is the output. You'll notice that the column names are incorrect. I'm not sure why that is at the moment, but it's easy enough to fix.

> p
       ensembl_gene_id gene_flank
1 CTTCTTTCTTTCCTACAAAT  AT1G64000
2 AAGTTGCTTCTGTTTTCCAG  AT3G46520

ADD COMMENT • link 7.9 years ago Mike Smith ★ 6.5k

0

Entering edit mode

Dear, i've tried to replicate this example and i get an error:

"Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery, : Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND"

any idea?

ADD REPLY • link 3.2 years ago lessismore ▴ 20

2

Entering edit mode

A slightly updated version of code works for me still:

library(biomaRt)
AtMart <- useEnsemblGenomes("plants_mart", dataset= "athaliana_eg_gene")
p <- getBM(attributes = c("ensembl_gene_id", "gene_flank"), 
           filters = c("ensembl_gene_id", "upstream_flank"),
           values = list(c("AT1G64000", "AT3G46520"), 20),  
           mart = AtMart,
           checkFilters = FALSE)
p
#>   ensembl_gene_id           gene_flank
#> 1       AT1G64000 CTTCTTTCTTTCCTACAAAT
#> 2       AT3G46520 CAATGAAGAAAACTAACGGC

You might get a faster response from others if you start a new thread, rather than adding to one that's 5 years old. It doesn't get pushed to the front page of the site, so it's unlikely many people will see your comments.

ADD REPLY • link 3.2 years ago Mike Smith ★ 6.5k

0

Entering edit mode

Weird, still doenst work for me. I'll follow your tip, thanks

ADD REPLY • link 3.2 years ago lessismore ▴ 20

0

Entering edit mode

pbarros • 0

@pbarros-10778

Last seen 4.0 years ago

It worked beautifully for the purpose! I was actually starting to explore GenomicFeatures but didn't reach any conclusion at that that time, so your help came in good time

Thanks !

ADD COMMENT • link 7.9 years ago pbarros • 0

score 2 · Accepted Answer · 2016-05-26

Hi,

I'm not sure why you can't get the promoter regions for your set of A. thaliana genes using biomaRt, I didn't investigate. Just wanted to mention that an alternative to biomaRt is to use the getPromoterSeq() or extractUpstreamSeqs() functions from the GenomicFeatures package. These functions take as input a BSgenome object (containing the chromosome sequences, BSgenome.Athaliana.TAIR.TAIR9 in your case) and a GRanges object (containing the gene or transcript coordinates). The GRanges is typically extracted from a TxDb object (TxDb.Athaliana.BioMart.plantsmart28 in your case). Something like this:

library(BSgenome.Athaliana.TAIR.TAIR9)
genome <- BSgenome.Athaliana.TAIR.TAIR9

library(TxDb.Athaliana.BioMart.plantsmart28)
txdb <- TxDb.Athaliana.BioMart.plantsmart28

seqlevelsStyle(genome) <- seqlevelsStyle(txdb) <- "Ensembl"

## Using the genes coordinates:
gn <- genes(txdb)
extractUpstreamSeqs(genome, gn)
#   A DNAStringSet instance of length 33602
#         width seq                                           names               
#     [1]  1000 ATATTGCTATTTCTGCCAATA...CTTCACTGTCTTCCTCCCTCC AT1G01010
#     [2]  1000 AATATAAAATGCATTTTAATA...TTAGGGTTAAAACAGTAGCCC AT1G01020
#     [3]  1000 TTGAACAAACAGACACGTATT...GAAGCTCACATGACCCGAACC AT1G01030
#     [4]  1000 ATGCGTGTTGGATGCTTACAA...AAAACAGACCAGAAGAGAGAG AT1G01040
#     [5]  1000 CGCGATTCTTTTTGGCAATGA...CTAAGTTTCATTAGATACTAA AT1G01046
#     ...   ... ...
# [33598]  1000 GGGAGACGGCGCCTTTCCAAG...CAAGCCCGGTGAAGAAATAAA ATMG01370
# [33599]  1000 CTCGTCTTGTGTTGCTCAGAC...TCATCAATCGGAAATCAAGAC ATMG01380
# [33600]  1000 CTCTAACATGGAAATGCTTAT...GTTCTACCACTGCAGGGCCCA ATMG01390
# [33601]  1000 TCATTCCGAAGAACACTTGCC...TCTTTATATACCGAGGATTTG ATMG01400
# [33602]  1000 GTTCTCTGCTTCGTGCTTGGC...ATAGGTTGTTACACCCCTTCC ATMG01410

## Using the transcript coordinates:
tx <- transcripts(txdb)
names(tx) <- mcols(tx)$tx_name
extractUpstreamSeqs(genome, tx)
#   A DNAStringSet instance of length 41671
#         width seq                                           names               
#     [1]  1000 ATATTGCTATTTCTGCCAATA...CTTCACTGTCTTCCTCCCTCC AT1G01010.1
#     [2]  1000 ATGCGTGTTGGATGCTTACAA...AAAACAGACCAGAAGAGAGAG AT1G01040.1
#     [3]  1000 CGCCAAATGGACATCAATTCT...CCCTTTTTCTGGGTTTATTCA AT1G01040.2
#     [4]  1000 CGCGATTCTTTTTGGCAATGA...CTAAGTTTCATTAGATACTAA AT1G01046.1
#     [5]  1000 AGTTCTAATTTTGCTTTCTGT...TATTAGTAAAGCAATTAGAAT AT1G01073.1
#     ...   ... ...
# [41667]  1000 TCGACCCGTGCAGTGCTGTAG...CCAGCAGGAGGCCCGCACGAC ATCG01200.1
# [41668]  1000 GATCCGAATGAATCATCTTTT...TATAAGTAATGCAACTATGAA ATCG01210.1
# [41669]  1000 ATCCGATCAATTGCGTAAAGC...GTTCTATTTCTTGATGGGGGA ATCG01220.1
# [41670]  1000 GTGCAAGGCGCTTTGGTGGGA...CACTTAGCTCCATGGAACAAT ATCG01270.1
# [41671]  1000 TTTTTATAGGAACTTCTGTAC...AATGCAATTTAGGAGGAATCA ATCG01280.1

See ?extractUpstreamSeqs and ?getPromoterSeq for more information.

Cheers,

H.