Search
Question: Requesting flank sequences (BioMart::Exception)
0
gravatar for pbarros
18 months ago by
pbarros0
pbarros0 wrote:

Hello,

I was trying to obtain the promoter regions of a set of A. thaliana genes (model plant) using  function getBM() from biomaRt but I didn't manage to make it work. First I tried to use getSequences() but apparently it only works with ENSEMBL_MART_ENSEMBL and not plants_mart. I found somewhere on this forum that getBM() could work, but the example on that topic was not what I wanted... 

Basically I would like to use  "gene_flank" and "upstream_flank"  as attributes but I always get this error:

"Error in getBM(attributes = attrs, filters = "ensembl_gene_id", values = c("AT1G64000",  :
  Query ERROR: caught BioMart::Exception::Usage: Requests for flank sequence must be accompanied by an upstream_flank or downstream_flank request"

this is a sample of the code:

AtMart= useMart("plants_mart", host="plants.ensembl.org", dataset= "athaliana_eg_gene")
attributes = listAttributes(AtMart)
filter = listFilters(AtMart)
attrs <- c("ensembl_gene_id",  "ensembl_transcript_id", "gene_flank", "upstream_flank")
p = getBM(attributes = attrs, filters = "ensembl_gene_id", values = c("AT1G64000", "AT3G46520"),  mart = AtMart )

I think that I need to state somewhere the size of the flanking region, but I don't know where... I tried to add it in the values field, but that is only related with filtering.
Does anyone know how to do this or why this error is occurring?

Thanks in advance
Pedro

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3    Biobase_2.30.0          GenomicRanges_1.22.4   
 [5] GenomeInfoDb_1.6.3      IRanges_2.4.8           S4Vectors_0.8.11        BiocGenerics_0.16.1    
 [9] biomaRt_2.26.1          BiocInstaller_1.20.3   

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             GenomicAlignments_1.6.3    zlibbioc_1.16.0           
 [4] BiocParallel_1.4.3         tools_3.2.5                SummarizedExperiment_1.0.2
 [7] DBI_0.4-1                  lambda.r_1.1.7             futile.logger_1.4.1       
[10] rtracklayer_1.30.4         futile.options_1.0.0       bitops_1.0-6              
[13] RCurl_1.95-4.8             RSQLite_1.0.0              Biostrings_2.38.4         
[16] Rsamtools_1.22.0           XML_3.98-1.4              

 

 

ADD COMMENTlink modified 17 months ago by Mike Smith2.1k • written 18 months ago by pbarros0
2
gravatar for Hervé Pagès
18 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

I'm not sure why you can't get the promoter regions for your set of A. thaliana genes using biomaRt, I didn't investigate. Just wanted to mention that an alternative to biomaRt is to use the getPromoterSeq() or extractUpstreamSeqs() functions from the GenomicFeatures package. These functions take as input a BSgenome object (containing the chromosome sequences, BSgenome.Athaliana.TAIR.TAIR9 in your case) and a GRanges object (containing the gene or transcript coordinates). The GRanges is typically extracted from a TxDb object (TxDb.Athaliana.BioMart.plantsmart28 in your case). Something like this:

library(BSgenome.Athaliana.TAIR.TAIR9)
genome <- BSgenome.Athaliana.TAIR.TAIR9

library(TxDb.Athaliana.BioMart.plantsmart28)
txdb <- TxDb.Athaliana.BioMart.plantsmart28

seqlevelsStyle(genome) <- seqlevelsStyle(txdb) <- "Ensembl"

## Using the genes coordinates:
gn <- genes(txdb)
extractUpstreamSeqs(genome, gn)
#   A DNAStringSet instance of length 33602
#         width seq                                           names               
#     [1]  1000 ATATTGCTATTTCTGCCAATA...CTTCACTGTCTTCCTCCCTCC AT1G01010
#     [2]  1000 AATATAAAATGCATTTTAATA...TTAGGGTTAAAACAGTAGCCC AT1G01020
#     [3]  1000 TTGAACAAACAGACACGTATT...GAAGCTCACATGACCCGAACC AT1G01030
#     [4]  1000 ATGCGTGTTGGATGCTTACAA...AAAACAGACCAGAAGAGAGAG AT1G01040
#     [5]  1000 CGCGATTCTTTTTGGCAATGA...CTAAGTTTCATTAGATACTAA AT1G01046
#     ...   ... ...
# [33598]  1000 GGGAGACGGCGCCTTTCCAAG...CAAGCCCGGTGAAGAAATAAA ATMG01370
# [33599]  1000 CTCGTCTTGTGTTGCTCAGAC...TCATCAATCGGAAATCAAGAC ATMG01380
# [33600]  1000 CTCTAACATGGAAATGCTTAT...GTTCTACCACTGCAGGGCCCA ATMG01390
# [33601]  1000 TCATTCCGAAGAACACTTGCC...TCTTTATATACCGAGGATTTG ATMG01400
# [33602]  1000 GTTCTCTGCTTCGTGCTTGGC...ATAGGTTGTTACACCCCTTCC ATMG01410

## Using the transcript coordinates:
tx <- transcripts(txdb)
names(tx) <- mcols(tx)$tx_name
extractUpstreamSeqs(genome, tx)
#   A DNAStringSet instance of length 41671
#         width seq                                           names               
#     [1]  1000 ATATTGCTATTTCTGCCAATA...CTTCACTGTCTTCCTCCCTCC AT1G01010.1
#     [2]  1000 ATGCGTGTTGGATGCTTACAA...AAAACAGACCAGAAGAGAGAG AT1G01040.1
#     [3]  1000 CGCCAAATGGACATCAATTCT...CCCTTTTTCTGGGTTTATTCA AT1G01040.2
#     [4]  1000 CGCGATTCTTTTTGGCAATGA...CTAAGTTTCATTAGATACTAA AT1G01046.1
#     [5]  1000 AGTTCTAATTTTGCTTTCTGT...TATTAGTAAAGCAATTAGAAT AT1G01073.1
#     ...   ... ...
# [41667]  1000 TCGACCCGTGCAGTGCTGTAG...CCAGCAGGAGGCCCGCACGAC ATCG01200.1
# [41668]  1000 GATCCGAATGAATCATCTTTT...TATAAGTAATGCAACTATGAA ATCG01210.1
# [41669]  1000 ATCCGATCAATTGCGTAAAGC...GTTCTATTTCTTGATGGGGGA ATCG01220.1
# [41670]  1000 GTGCAAGGCGCTTTGGTGGGA...CACTTAGCTCCATGGAACAAT ATCG01270.1
# [41671]  1000 TTTTTATAGGAACTTCTGTAC...AATGCAATTTAGGAGGAATCA ATCG01280.1

See ?extractUpstreamSeqs and ?getPromoterSeq for more information.

Cheers,

H.

ADD COMMENTlink written 18 months ago by Hervé Pagès ♦♦ 13k
1
gravatar for Mike Smith
17 months ago by
Mike Smith2.1k
EMBL Heidelberg / de.NBI
Mike Smith2.1k wrote:

It looks like you've got a solution to this, but here's an example of how you would obtain it using biomaRt itself.

AtMart <- useMart("plants_mart", host="plants.ensembl.org", dataset= "athaliana_eg_gene")
p <- getBM(attributes = c("ensembl_gene_id", "gene_flank"), 
          filters = c("ensembl_gene_id", "upstream_flank"),
          values = list(c("AT1G64000", "AT3G46520"), 20),  
          mart = AtMart,
          checkFilters = FALSE)

This is the output. You'll notice that the column names are incorrect. I'm not sure why that is at the moment, but it's easy enough to fix.​

> p
       ensembl_gene_id gene_flank
1 CTTCTTTCTTTCCTACAAAT  AT1G64000
2 AAGTTGCTTCTGTTTTCCAG  AT3G46520
ADD COMMENTlink modified 17 months ago • written 17 months ago by Mike Smith2.1k
0
gravatar for pbarros
18 months ago by
pbarros0
pbarros0 wrote:

It worked beautifully for the purpose! I was actually starting to explore GenomicFeatures but didn't reach any conclusion at that that time, so your help came in good time

Thanks !

 

ADD COMMENTlink written 18 months ago by pbarros0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 108 users visited in the last hour