Extract flanking sequence for a list of genes with biomaRt
1
0
Entering edit mode
lessismore ▴ 20
@lessismore
Last seen 19 months ago
Italy

Dear all,

i have a problem extracting flanking sequences with biomaRt. I've checked this post Comment: Requesting flank sequences (BioMart::Exception) and applied what was suggested but still results in the following error:

Any idea?

# what was suggested and ive tried

library(biomaRt)
AtMart <- useEnsemblGenomes("plants_mart", dataset= "athaliana_eg_gene")
p <- getBM(attributes = c("ensembl_gene_id", "gene_flank"), 
           filters = c("ensembl_gene_id", "upstream_flank"),
           values = list(c("AT1G64000", "AT3G46520"), 20),  
           mart = AtMart,
           checkFilters = FALSE)

# error i get
Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery,  : 
  Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND


> sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.46.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           pillar_1.4.7         dbplyr_2.1.0         compiler_4.0.2       prettyunits_1.1.1   
 [6] tools_4.0.2          progress_1.2.2       bit_4.0.4            tibble_3.0.6         BiocFileCache_1.14.0
[11] RSQLite_2.2.3        memoise_2.0.0        lifecycle_0.2.0      pkgconfig_2.0.3      rlang_0.4.10        
[16] DBI_1.1.1            curl_4.3             yaml_2.2.1           parallel_4.0.2       xfun_0.20           
[21] fastmap_1.1.0        withr_2.4.1          dplyr_1.0.4          stringr_1.4.0        httr_1.4.2          
[26] knitr_1.31           xml2_1.3.2           rappdirs_0.3.3       generics_0.1.0       S4Vectors_0.28.1    
[31] vctrs_0.3.6          askpass_1.1          IRanges_2.24.1       hms_1.0.0            tidyselect_1.1.0    
[36] stats4_4.0.2         bit64_4.0.5          glue_1.4.2           Biobase_2.50.0       R6_2.5.0            
[41] AnnotationDbi_1.52.0 XML_3.99-0.5         purrr_0.3.4          blob_1.2.1           magrittr_2.0.1      
[46] ellipsis_0.3.1       BiocGenerics_0.36.0  assertthat_0.2.1     stringi_1.5.3        openssl_1.4.3       
[51] cachem_1.0.3         crayon_1.4.0
biomaRt • 1.1k views
ADD COMMENT
1
Entering edit mode
Mike Smith ★ 6.5k
@mike-smith
Last seen 12 hours ago
EMBL Heidelberg

This seems to be temperamental on the Ensembl Plants BioMart server. I ran the example code in quick sucession and it worked once, failed once.

> p <- getBM(attributes = c("ensembl_gene_id", "gene_flank"), 
+            filters = c("ensembl_gene_id", "upstream_flank"),
+            values = list(c("AT1G64000", "AT3G46520"), 20),  
+            mart = AtMart,
+            checkFilters = FALSE)
Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery,  : 
  Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND
> p <- getBM(attributes = c("ensembl_gene_id", "gene_flank"), 
+            filters = c("ensembl_gene_id", "upstream_flank"),
+            values = list(c("AT1G64000", "AT3G46520"), 20),  
+            mart = AtMart,
+            checkFilters = FALSE)
> p
  ensembl_gene_id           gene_flank
1       AT1G64000 CTTCTTTCTTTCCTACAAAT
2       AT3G46520 CAATGAAGAAAACTAACGGC

I don't know why this would be, but if the server isn't responding properly there's not much you can do with biomaRt to make it behave better.


Assuming there's not a BSgenome package for your organism, an alternative might be to use the Ensembl REST API. The function below should allow you to get the upstream flanks for a given set of Ensembl IDs. You can take a look at the Ensembl Documentation for other parameters the API accepts if you want to modify the function.

library(httr)
library(jsonlite)
library(dplyr)

getUpstreamFlank <- function(ensembl_ids, n_bases = 20) {

    server <- "https://rest.ensembl.org"
    ext <- "/sequence/id"
    r <- POST(paste(server, ext, sep = ""), 
              content_type("application/json"), 
              accept("application/json"), 
              body = toJSON(data.frame(ids = ensembl_ids, 
                                       expand_5prime = n_bases), 
                            dataframe = "columns"))

    r %>% content() %>%
        toJSON() %>%
        fromJSON() %>%
        ## the sequence has the flank appended, so we extract the first n bases
        mutate(upstream_flank = substr(seq, 1, n_bases),
               id = unlist(id)) %>%
        select(id, upstream_flank) %>%
        as_tibble()
}

Running it with your examples IDs, we see it gets the same two sequences as the successful biomaRt query.

> getUpstreamFlank(c("AT1G64000", "AT3G46520"), n_bases = 20)
# A tibble: 2 x 2
  id        upstream_flank      
  <chr>     <chr>               
1 AT1G64000 CTTCTTTCTTTCCTACAAAT
2 AT3G46520 CAATGAAGAAAACTAACGGC

I can't make an statements about how well that will scale if you want to do this in bulk for a large number of genes. It may work, but there might also be some better strategies using on-disk versions of the genome.

ADD COMMENT
0
Entering edit mode

Thank you so much for you effort to help me with this! I've tried to perform it repeatedly until it worked. I can only add that if you get it to work the first time, then it keeps working if you run it again and again, also with different queries.

ADD REPLY

Login before adding your answer.

Traffic: 430 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6