Question

Error in getBM() for Phytozome

0

Entering edit mode

ytlin610 ▴ 10

@ytlin610-16767

Last seen 5.9 years ago

East Lansing, Michigan

I am trying to retrieve flanking sequences of target genes in Chlamydomonas reinhardtii from Phytozome with getBM(). I was able to load the Phytozome database successfully with useMart:

   mart=useMart(biomart = 'phytozome_mart_archive', 
            host = "phytozome.jgi.doe.gov",
            dataset = 'Creinhardtii_281')

I was also able to view the list of attributes and filters with listAttributes(mart) and listFilters(mart), and everything looked good.

However, I kept failing to obtain the flanking sequence with getBM():

DE_cht7_tx <- c("Cre01.g041752.t1.1", "Cre01.g054750.t1.2", "Cre02.g089500.t1.2")
getBM(  attributes=c('gene_flank','transcript_name'),
                filters=c('transcript_name_filter','upstream_flank'),
                values=list(DE_cht7_tx, Upstream=500), 
                mart=mart, checkFilters=FALSE)

Here is the error message I received:

Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery,  : The query to the BioMart webservice returned an invalid result: the number of columns in the result table does not equal the number of attributes in the query. Please report this on the support site at http://support.bioconductor.org

Notably, the same getBM() code worked fine if I used Ensembl Plants as the source for biomart

mart=useEnsembl(biomart = 'plants_mart', 
            host = "plants.ensembl.org",
            dataset = 'creinhardtii_eg_gene')

The reason I prefer to use Phytozome instead of Ensembl Plants is that I would like to use the transcript name e.g. Cre02.g089500.t1.2 as the input for searching flanking sequence, and Ensembl Plants only has gene name like Cre02.g089500.

Anyone has the same issue when retrieving data from Phytozome with getBM()?

biomaRt getBM Phytozome Chlamydomonas • 1.6k views

ADD COMMENT • link updated 5.9 years ago by Mike Smith ★ 6.6k • written 5.9 years ago by ytlin610 ▴ 10

0

Entering edit mode

I'll try again later, but at the moment I'm getting an error message on the Phytosome website when I try to run this query in a browser

There was a problem with the request.

If the sever is having problems then biomaRt won't work any better unfortunately.

ADD REPLY • link 5.9 years ago Mike Smith ★ 6.6k

score 0 · Answer 1 · 2020-03-25

Ok, so there's actually several issues here.

The first is that you have to use https and port 443 to contact the Phytozome BioMart. The error you're getting is because it's doing some redirection in the background, rather than running the query. I'll update the biomaRt vignette to mention this, and the mart construction you want to use is:

mart <- useMart(biomart = 'phytozome_mart_archive', 
             host = "https://phytozome.jgi.doe.gov",
             dataset = 'Creinhardtii_281', 
             port = 443)

Unfortunately, this still wont work because there's some fundamental issue with BioMart and flanking sequences. You get back:

Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND

This has been reported before for Ensembl (https://support.bioconductor.org/p/120429/#120455) and I don't think it will ever be fixed as no one maintains the code for the BioMart engine any more.

Here's a possible solution using both biomaRt and the Ensembl REST API. The general approach is to get the transcript start position from Phytosome, and then retrieve the genomic sequence from Ensembl via the REST API rather than biomaRt.

It assumes the reference genomes are the same at both locations - you should check this. You may also want to check the logic in working out 'upstream' but just subtracting 500bp, I haven't though about whether this should always be a smaller number or should be strand specific.

library(biomaRt)
library(httr)
library(jsonlite)
ibrary(xml2)
library(dplyr)

## set up the mart
mart <- useMart(biomart = 'phytozome_mart_archive', 
             host = "https://phytozome.jgi.doe.gov",
             dataset = 'Creinhardtii_281', 
             port = 443)

## query with biomaRt to get transcript start site
ts_start <- getBM(  attributes = c('transcript_name', 'chr_name1', 'transcript_chrom_start'),
                    filters = c('transcript_name_filter'),
                    values = list(DE_cht7_tx), 
                    mart = mart)

## do some formatting on our BioMart results to get the chromosome
## names and a 500bp window upstream of our transcript start site
chr <- gsub(ts_start$chr_name1, 
     pattern = "chromosome_([0-9]*)", 
     replacement = "\\1")
start <- as.character(as.integer(ts_start$transcript_chrom_start)-500)
end <- ts_start$transcript_chrom_start

## use the formatted data to construct
## a JSON query we can submit to Ensembl REST
regions <- paste0('"', chr, ":", start, "..", end, '"')
body <- paste0('{ "regions" : [', paste0(regions, collapse = ", "), '] }' )

## submit query to ensembl 
server <- "http://rest.ensembl.org"
ext <- "/sequence/region/chlamydomonas_reinhardtii_str_cc_503_cw92_mt_"
r <- POST(paste(server, ext, sep = ""), 
          content_type("application/json"), 
          accept("application/json"), 
          body = body)

## format the output nicely
r2 %>%
  content() %>%
  toJSON() %>%
  fromJSON() %>%
  lapply(FUN = unlist) %>%
  as_tibble()

And the output looks like:

# A tibble: 3 x 4
  seq                             query     id                        molecule
  <chr>                           <chr>     <chr>                     <chr>   
1 GTGTGGGACGTCTTACGTCGCCCGAGGAGA… 2:216640… chromosome:Chlamydomonas… dna     
2 TAGTTACCCTACCGTACGGTCCTTCACGCA… 1:584324… chromosome:Chlamydomonas… dna     
3 TGCGGCGCCGCCGGAGGCCGCGGTGGCGGT… 1:757994… chromosome:Chlamydomonas… dna