Error in getBM() for Phytozome
1
0
Entering edit mode
ytlin610 ▴ 10
@ytlin610-16767
Last seen 4.7 years ago
East Lansing, Michigan

I am trying to retrieve flanking sequences of target genes in Chlamydomonas reinhardtii from Phytozome with getBM(). I was able to load the Phytozome database successfully with useMart:

   mart=useMart(biomart = 'phytozome_mart_archive', 
            host = "phytozome.jgi.doe.gov",
            dataset = 'Creinhardtii_281')

I was also able to view the list of attributes and filters with listAttributes(mart) and listFilters(mart), and everything looked good.

However, I kept failing to obtain the flanking sequence with getBM():

DE_cht7_tx <- c("Cre01.g041752.t1.1", "Cre01.g054750.t1.2", "Cre02.g089500.t1.2")
getBM(  attributes=c('gene_flank','transcript_name'),
                filters=c('transcript_name_filter','upstream_flank'),
                values=list(DE_cht7_tx, Upstream=500), 
                mart=mart, checkFilters=FALSE)

Here is the error message I received:

Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery,  : The query to the BioMart webservice returned an invalid result: the number of columns in the result table does not equal the number of attributes in the query. Please report this on the support site at http://support.bioconductor.org

Notably, the same getBM() code worked fine if I used Ensembl Plants as the source for biomart

mart=useEnsembl(biomart = 'plants_mart', 
            host = "plants.ensembl.org",
            dataset = 'creinhardtii_eg_gene')

The reason I prefer to use Phytozome instead of Ensembl Plants is that I would like to use the transcript name e.g. Cre02.g089500.t1.2 as the input for searching flanking sequence, and Ensembl Plants only has gene name like Cre02.g089500.

Anyone has the same issue when retrieving data from Phytozome with getBM()?

biomaRt getBM Phytozome Chlamydomonas • 1.1k views
ADD COMMENT
0
Entering edit mode

I'll try again later, but at the moment I'm getting an error message on the Phytosome website when I try to run this query in a browser

There was a problem with the request.

If the sever is having problems then biomaRt won't work any better unfortunately.

ADD REPLY
0
Entering edit mode
Mike Smith ★ 6.6k
@mike-smith
Last seen 2 hours ago
EMBL Heidelberg

Ok, so there's actually several issues here.

The first is that you have to use https and port 443 to contact the Phytozome BioMart. The error you're getting is because it's doing some redirection in the background, rather than running the query. I'll update the biomaRt vignette to mention this, and the mart construction you want to use is:

mart <- useMart(biomart = 'phytozome_mart_archive', 
             host = "https://phytozome.jgi.doe.gov",
             dataset = 'Creinhardtii_281', 
             port = 443)

Unfortunately, this still wont work because there's some fundamental issue with BioMart and flanking sequences. You get back:

Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND

This has been reported before for Ensembl (https://support.bioconductor.org/p/120429/#120455) and I don't think it will ever be fixed as no one maintains the code for the BioMart engine any more.


Here's a possible solution using both biomaRt and the Ensembl REST API. The general approach is to get the transcript start position from Phytosome, and then retrieve the genomic sequence from Ensembl via the REST API rather than biomaRt.

It assumes the reference genomes are the same at both locations - you should check this. You may also want to check the logic in working out 'upstream' but just subtracting 500bp, I haven't though about whether this should always be a smaller number or should be strand specific.

library(biomaRt)
library(httr)
library(jsonlite)
ibrary(xml2)
library(dplyr)

## set up the mart
mart <- useMart(biomart = 'phytozome_mart_archive', 
             host = "https://phytozome.jgi.doe.gov",
             dataset = 'Creinhardtii_281', 
             port = 443)

## query with biomaRt to get transcript start site
ts_start <- getBM(  attributes = c('transcript_name', 'chr_name1', 'transcript_chrom_start'),
                    filters = c('transcript_name_filter'),
                    values = list(DE_cht7_tx), 
                    mart = mart)

## do some formatting on our BioMart results to get the chromosome
## names and a 500bp window upstream of our transcript start site
chr <- gsub(ts_start$chr_name1, 
     pattern = "chromosome_([0-9]*)", 
     replacement = "\\1")
start <- as.character(as.integer(ts_start$transcript_chrom_start)-500)
end <- ts_start$transcript_chrom_start

## use the formatted data to construct
## a JSON query we can submit to Ensembl REST
regions <- paste0('"', chr, ":", start, "..", end, '"')
body <- paste0('{ "regions" : [', paste0(regions, collapse = ", "), '] }' )

## submit query to ensembl 
server <- "http://rest.ensembl.org"
ext <- "/sequence/region/chlamydomonas_reinhardtii_str_cc_503_cw92_mt_"
r <- POST(paste(server, ext, sep = ""), 
          content_type("application/json"), 
          accept("application/json"), 
          body = body)

## format the output nicely
r2 %>%
  content() %>%
  toJSON() %>%
  fromJSON() %>%
  lapply(FUN = unlist) %>%
  as_tibble()

And the output looks like:

# A tibble: 3 x 4
  seq                             query     id                        molecule
  <chr>                           <chr>     <chr>                     <chr>   
1 GTGTGGGACGTCTTACGTCGCCCGAGGAGA… 2:216640… chromosome:Chlamydomonas… dna     
2 TAGTTACCCTACCGTACGGTCCTTCACGCA… 1:584324… chromosome:Chlamydomonas… dna     
3 TGCGGCGCCGCCGGAGGCCGCGGTGGCGGT… 1:757994… chromosome:Chlamydomonas… dna    
ADD COMMENT
0
Entering edit mode

Hi Mike, thank you very much for your answer. I re-connected to Phytozome with https and port=443as you instructed and it worked! Somehow I didn't get the same error as you did and I was able to obtain the upstream flanking sequences with the original getBM() code I posted:

DE_cht7_tx <- c("Cre01.g041752.t1.1", "Cre01.g054750.t1.2", "Cre02.g089500.t1.2")

agambiaeseq=getBM(attributes=c('gene_flank','transcript_name'),
                                   filters=c('transcript_name_filter','upstream_flank'),
                                   values=list(DE_cht7_tx, Upstream=500), 
                                   mart=mart, checkFilters=FALSE)

as_tibble(agambiaeseq)

# A tibble: 3 x 2
gene_flank                                                                    transcript_name  
<chr>                                                                              <chr>            
1 GCAGCGCGGGCTCAACAGCATGTGCGTGCAGTGCGGGCTGCGGCGGGCGGGGTCCTAGCCGATTGTGTACCGT~ Cre02.g089500.t1~
2 TAGTTACCCTACCGTACGGTCCTTCACGCAGCCTTCACTCACCGTATGTTGAGCAGCAGTGCCAACCAGCCAA~ Cre01.g041752.t1~
3 TGCGGCGCCGCCGGAGGCCGCGGTGGCGGTGGAGGCAGCGGCGGAGGCGGCTCGGCTGGGCCTGGGGTGGGGG~ Cre01.g054750.t1~

The code you provide is still very helpful since I also need to retrieve the sequence downstream of the TSS, and yes, I will try different ranges of +/- bp from TSS (e.g. +/-500bp to +/-2000bp). Those sequences will be the input for enhancer analysis.

Thanks again for the answer, I really appreciate it.

ADD REPLY

Login before adding your answer.

Traffic: 840 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6