biomaRt getSequence fails with genomic coordinates
3
1
Entering edit mode
Ni-Ar ▴ 10
@ni-ar-15663
Last seen 2.6 years ago
Spain/Barcelona

Hi Mike,

I have a general question and what seems to be a bug or something weird.

My general goal would be:

• given a set of genomic coordinates get the genomic DNA sequence, regardless if the element is an exon, intro, coding region, enhancer, whatever.

• I don't need to get flanking regions upstream or downstream

• I'd like to specify the genome assembly (e.g. hg38, hg19, mm10, or mm9)

I fear that biomaRt actually cannot do that, but I'd be awesome if I were to be proven wrong.

For example, let's get 20bp of an intergenic region, that could be an exon or an intron. The coordinates are: chrX:100636100-100636120

biomaRt::getSequence(chromosome = "x",
start = 100636100,
end   = 100636120,
seqType = 'cdna',
type = 'ensembl_gene_id',
mart = ensembl,
verbose = F) -> tmp


As seqType I specified cdna as I read in the manual that that would return a nucleotide sequence. For the type argument I selected ensembl_gene_id just because I was forced to pick one ID. I would expect such query to return a dataframe with only one row containing the 20bp nucleotide sequence. However,

dim(tmp)
[1] 5 2
nchar(tmp\$cdna)
[1] 3768 3796 1025  820  900


Meaning that I get 5 rows each with DNA sequences of different lenght.

Now is there a way to get only the correct genomic DNA sequence?

I hoped that with the seqType argument one could get such thing, so while playing around and when using coding_gene_flank

biomaRt::getSequence(chromosome = 'x',
start = 100636100,
end   = 100636120,
seqType = 'coding_gene_flank',
type = 'ensembl_gene_id',
mart = ensembl,
verbose = F)


and I came across this error message.

Error in .processResults(postRes, mart = mart, sep = sep, fullXmlQuery = fullXmlQuery,  :

The query to the BioMart webservice returned an invalid result: the number of columns in the result table does not equal the number of attributes in the query.
Please report this on the support site at http://support.bioconductor.org


So, as requested here I am.

Thanks!

PS: I'm using biomaRt v 2.42.0.

2
Entering edit mode
Mike Smith ★ 5.7k
@mike-smith
Last seen 1 hour ago
EMBL Heidelberg / de.NBI

You can use the Esembl REST API if you want a query based solution e.g.

library(httr)
library(jsonlite)
library(xml2)

species <- "human"
chrom <- "X"
start <- 100636100
end <- 100636120
region <- paste0(chrom, ":", start, "..", end)

server <- "http://rest.ensembl.org/sequence/region"

r <- GET(paste(server, species, region, sep = "/"), content_type("text/plain"))

print(content(r))


which gives

[1] "CCAGGTAGGGATGGGCACAGG"


However, I'm not sure you can get access lots of different genome builds that way, other than http://grch37.rest.ensembl.org/ I think it's mostly limited to the most recent assembly in Ensembl.

1
Entering edit mode
Mike Smith ★ 5.7k
@mike-smith
Last seen 1 hour ago
EMBL Heidelberg / de.NBI

Thanks for the nicely formatted and structured question!

I wouldn't do this with biomaRt, you're correct that's it's not designed to do this task. It's designed to be gene-centric e.g. if you specify a region it tells you want genes are in the region, so it's actually really hard to get generic sequence information out. What you're actually getting there is the cdna sequence of the 5 transcripts that overlap the region you've specified.

Instead I would suggest using the BSgenome packages. There's a bigger overhead as you have to download the genome sequence for each assembly you're interested in, but once it's installed you're no longer beholden to the stability of the BioMart servers.

Below is an example for hg38, but there are packages for hg19, mm10, etc and they all work in the same way. (You can search for 'BSgenome' on https://www.bioconductor.org/packages/release/data/annotation/ to see all available genomes.)

## Load genome package
library(BSgenome.Hsapiens.UCSC.hg38)
## assign this to a short name for convenience
hg38 <- BSgenome.Hsapiens.UCSC.hg38
## extract sequence between specific coordinates
getSeq(x = hg38, names = "chrX", start = 100636100, end = 100636120)


and we get back

  21-letter "DNAString" instance
seq: CCAGGTAGGGATGGGCACAGG


Note that it's a 21-letter sequence as this includes both the start and end bases.

I'm not actually sure what's causing the error in your second query, I'll take a look.

0
Entering edit mode

Just as a side note, maybe I'd be nice to include this info in the official documentation.

0
Entering edit mode
Ni-Ar ▴ 10
@ni-ar-15663
Last seen 2.6 years ago
Spain/Barcelona

I feared I had to download the genomes. I hoped there would be a "query"-based solution since I'm mostly only fetching short DNA sequences. I'll use the BSgenome package, and thanks for the quick example.

Cheers