Question

Retrieve basepair position flanking sequences in biomaRt

0

Entering edit mode

floris_brenk • 0

@floris_brenk-20773

Last seen 5.0 years ago

Im looking for a way to obtain the flanking sequences of a couple of variants. I found this thread which is very similar to my question:

https://support.bioconductor.org/p/89688/

Now, I was wondering is there also an option to do this for chr:bp format? Some of my variants do not have an rsid and I would still like to get the flanking sequences of them.

biomart • 718 views

ADD COMMENT • link updated 5.0 years ago by Mike Smith ★ 6.5k • written 5.0 years ago by floris_brenk • 0

2

Entering edit mode

Mike Smith will probably have an answer for you that is based on a biomaRt query. But do note that it's easy enough to get flanking sequences from a BSgenome package based on the chr:pos data:

> library(BSgenome.Hsapiens.UCSC.hg38)
## fake SNP locations
> snp.pos <- GRanges(c("chr1","chr14","chr22"), IRanges(c(384938,495483,23292), width =1))
> snp.pos
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    384938      *
  [2]    chr14    495483      *
  [3]    chr22     23292      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

> flank(snp.pos, 20)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 384918-384937      *
  [2]    chr14 495463-495482      *
  [3]    chr22   23272-23291      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> getSeq(Hsapiens, flank(snp.pos, 20))
  A DNAStringSet instance of length 3
    width seq
[1]    20 GACCAGCCTGGTCAACATGG
[2]    20 NNNNNNNNNNNNNNNNNNNN
[3]    20 NNNNNNNNNNNNNNNNNNNN
> getSeq(Hsapiens, flank(snp.pos, 20, FALSE))
  A DNAStringSet instance of length 3
    width seq
[1]    20 GAAACCCCGTCTCTATTAAA
[2]    20 NNNNNNNNNNNNNNNNNNNN
[3]    20 NNNNNNNNNNNNNNNNNNNN
>

ADD REPLY • link 5.0 years ago James W. MacDonald 65k

score 0 · Answer 1 · 2019-05-17

If these are putative SNPs i.e. not in dbSNP just inferred from your data, then I don't think you can retrieve arbitrary sequences from BioMart. The various Marts are either transcript-centric or variant-centric, so getting sequence from regions outside those is not really supported. IMO the method James suggests will be much more efficient than trying to query a web resource for sequence.

To be consistent in your workflow, you could always used biomaRt to get the locations of the SNPs that have rsIDs and then use BSgenome approach for everything.