Retrieve basepair position flanking sequences in biomaRt
1
0
Entering edit mode
@floris_brenk-20773
Last seen 5.0 years ago

Im looking for a way to obtain the flanking sequences of a couple of variants. I found this thread which is very similar to my question:

https://support.bioconductor.org/p/89688/

Now, I was wondering is there also an option to do this for chr:bp format? Some of my variants do not have an rsid and I would still like to get the flanking sequences of them.

biomart • 718 views
ADD COMMENT
2
Entering edit mode

Mike Smith will probably have an answer for you that is based on a biomaRt query. But do note that it's easy enough to get flanking sequences from a BSgenome package based on the chr:pos data:

> library(BSgenome.Hsapiens.UCSC.hg38)
## fake SNP locations
> snp.pos <- GRanges(c("chr1","chr14","chr22"), IRanges(c(384938,495483,23292), width =1))
> snp.pos
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    384938      *
  [2]    chr14    495483      *
  [3]    chr22     23292      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

> flank(snp.pos, 20)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 384918-384937      *
  [2]    chr14 495463-495482      *
  [3]    chr22   23272-23291      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> getSeq(Hsapiens, flank(snp.pos, 20))
  A DNAStringSet instance of length 3
    width seq
[1]    20 GACCAGCCTGGTCAACATGG
[2]    20 NNNNNNNNNNNNNNNNNNNN
[3]    20 NNNNNNNNNNNNNNNNNNNN
> getSeq(Hsapiens, flank(snp.pos, 20, FALSE))
  A DNAStringSet instance of length 3
    width seq
[1]    20 GAAACCCCGTCTCTATTAAA
[2]    20 NNNNNNNNNNNNNNNNNNNN
[3]    20 NNNNNNNNNNNNNNNNNNNN
> 
ADD REPLY
0
Entering edit mode
Mike Smith ★ 6.5k
@mike-smith
Last seen 5 hours ago
EMBL Heidelberg

If these are putative SNPs i.e. not in dbSNP just inferred from your data, then I don't think you can retrieve arbitrary sequences from BioMart. The various Marts are either transcript-centric or variant-centric, so getting sequence from regions outside those is not really supported. IMO the method James suggests will be much more efficient than trying to query a web resource for sequence.

To be consistent in your workflow, you could always used biomaRt to get the locations of the SNPs that have rsIDs and then use BSgenome approach for everything.

ADD COMMENT

Login before adding your answer.

Traffic: 441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6