Question: How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?
gravatar for pertille
2.7 years ago by
pertille0 wrote:

How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?
The reference genome used is "Gallus gallus" that can be obtained by:


My data file is a result of gRanges package 


    > olaps
    GRanges object with 2141 ranges and 0 metadata columns:
             seqnames               ranges strand
                <Rle>            <IRanges>  <Rle>
         [1]    chr14 [ 1665929,  1673673]      *
         [2]    chr14 [ 2587465,  2595209]      *
         [3]    chr14 [ 8143785,  8151529]      *
         [4]    chr14 [ 9779705,  9787449]      *
         [5]    chr14 [10281129, 10288873]      *
         ...      ...                  ...    ...
      [2137]    chr24   [3280553, 3288297]      *
      [2138]    chr24   [3330889, 3338633]      *
      [2139]    chr24   [3005641, 3015321]      *
      [2140]    chr24   [3319273, 3327017]      *
      [2141]    chr24   [5549545, 5557289]      *
      seqinfo: 31 sequences from an unspecified genome; no seqlengths

That I can transform in data.table


Example to be used:

    olaps<-"seqnames    start      end width strand
    chr1  1665929  1673673  7745      *
    chr1  2587465  2595209  7745      *
    chr1  8143785  8151529  7745      *
    chr2  9779705  9787449  7745      *
    chr2 10281129 10288873  7745      *"

Expected outcome:
something like this (fasta format):


I have tried to use this package unsuccessful until now: I got the Reference Genome fasta:

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

And I am trying this:

Biostrings::writeXStringSet(seq, olaps)

With this error message:

Error in .open_output_file(filepath, append, compress, compression_level) :
  'filepath' must be a single string


ADD COMMENTlink modified 2.3 years ago • written 2.7 years ago by pertille0

Don't worry about making a data.table, it's irrelevant to your work flow here.

It seems like you've extracted the sequences you're interested in 

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

and you're just missing the last step

Biostrings::writeXStringSet(seq, "my.fasta")

Before writing the output, you might wish to add names to the sequences, 

names(seq) = paste0("SEQUENCE_", seq_along(seq))


ADD REPLYlink written 2.7 years ago by Martin Morgan ♦♦ 22k

Thank you @Martin Morgan, the problem is that when I run this command:

{seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)}

The follow error message appear:

Error in loadFUN(x, seqname, ranges) :
  trying to load regions beyond the boundaries of non-circular sequence "chr4"

What is weird, because this olaps is a result of CNV call from CNV-seq package and comes from the same reference genome...


ADD REPLYlink written 2.7 years ago by pertille0

The software isn't lying; some of your ranges are outside the sequence lengths of the BSgenome object. You're right that the most common cause for this is the use of different reference genomes. Other problems might be use of 0-based versus 1-based coordinates (Bioconductor uses 1-based coordinates; did you import your BED file using rtracklayer::import()?) or a similar but not identical genome build, especially for mitochondrial or similar chromosomes. You can get the genomic ranges of the genome in the BSgenome object with

genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")

and use this to find the overlapping ranges that are out of bounds

olaps[!olaps %within% genomerngs]

If you have the reference genome FASTA file used for copy number calling, it is relatively easy to use it instead of a BSgenome package. See C: problem creating txdb object and bsgenome for solanum pennellii.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Martin Morgan ♦♦ 22k

Thank you @Martin
Do you know why this error?
The command looks like correct according to the manual:

genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4, "GRanges"))

Error in .identC(Class, "double") :
  argument "Class" is missing, with no default


ADD REPLYlink written 2.7 years ago by pertille0

Because there was a typo in the placement of parentheses, now fixed ---as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")

ADD REPLYlink written 2.7 years ago by Martin Morgan ♦♦ 22k

I further updated my response to use %within% rather than %in%.

ADD REPLYlink written 2.7 years ago by Martin Morgan ♦♦ 22k
gravatar for Mike Smith
2.7 years ago by
Mike Smith2.9k
EMBL Heidelberg / de.NBI
Mike Smith2.9k wrote:

It looks like you're fairly close to what you want to get done - you should be able to do it using GenomicRanges and the appropriate BSgenome package.

I would suggest looking at the GenomicRanges vignette here (, specifically section 2.14.  This has an example of create a GRanges object specifying regions of interest (which you've already done) and then how to apply that with a BSgenome object.  You will then be able to write the result out to a fasta file.

ADD COMMENTlink written 2.7 years ago by Mike Smith2.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 359 users visited in the last hour