Search
Question: How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?
0
gravatar for pertille
18 months ago by
pertille0
Sweden
pertille0 wrote:

How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?
The reference genome used is "Gallus gallus" that can be obtained by:

    source("http://bioconductor.org/biocLite.R")
    biocLite("BSgenome.Ggallus.UCSC.galGal4")
        library(BSgenome.Ggallus.UCSC.galGal4)

My data file is a result of gRanges package 

    library("GenomicRanges")

    > olaps
    GRanges object with 2141 ranges and 0 metadata columns:
             seqnames               ranges strand
                <Rle>            <IRanges>  <Rle>
         [1]    chr14 [ 1665929,  1673673]      *
         [2]    chr14 [ 2587465,  2595209]      *
         [3]    chr14 [ 8143785,  8151529]      *
         [4]    chr14 [ 9779705,  9787449]      *
         [5]    chr14 [10281129, 10288873]      *
         ...      ...                  ...    ...
      [2137]    chr24   [3280553, 3288297]      *
      [2138]    chr24   [3330889, 3338633]      *
      [2139]    chr24   [3005641, 3015321]      *
      [2140]    chr24   [3319273, 3327017]      *
      [2141]    chr24   [5549545, 5557289]      *
      -------
      seqinfo: 31 sequences from an unspecified genome; no seqlengths

That I can transform in data.table

    olaps<- as.data.table(olaps)

Example to be used:

    olaps<-"seqnames    start      end width strand
    chr1  1665929  1673673  7745      *
    chr1  2587465  2595209  7745      *
    chr1  8143785  8151529  7745      *
    chr2  9779705  9787449  7745      *
    chr2 10281129 10288873  7745      *"
    olaps<-read.table(text=olaps,header=T)

Expected outcome:
something like this (fasta format):

    >SEQUENCE_1
    ACTGACTAGCATCGCAT...
    >SEQUENCE_2
    ACGTAGAGAGGGACATA...
    >SEQUENCE_3...

I have tried to use this package unsuccessful until now: I got the Reference Genome fasta:

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

And I am trying this:

Biostrings::writeXStringSet(seq, olaps)

With this error message:

Error in .open_output_file(filepath, append, compress, compression_level) :
  'filepath' must be a single string

 

ADD COMMENTlink modified 13 months ago • written 18 months ago by pertille0
2

Don't worry about making a data.table, it's irrelevant to your work flow here.

It seems like you've extracted the sequences you're interested in 

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

and you're just missing the last step

Biostrings::writeXStringSet(seq, "my.fasta")

Before writing the output, you might wish to add names to the sequences, 

names(seq) = paste0("SEQUENCE_", seq_along(seq))

 

ADD REPLYlink written 18 months ago by Martin Morgan ♦♦ 20k

Thank you @Martin Morgan, the problem is that when I run this command:

{seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)}

The follow error message appear:

Error in loadFUN(x, seqname, ranges) :
  trying to load regions beyond the boundaries of non-circular sequence "chr4"

What is weird, because this olaps is a result of CNV call from CNV-seq package and comes from the same reference genome...
 

 

ADD REPLYlink written 18 months ago by pertille0
1

The software isn't lying; some of your ranges are outside the sequence lengths of the BSgenome object. You're right that the most common cause for this is the use of different reference genomes. Other problems might be use of 0-based versus 1-based coordinates (Bioconductor uses 1-based coordinates; did you import your BED file using rtracklayer::import()?) or a similar but not identical genome build, especially for mitochondrial or similar chromosomes. You can get the genomic ranges of the genome in the BSgenome object with

genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")

and use this to find the overlapping ranges that are out of bounds

olaps[!olaps %within% genomerngs]

If you have the reference genome FASTA file used for copy number calling, it is relatively easy to use it instead of a BSgenome package. See C: problem creating txdb object and bsgenome for solanum pennellii.

ADD REPLYlink modified 18 months ago • written 18 months ago by Martin Morgan ♦♦ 20k

Thank you @Martin
Do you know why this error?
The command looks like correct according to the manual:
http://finzi.psych.upenn.edu/library/GenomicRanges/html/GRanges-class.html

genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4, "GRanges"))

Error in .identC(Class, "double") :
  argument "Class" is missing, with no default

 

ADD REPLYlink written 18 months ago by pertille0

Because there was a typo in the placement of parentheses, now fixed ---as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")

ADD REPLYlink written 18 months ago by Martin Morgan ♦♦ 20k

I further updated my response to use %within% rather than %in%.

ADD REPLYlink written 18 months ago by Martin Morgan ♦♦ 20k
1
gravatar for Mike Smith
18 months ago by
Mike Smith1.9k
EMBL Heidelberg / de.NBI
Mike Smith1.9k wrote:

It looks like you're fairly close to what you want to get done - you should be able to do it using GenomicRanges and the appropriate BSgenome package.

I would suggest looking at the GenomicRanges vignette here (http://bioconductor.org/packages/3.3/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf), specifically section 2.14.  This has an example of create a GRanges object specifying regions of interest (which you've already done) and then how to apply that with a BSgenome object.  You will then be able to write the result out to a fasta file.

ADD COMMENTlink written 18 months ago by Mike Smith1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 283 users visited in the last hour