Question: How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?
0
4.0 years ago by
pertille0
Sweden
pertille0 wrote:

How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?
The reference genome used is "Gallus gallus" that can be obtained by:

    source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome.Ggallus.UCSC.galGal4")
library(BSgenome.Ggallus.UCSC.galGal4)

My data file is a result of gRanges package

    library("GenomicRanges")

> olaps
GRanges object with 2141 ranges and 0 metadata columns:
seqnames               ranges strand
<Rle>            <IRanges>  <Rle>
[1]    chr14 [ 1665929,  1673673]      *
[2]    chr14 [ 2587465,  2595209]      *
[3]    chr14 [ 8143785,  8151529]      *
[4]    chr14 [ 9779705,  9787449]      *
[5]    chr14 [10281129, 10288873]      *
...      ...                  ...    ...
[2137]    chr24   [3280553, 3288297]      *
[2138]    chr24   [3330889, 3338633]      *
[2139]    chr24   [3005641, 3015321]      *
[2140]    chr24   [3319273, 3327017]      *
[2141]    chr24   [5549545, 5557289]      *
-------
seqinfo: 31 sequences from an unspecified genome; no seqlengths

That I can transform in data.table

    olaps<- as.data.table(olaps)

Example to be used:

    olaps<-"seqnames    start      end width strand
chr1  1665929  1673673  7745      *
chr1  2587465  2595209  7745      *
chr1  8143785  8151529  7745      *
chr2  9779705  9787449  7745      *
chr2 10281129 10288873  7745      *"
olaps<-read.table(text=olaps,header=T)

Expected outcome:
something like this (fasta format):

    >SEQUENCE_1
ACTGACTAGCATCGCAT...
>SEQUENCE_2
ACGTAGAGAGGGACATA...
>SEQUENCE_3...


I have tried to use this package unsuccessful until now: I got the Reference Genome fasta:

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

And I am trying this:

Biostrings::writeXStringSet(seq, olaps)

With this error message:

Error in .open_output_file(filepath, append, compress, compression_level) :
'filepath' must be a single string

granges R bioconductor • 5.4k views
modified 3.6 years ago • written 4.0 years ago by pertille0
2

Don't worry about making a data.table, it's irrelevant to your work flow here.

It seems like you've extracted the sequences you're interested in

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

and you're just missing the last step

Biostrings::writeXStringSet(seq, "my.fasta")

Before writing the output, you might wish to add names to the sequences,

names(seq) = paste0("SEQUENCE_", seq_along(seq))

Thank you @Martin Morgan, the problem is that when I run this command:

{seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)}

Error in loadFUN(x, seqname, ranges) :
trying to load regions beyond the boundaries of non-circular sequence "chr4"

What is weird, because this olaps is a result of CNV call from CNV-seq package and comes from the same reference genome...

1

The software isn't lying; some of your ranges are outside the sequence lengths of the BSgenome object. You're right that the most common cause for this is the use of different reference genomes. Other problems might be use of 0-based versus 1-based coordinates (Bioconductor uses 1-based coordinates; did you import your BED file using rtracklayer::import()?) or a similar but not identical genome build, especially for mitochondrial or similar chromosomes. You can get the genomic ranges of the genome in the BSgenome object with

genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")

and use this to find the overlapping ranges that are out of bounds

olaps[!olaps %within% genomerngs]

If you have the reference genome FASTA file used for copy number calling, it is relatively easy to use it instead of a BSgenome package. See C: problem creating txdb object and bsgenome for solanum pennellii.

Thank you @Martin
Do you know why this error?
The command looks like correct according to the manual:
http://finzi.psych.upenn.edu/library/GenomicRanges/html/GRanges-class.html

genomerngs = as(seqinfo(BSgenome.Ggallus.UCSC.galGal4, "GRanges"))

Error in .identC(Class, "double") :
argument "Class" is missing, with no default

Because there was a typo in the placement of parentheses, now fixed ---as(seqinfo(BSgenome.Ggallus.UCSC.galGal4), "GRanges")

I further updated my response to use %within% rather than %in%.

Answer: How can I extract sequences from a FASTA file for each of the intervals defined
1
4.0 years ago by
Mike Smith4.0k
EMBL Heidelberg / de.NBI
Mike Smith4.0k wrote:

It looks like you're fairly close to what you want to get done - you should be able to do it using GenomicRanges and the appropriate BSgenome package.

I would suggest looking at the GenomicRanges vignette here (http://bioconductor.org/packages/3.3/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf), specifically section 2.14.  This has an example of create a GRanges object specifying regions of interest (which you've already done) and then how to apply that with a BSgenome object.  You will then be able to write the result out to a fasta file.