Question

How to subset a BSgenome

0

Entering edit mode

Aditya ▴ 160

@aditya-7667

Last seen 3.1 years ago

Germany

# Load a BSgenome object
bs <-  BSgenome.Mmusculus.UCSC.mm10::Mmusculus

# Extracting a single chromosome works fine
bs[['chr1']]
bs[['chr2']]

# Question: how to take a BSgenome subset?
bs[c('chr1', 'chr2')]
Error in bsgenome[c("chr1", "chr2")] : object of type 'S4' is not subsettable

BSgenome BSgenome.Mmusculus.UCSC.mm10 • 3.3k views

ADD COMMENT • link updated 5.9 years ago by shepherl 4.2k • written 5.9 years ago by Aditya ▴ 160

score 1 · Answer 1 · 2019-08-05

1

Entering edit mode

shepherl 4.2k

@lshep

Last seen 6 hours ago

United States

The [[ method will only work on a single chromosome. You can also use the getSeq function that has more utility for subsetting for sequences. The getSeq will take a list of chromosomes or a GRanges object for subsetting

> getSeq(bs, c("chr1","chr2"))
  A DNAStringSet instance of length 2
        width seq                                           names               
[1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1
[2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr2

as proof of principle to create a GRanges object to subset and get sequences

> gr =  as(seqinfo(bs)[c("chr1","chr2")], "GRanges")
> gr
GRanges object with 2 ranges and 0 metadata columns:
       seqnames      ranges strand
          <Rle>   <IRanges>  <Rle>
  chr1     chr1 1-195471971      *
  chr2     chr2 1-182113224      *
  -------
  seqinfo: 2 sequences from mm10 genome
> getSeq(bs, gr)
  A DNAStringSet instance of length 2
        width seq                                           names               
[1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1
[2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr2

ADD COMMENT • link 5.9 years ago shepherl 4.2k

0

Entering edit mode

Thank you Lori :-). Would it be useful/easy to create a subsetter method? Then

Biostrings::vcountPDict(targetseqs, bsgenome) could be easily modified into Biostrings::vcountPDict(targetseqs, bsgenome[c('chr1', 'chr2')])

Or would you suggest using vcountPDict.Biostrings rather than vcountPDict.BSgenome ?

ADD REPLY • link 5.9 years ago Aditya ▴ 160

0

Entering edit mode

How would this be different than what thegetSeq already does as displayed above?

ADD REPLY • link 5.9 years ago shepherl 4.2k

0

Entering edit mode

I agree a [ method for BSgenome objects would be nice. Many years ago the bsapply() functionality was added to the BSgenome package to work around this. Note that bsapply() supports the exclude argument to exclude some chormosomes. See ?bsapply.

Also please note that the BSgenome package defines vmatchPattern()/vcountPattern()/vmatchPDict()/vcountPDict() methods for BSgenome objects. See ?`vmatchPattern,BSgenome-method`. These methods use bsapply() internally and also support the exclude argument. But yes, being able to subset the BSgenome object before passing it to vcountPDict() would be more elegant/intuitive. Unfortunately improving the BSgenome functionality is not high priority at the moment.

H.

ADD REPLY • link 5.8 years ago Hervé Pagès 16k

0

Entering edit mode

And by [ method for BSgenome objects, I mean an endomorphic subsetter i.e. a subsetting operation that returns another BSgenome object. The getSeq-based subsetting that Lori showed you is not endomorphic because it returns the selected sequences in a DNAStringSet object instead of a BSgenome object.

BTW this getSeq-based subsetting can be done more simply with:

> getSeq(bs, c("chr1", "chr2"))
  A DNAStringSet instance of length 2
        width seq                                           names               
[1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1
[2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr2

No need to use a GRanges object. See ?`getSeq,BSgenome-method`

H.

ADD REPLY • link 5.8 years ago Hervé Pagès 16k