How to subset a BSgenome
1
0
Entering edit mode
Aditya ▴ 160
@aditya-7667
Last seen 21 months ago
Germany
# Load a BSgenome object
bs <-  BSgenome.Mmusculus.UCSC.mm10::Mmusculus

# Extracting a single chromosome works fine
bs[['chr1']]
bs[['chr2']]

# Question: how to take a BSgenome subset?
bs[c('chr1', 'chr2')]
Error in bsgenome[c("chr1", "chr2")] : object of type 'S4' is not subsettable
BSgenome BSgenome.Mmusculus.UCSC.mm10 • 2.3k views
ADD COMMENT
1
Entering edit mode
shepherl 3.8k
@lshep
Last seen 1 hour ago
United States

The [[ method will only work on a single chromosome. You can also use the getSeq function that has more utility for subsetting for sequences. The getSeq will take a list of chromosomes or a GRanges object for subsetting

> getSeq(bs, c("chr1","chr2"))
  A DNAStringSet instance of length 2
        width seq                                           names               
[1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1
[2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr2

as proof of principle to create a GRanges object to subset and get sequences

> gr =  as(seqinfo(bs)[c("chr1","chr2")], "GRanges")
> gr
GRanges object with 2 ranges and 0 metadata columns:
       seqnames      ranges strand
          <Rle>   <IRanges>  <Rle>
  chr1     chr1 1-195471971      *
  chr2     chr2 1-182113224      *
  -------
  seqinfo: 2 sequences from mm10 genome
> getSeq(bs, gr)
  A DNAStringSet instance of length 2
        width seq                                           names               
[1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1
[2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr2

ADD COMMENT
0
Entering edit mode

Thank you Lori :-). Would it be useful/easy to create a subsetter method? Then

Biostrings::vcountPDict(targetseqs, bsgenome) could be easily modified into Biostrings::vcountPDict(targetseqs, bsgenome[c('chr1', 'chr2')])

Or would you suggest using vcountPDict.Biostrings rather than vcountPDict.BSgenome ?

ADD REPLY
0
Entering edit mode

How would this be different than what thegetSeq already does as displayed above?

ADD REPLY
0
Entering edit mode

I agree a [ method for BSgenome objects would be nice. Many years ago the bsapply() functionality was added to the BSgenome package to work around this. Note that bsapply() supports the exclude argument to exclude some chormosomes. See ?bsapply.

Also please note that the BSgenome package defines vmatchPattern()/vcountPattern()/vmatchPDict()/vcountPDict() methods for BSgenome objects. See ?`vmatchPattern,BSgenome-method`. These methods use bsapply() internally and also support the exclude argument. But yes, being able to subset the BSgenome object before passing it to vcountPDict() would be more elegant/intuitive. Unfortunately improving the BSgenome functionality is not high priority at the moment.

H.

ADD REPLY
0
Entering edit mode

And by [ method for BSgenome objects, I mean an endomorphic subsetter i.e. a subsetting operation that returns another BSgenome object. The getSeq-based subsetting that Lori showed you is not endomorphic because it returns the selected sequences in a DNAStringSet object instead of a BSgenome object.

BTW this getSeq-based subsetting can be done more simply with:

> getSeq(bs, c("chr1", "chr2"))
  A DNAStringSet instance of length 2
        width seq                                           names               
[1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1
[2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr2

No need to use a GRanges object. See ?`getSeq,BSgenome-method`

H.

ADD REPLY

Login before adding your answer.

Traffic: 841 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6