Subsetting a GenomicRanges::GRanges with "%in%" fails
1
1
Entering edit mode
Aditya ▴ 160
@aditya-7667
Last seen 2.4 years ago
Germany

Subsetting a GRanges with == works:

granges <- GenomicRanges::GRanges(
               c('chr1', 'chr2'), 
               '100-200', 
               seqinfo =  GenomeInfoDb::seqinfo(BSgenome.Mmusculus.UCSC.mm10::Mmusculus))
subset(granges, seqnames == 'chr1')

But subsetting with %in% fails:

subset(granges, seqnames %in% c('chr1', 'chr3'))
Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
genomicranges • 2.8k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 2.9 years ago
United States

Running that code fails for me:

> granges <- GenomicRanges::GRanges(
+                               c('chr1', 'chr2'), 
+                               '100-200')
> granges %>% subset(seqnames == 'chr1')
Error in granges %>% subset(seqnames == "chr1") : 
  could not find function "%>%"

But the problem is likely that the %in%() generic is not visible, so the version in base is being called, which does not support the run-length encoding of the seqnames (an Rle object). The easiest solution is to just call library(GenomicRanges).

ADD COMMENT
0
Entering edit mode

Thank you Michael :-). Sorry for forgetting require(magrittr). Question now updated without magrittr to keep it simple. What would be a clean-namespace alternative to require(GenomicRanges) for GRanges subsetting?

ADD REPLY
1
Entering edit mode

You could always do something like IRanges::`%in%`(x, y). Keeping the workspace clean can be a pain though, because you'll constantly have to remember the origin of each function.

ADD REPLY
0
Entering edit mode

I see, IRanges is the place where it lives - thanks!

(Michael, merging my answer - based on your comments - into your's and then deleting mine could keep things clean for future reference)

ADD REPLY
1
Entering edit mode

Nope, %in% doesn't live in IRanges. IRanges used to define S4 methods for %in% but not anymore. The fact that IRanges still exports %in% is a leftover from a long time ago. That needs to be removed. The S4Vectors package defines the %in% method for Vector derivatives. Note that %in% is an implicit generic. This means that we don't have a setGeneric() statement somewhere that promotes it to an S4 generic function. It just gets automatically promoted on the first setMethod statement.

A cleaner situation would be to define the %in% generic with an explicit setGeneric statement in BiocGenerics. We will do this soon. Then all you need to do is import the generic from BiocGenerics in your NAMESPACE. Then just use %in% normally. This will call the generic, which takes care of dispatching to the appropriate method. Always call the generic. Trying to call a particular method is not robust.

So no need to do things like S4Vectors::`%in%`(x, y) or BiocGenerics::`%in%`(x, y). If you fully import BiocGenerics and S4Vectors (with import(BiocGenerics) and import(S4Vectors)), which I strongly recommend, then you can start using %in% now without having to worry where the generic is defined.

H.

ADD REPLY
0
Entering edit mode

Oh, that's illuminating - thx!

Is the following understanding correct?

  • BioCgenerics contains all generics required by BioC classes. When a need for a new generic arises, it is explicitated in this package and exported. The idiom is the @import BiocGenerics, and then one is free to play.

  • S4Vectors::Vector is the fundamental BioC datatype. Other S4 classes, e.g. GenomicRanges, SummarizedExperiment, etc. inherit from Vector and build further on that. Therefore, the idiom is to also @import S4Vectors and then the BioC playfield lays wide open.

ADD REPLY
1
Entering edit mode

Not exactly as there are generics specific to certain data types defined by the packages you mention and many others. Unless you are using a package for a very specific reason (like a single utility function), it's simplest to just import the package in bulk.

ADD REPLY
0
Entering edit mode

I see - it looks like I have to switch idioms then.

My default idiom has been to maintain a fully clean namespace, knowing exactly which function from which package is being used, being 100 % sure that no namespace clashes can occur.

But it looks like I need to drop that paradigm in the S4 world...

ADD REPLY
0
Entering edit mode

Would the following statement be right then?

On the one hand BiocGenerics exports generics common to multiple BioC packages.

On the other hand, GenomicRanges exports GRanges-specific generics, SummarizedExperiment exports SummarizedExperiment-specific generics, etc.

ADD REPLY
1
Entering edit mode

Yes, that's the idea. See ?BiocGenerics for the 2 kinds of generics we define in the BiocGenerics package. Also note that the proteomics folks define their own set of S4 generics in the ProtGenerics package.

ADD REPLY
0
Entering edit mode

Thanks Herve! So the apparant namespace clashes upon require(BiocGenerics) are actually intentional promotions of base primitives and S3 generics to S4 generics?

ADD REPLY
0
Entering edit mode

Exactly.

ADD REPLY
0
Entering edit mode

FYI %in% is now an explicit S4 generic defined in the BiocGenerics package: https://github.com/Bioconductor/BiocGenerics/commit/30d0813751536f0dd03b36d1f90c116e889d1954

H.

ADD REPLY
0
Entering edit mode

Thanks Herve, also for providing the update :-)

ADD REPLY

Login before adding your answer.

Traffic: 532 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6