1
2
Entering edit mode
@peterkharchenko-8755
Last seen 4.0 years ago

Dear All,

Just wanted to point out that S4Vectors blocks regular Matrix package operations. Matrix is, perhaps, the most popular sparse matrix package for R, and it would be nice if S4Vectors (which is required by many/most of Bioconductor packages) wouldn't break it. Specific example of how masked colSums() call breaks things is shown below. Perhaps there's some way of doing whatever S4Vectors needs to do without masking common functions.

Thank you!

-peter.

>   m <- matrix(sample(c(0,1),1e2,replace=T),nrow=10)
>   library(Matrix)
>   sm <- Matrix(m,sparse=T)
>   colSums(sm)
[1] 6 7 7 7 6 2 4 5 6 7
>  library(S4Vectors)

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
Reduce, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:Matrix’:

colMeans, colSums, expand, rowMeans, rowSums

The following objects are masked from ‘package:base’:

colMeans, colSums, expand.grid, rowMeans, rowSums

> colSums(sm)
Error in colSums(sm) : 'x' must be an array of at least two dimensions
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] S4Vectors_0.10.3    BiocGenerics_0.18.0 Matrix_1.2-7.1

loaded via a namespace (and not attached):
[1] grid_3.3.1      lattice_0.20-34
> 
S4Vectors Matrix • 1.5k views
1
Entering edit mode

The work around (in case you're trying to get things done; I'm not suggesting that this won't be addressed) is to fully qualify the function you want to use, Matrix::colSums(sm) or in a package to importFrom(Matrix, colSums).

Both packages (actually, BiocGenerics and Matrix) promote colSums to a generic so that specialized methods can be written for different types of objects. The effect is like both functions defining the same plain-old-function; the most recently attached wins. So in your scenario you get BiocGenerics::colSums(), but want Matrix::colSums(). By reversing the order that packages are loaded, the (psychological) fault shifts to Matrix for masking BiocGenerics -- who do they think they are?!

BiocGenerics is there to minimize this type of conflict within the Bioconductor world, so there are many small wins (no conflicts between unrelated Bioconductor packages defining the same generic) but one large loss (similar small wins / large loss occur with collisions between dplyr and Bioconductor, for instance).

0
Entering edit mode

I came to the same solution (explicitly qualifying the package). I see how promoting new functions to generics can cause problems. I have little experience with declaring generics, but thought there was a way to "append" new classes to an existing generic rather than overwrite it. Thanks for the explanations, Martin!

0
Entering edit mode

The situation is not really symmetric: Matrix is a recommended package and BiocGenerics is not. So one could argue that the former has "more authority" and thus any Bioconductor package that needs to implement methods for some of the generics in Matrix should just depend on it. Then no need to redefine these generics in BiocGenerics. Problem with this is that it introduces a dependency on a big package that takes a while to compile only to import its generics. Not very appealing. If the generics currently defined in Matrix were defined in a light-weight package that Matrix depends on, then the situation would be different. The R-core folks actually did something like this with the stats4 package a long time ago: stats4 is a light-weight package that only defines S4 generics for some of the functions defined in the stats package (or at least it looks like this was the idea even though the package contains other stuff like mle or summary). So if the same was done for Matrix then we would have a Matrix4 package with only generic definitions in it. Then Matrix, matrixStats (CRAN), BufferedMatrix (Bioconductor), DelayedArray (Bioconductor), etc... would import its generics to define methods on them.

This is a recurrent issue and it has been discussed many times over the years with various solutions proposed but not much agreement on what's the best way to deal with this.

H.

0
Entering edit mode

The size of the installed Matrix package on my system (MacOS) is 5.8M, the load time is ~1.5sec, and most users don't compile but install binaries - so, compared to everything else we are lugging around in BioC I wonder how bad just having BiocGenerics depend on Matrix would really be...?

0
Entering edit mode

Except that the generics in question derive from functions in the base package, so it would be something like "base4" but probably just call it "generics". Or, we could just make the functions in question S3 generics.

0
Entering edit mode

A package with domain-specific generics sounds more appealing to me. Less parties involved and more focus. Maybe it would make it more likely to happen. Also the Bioconductor packages that define methods on these generics define S4 methods and therefore need S4 generics.

Anyway I realize now that the Matrix package doesn't define the colSums generics, only a method, thus relying on the implicit generic promotion of base::colSums. So Bioconductor packages that need to define colSums method just need to do the same and don't need to depend on Matrix. Note that this is actually what they were doing before colSums was added to BiocGenerics in November. So everybody was being a good citizen and everything was fine (no conflict).

I'm wondering if having an explicit colSums generic in BiocGenerics is worth the trouble?

H.

0
Entering edit mode

I can't remember why I added those, but there must have been some impetus.

Could fix the definition in Matrix and settle on Matrix as being the S4-ified matrix package. But changing Matrix tends to be costly, and it's pretty heavy. Presumably stats4 would have turned out the same but they just stopped working on it.

Or, the methods package could define implicit generics for them (it already does this for some functions), and then all of those setGeneric("colSums") calls, including the one in Matrix, would just work. But that would be an even more costly change, as I'm sure people have come to specify "ANY" for the other args in the signature.

The methods package could be made smarter so that it could merge generics with the same signature, even when it's not the default signature. While it's not great having generic definitions scattered everywhere, at least things would work.

0
Entering edit mode
@peterkharchenko-8755
Last seen 4.0 years ago

I came to the same solution (explicitly qualifying the package). I see how promoting new functions to generics can cause problems. I have little experience with declaring generics, but thought there was a way to "append" new classes to an existing generic rather than overwrite it. Thanks for the explanations, Martin.