Issue with GSVA using dgCMatrix
1
0
Entering edit mode
resa • 0
@6753a939
Last seen 4.3 years ago
Finland

Hi Robert Castelo ,

I face an error when trying to run gsva function from the GSVA package using a big sparse matrix (class dgCMatrix). The matrix has dimensions of around 30k x 120k. I've noticed that using dgCMatrix objects as input is still in an experimental stage but would like to know if there's some solution under way to this.

The code starts running normally but after a while stops:

Estimating GSVA scores for 50 gene sets.
Estimating ECDFs with Gaussian kernels

Error in as.vector(.Call(Csparse_to_vector, x), mode) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102

Here is some of the traceback:

17: as.vector(.Call(Csparse_to_vector, x), mode)
16: as.vector(x, mode)
15: as.vector(x, mode)
14: as.vector(x)
13: as.double(t(expr[, sample.idxs, drop = FALSE]))
12: as.double(t(expr[, sample.idxs, drop = FALSE]))
11: compute.gene.density(expr, sample.idxs, rnaseq, kernel)
10: compute.geneset.es(expr, gset.idx.list, 1:n.samples, rnaseq = rnaseq,
        abs.ranking = abs.ranking, parallel.sz = parallel.sz, mx.diff = mx.diff,
        tau = tau, kernel = kernel, verbose = verbose, BPPARAM = BPPARAM)
9: .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking,
       parallel.sz, mx.diff, tau, kernel, ssgsea.norm, verbose,
       BPPARAM)
8: .local(expr, gset.idx.list, ...)
7: gsva(logtpm_matrix, gene_sets, kcdf = "Gaussian", min.sz = 5,
       max.sz = 500, parallel.sz = 1, verbose = TRUE)
sessionInfo( )

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] RColorBrewer_1.1-2   GSEABase_1.54.0      graph_1.70.0
 [4] annotate_1.70.0      XML_3.99-0.6         AnnotationDbi_1.54.1
 [7] IRanges_2.26.0       S4Vectors_0.30.0     Biobase_2.52.0
[10] BiocGenerics_0.38.0  GSVA_1.40.1          patchwork_1.1.1
[13] SeuratObject_4.0.2   Seurat_4.0.3         dplyr_1.0.6
GSVA • 4.6k views
ADD COMMENT
1
Entering edit mode
Robert Castelo ★ 3.4k
@rcastelo
Last seen 8 weeks ago
Barcelona/Universitat Pompeu Fabra

Hi,

Indeed we're still working on this for the default GSVA method, but using the ssGSEA method, i.e., adding the argument method="ssgsea" to the call to the gsva() function, should work. For a sparse matrix of these dimensions the ssGSEA method should run in about 1 hour with 15 cores in a modern workstation, i.e., using the argument BPPARAM = MulticoreParam(workers = 15L, progressbar = TRUE).

ADD COMMENT
0
Entering edit mode

Hi, Thank you for the information! I'll try ssGSEA in the meantime.

ADD REPLY
0
Entering edit mode

If I may quickly inquire on this. I tried this:

bpparam <- MulticoreParam(workers = 15L, progressbar = TRUE)

# Run GSVA
gsva_results <- gsva(ssgseaParam(data, gene_sets), BPPARAM=bpparam)

With data being a SingleCellExperiment with a sparse assay matrix X and gene_sets being a GeneSetCollection. Unfortunately however, it converts the sparse matrix to dense:

Warning message in asMethod(object):
"sparse->dense coercion: allocating vector of size 6.3 GiB" <- NOTE THIS
Warning message in .filterFeatures_newAPI(dataMatrix, dropConstantRows = FALSE):
"59552 rows with constant values throughout the samples."

EDIT: with gsva(ssgseaParam(assay(data, "X"), gene_sets), ...) it uses the sparse matrix, however fails at the Normalization step:

Warning message in .filterFeatures_newAPI(dataMatrix, dropConstantRows = FALSE):
"59552 rows with constant values throughout the samples."
No annotation package name available in the input data object.
Attempting to directly match identifiers in data to gene sets.
Setting parallel calculations through a MulticoreParam back-end
with workers=15 and tasks=100.
Estimating ssGSEA scores for 216 gene sets.
[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."
  |======================================================================| 100%

[1] "Normalizing..."
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'wrapData' for signature '"dgCMatrix", "dgCMatrix"'
Traceback:

1. gsva(ssgseaParam(assay(data, "X"), gene_sets), BPPARAM = bpparam)
2. gsva(ssgseaParam(assay(data, "X"), gene_sets), BPPARAM = bpparam)
3. .local(expr, gset.idx.list = gset.idx.list, ...)
4. wrapData(rval, exprData)
5. (function (classes, fdef, mtable) 
 . {
 .     methods <- .findInheritedMethods(classes, fdef, mtable)
 .     if (length(methods) == 1L) 
 .         return(methods[[1L]])
 .     else if (length(methods) == 0L) {
 .         cnames <- paste0("\"", vapply(classes, as.character, 
 .             ""), "\"", collapse = ", ")
 .         stop(gettextf("unable to find an inherited method for function %s for signature %s", 
 .             sQuote(fdef@generic), sQuote(cnames)), domain = NA)
 .     }
 .     else stop("Internal error in finding inherited methods; didn't return a unique method", 
 .         domain = NA)
 . })(list(structure("dgCMatrix", package = "Matrix"), structure("dgCMatrix", package = "Matrix")), 
 .     new("standardGeneric", .Data = function (dataMatrix, container) 
 .     standardGeneric("wrapData"), generic = structure("wrapData", package = "GSVA"), 
 .         package = "GSVA", group = list(), valueClass = character(0), 
 .         signature = c("dataMatrix", "container"), default = NULL, 
 .         skeleton = (function (dataMatrix, container) 
 .         stop(gettextf("invalid call in method dispatch to '%s' (no default method)", 
 .             "wrapData"), domain = NA))(dataMatrix, container)), 
 .     <environment>)
6. stop(gettextf("unable to find an inherited method for function %s for signature %s", 
 .     sQuote(fdef@generic), sQuote(cnames)), domain = NA)

(Also, do you know why 99% (59552) of my genes are considered "constant"? There is quite some variability in the data (it contains a lot of diverse samples though)).

ADD REPLY
0
Entering edit mode

Hi Moritz,

could you please let me know which version of GSVA this is, ideally the output of sessionInfo()? Did you install it using BiocManager::install("GSVA") or any other source or method?

Did your first attempt produce useful results despite the two warnings? Depending on your data. the sparse->dense coercion will use more memory but should otherwise give correct results and while ssGSEA checks for constant genes it does not drop them.

ADD REPLY
0
Entering edit mode

Conversation moved to https://github.com/rcastelo/GSVA/issues/113

The dataset I am using at the moment is just a test dataset. The main dataset is much bigger and it's impossible to work with it in the dense fashion.

May I kindly ask for more specifics on how the constant genes are determined (as they are not 100% constant)? I had a quick search but could not find information on it in the ssGSEA publication.

If they are determinted via a heuristic (e.g. little variance across samples), but not deleted, I wouldn't mind actually.

ADD REPLY
0
Entering edit mode

Briefly, GSVA always tries to identify constant genes and warn about them.

Genes found to be constant will be removed

  • if this is part of the published method, or
  • if they would cause errors in subsequent computation.

In particular, they are not removed when the method is ssGSEA.

Earlier versions of GSVA tried to identify constant genes by checking if their standard deviation == 0 but because of the vagaries of floating point arithmetics this caused issues that are probably best summarized in #87

Hence, in the current release version 1.50.0, genes are considered constant if their SD is less than 1e-10. As you have seen, this arbitrary threshold may result in false positives and worse, it will fail to identify genes with a single non-zero value in a sparse matrix.

Therefore, in the current devel version (that will become 1.52.0 with the release of BioC 3.19 in April), genes will be considered constant, if their max(.) == min(.), over either all values or, in the case of sparse matrices, their non-zero values, as implemented in GSVA:::.filterGenes().

This approach has been chosen because it (hopefully)

  • avoids floating point arithmetics altogether,
  • scales reasonably with number of rows and columns,
  • can be implemented efficiently for dense and sparse matrices via matrixStats and sparseMatrixStats.

I'd of course be grateful for better ideas that meet these criteria. ;-)

ADD REPLY
1
Entering edit mode

This sounds great and helped a lot to pinpoint my issue: I was loading a sparse matrix from h5ad and it failed loading the X matrix (so all values were 0).

ADD REPLY

Login before adding your answer.

Traffic: 381 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6