Question: computeSumFactors: warning about cluster size
11 months ago by
sfpacman0
sfpacman0 wrote:

I am running computeSumFactors on a rather large expression matrix (68K ~ 20K). Instead of running quickCluster on such a large matrix, I used the k-means clustering result from a previous analysis (PCA of the top 1000 variable genes). However, on completion it returns the following warning: "number of cells in each cluster should be at least twice that of the largest 'sizes'". I was wondering whether there will be any adverse effect on the size factors when the cells are not evenly distributed across the clusters.

11 months ago by
Aaron Lun20k
Cambridge, United Kingdom
Aaron Lun20k wrote:

No, it's fine. We threw in warnings when we were developing the method, but later on we found that it didn't really matter, as long as you have enough cells in each cluster (usually at least 100), which ensures that you get precise estimates. As of the next release, this particular warning will be removed; the function will also be more tolerant of clusters that contain fewer cells than sizes, producing warnings rather than errors.
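A quick way to see whether any cluster falls below that rough 100-cell guideline is to tabulate the cluster labels before passing them on. The following is only a sketch in base R: the `clusters` vector is a made-up stand-in for a real k-means result, and `min_cells` is a hypothetical name for the threshold mentioned above.

```r
# Hypothetical cluster labels standing in for a real k-means result:
# 700 cells in cluster A, 250 in B and 50 in C.
clusters <- rep(c("A", "B", "C"), times = c(700, 250, 50))

# Count the number of cells assigned to each cluster.
cluster_sizes <- table(clusters)

# Rough guideline from the answer above (an assumption, not a hard rule).
min_cells <- 100

# Names of the clusters that contain fewer cells than the guideline.
small <- names(cluster_sizes)[cluster_sizes < min_cells]
print(cluster_sizes)
print(small)
```

Clusters listed in `small` are the ones where the estimates are likely to be less precise, so they are candidates for merging with a neighbouring cluster.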

Note that clustering prior to computeSumFactors should be done in a way that is insensitive to the size factors. Otherwise, in extreme cases, you cluster cells that have similar library sizes, rather than those with similar expression profiles. We suggest doing something like computing ranks (e.g., with quickCluster and get.ranks=TRUE) and running a clustering algorithm on that instead. You can use k-means, or you can try quickCluster with method="igraph" (parallelizable via BiocParallel). This uses a community-based detection algorithm for clustering, which avoids constructing the distance matrix for large numbers of cells.
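To illustrate why ranks are a sensible input for the clustering step, the sketch below uses base R only (no scran calls) and simulated data. It multiplies every cell by its own scaling factor and checks that the within-cell gene ranks do not change, even though the library sizes do. The matrix dimensions, the gamma-distributed values and the names `expr`, `size_factor` and `scaled` are all assumptions made up for the example.

```r
set.seed(42)
n_genes <- 200
n_cells <- 60

# Simulated expression values, genes in rows and cells in columns.
expr <- matrix(rgamma(n_genes * n_cells, shape = 2, rate = 0.1),
               nrow = n_genes, ncol = n_cells)

# One scaling factor per cell, mimicking differences in library size.
# Powers of two keep the multiplication exact in floating point,
# so the ranks can be compared with identical().
size_factor <- 2^sample(-2:2, n_cells, replace = TRUE)
scaled <- sweep(expr, 2, size_factor, "*")

# Rank the genes within each cell, before and after scaling.
ranks_original <- apply(expr, 2, rank)
ranks_scaled <- apply(scaled, 2, rank)

# The per-cell totals differ, but the within-cell ranks are unchanged.
same_ranks <- identical(ranks_original, ranks_scaled)
print(same_ranks)
print(range(colSums(scaled) / colSums(expr)))
```

Any clustering run on `ranks_scaled` therefore sees the same input regardless of the per-cell scaling, which is the property wanted before size factors have been estimated.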

Also, you misspelt the tag, which is why I didn't see this post until now.

Thank you very much. I think I have an older version of scran (scran_1.2.2), because quickCluster does not have the method and get.ranks options when I try to use it. It may sound silly, but I wonder if there is any way to work around this. If not, I will probably have to upgrade both R and Bioconductor.

Yes, upgrading R and Bioconductor would be wise. The single-cell field moves quickly so you really want to get the latest versions of all packages. I personally switch to new versions of R as soon as they are available.

Thanks, I have installed the latest version. I am wondering how to use BiocParallel for quickCluster with the igraph option. Also, is it normal to consume 60 to 80 GB of memory for a matrix of this size?

Ah, the parallelization is only supported in the devel version at the moment; got my wires crossed. Currently we're transitioning to the SingleCellExperiment class, which is pretty hectic, so until the next release (next month, I think?) the current version of scran will not receive any new features.

As for the memory consumption, that is somewhat unusual, though not impossible, as the current version of scran (due to the limitation of ExpressionSet objects) represents all data inefficiently as dense matrices. The next version will provide proper support for sparse and file-backed matrices, so this should cease to be a major issue.
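For a sense of scale, here is a back-of-the-envelope estimate in base R. It assumes the matrix is 68,000 by 20,000 and stored as double-precision values, and it compares that with a compressed sparse column layout (one 8-byte value plus one 4-byte row index per non-zero entry, plus one 4-byte pointer per column). The 5% density is purely an assumed figure for illustration, not something measured on this dataset.

```r
n_rows <- 68000
n_cols <- 20000

# Dense storage: every entry is an 8-byte double.
dense_gb <- n_rows * n_cols * 8 / 1e9

# Sparse storage, assuming (hypothetically) that 5% of entries are non-zero.
density <- 0.05
n_nonzero <- n_rows * n_cols * density
sparse_gb <- (n_nonzero * (8 + 4) + (n_cols + 1) * 4) / 1e9

print(dense_gb)
print(sparse_gb)
```

Under these assumptions a single dense copy takes about 10.9 GB, so a handful of intermediate copies made during processing would be enough to reach tens of gigabytes, whereas the sparse form would stay under 1 GB.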