Question: computeSumFactors: warning about cluster size
sfpacman wrote (7 days ago):

I am running computeSumFactors on a rather large expression matrix (68K by 20K). Instead of running quickCluster on such a large matrix, I use the k-means clustering result from a previous analysis (PCA of the top 1000 variable genes). However, it returns the following warning on completion: number of cells in each cluster should be at least twice that of the largest 'sizes'. I was wondering whether it will have any adverse effect on the size factors when the cells are not evenly distributed across the clusters.


Aaron Lun (Cambridge, United Kingdom) wrote (7 days ago):

No, it's fine. We threw in warnings when we were developing the method, but later on we found that it didn't really matter, as long as you have enough cells (usually at least 100) in each cluster, which ensures that you get precise estimates. As of the next release, this particular warning will be removed; the function will also be more tolerant of clusters containing fewer cells than sizes, producing warnings rather than errors.
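To see which clusters would trigger the warning, and which fall below the rough 100-cell guideline, you can simply tabulate the cluster labels before calling computeSumFactors. The following is a minimal sketch of that check in plain Python, not scran code: the function name, the example labels and the default pool sizes of 20 to 100 are assumptions made for illustration.

```python
from collections import Counter

def check_cluster_sizes(clusters, sizes=(20, 40, 60, 80, 100), min_cells=100):
    """Flag clusters that are small relative to the pool sizes.

    `sizes` and `min_cells` are illustrative assumptions, not values
    read from any particular scran release.
    """
    counts = Counter(clusters)
    largest = max(sizes)
    report = {}
    for label, n in sorted(counts.items()):
        report[label] = {
            "n_cells": n,
            # The condition described by the warning text.
            "below_twice_largest_size": n < 2 * largest,
            # The rough guideline for precise estimates.
            "below_min_cells": n < min_cells,
        }
    return report

# Hypothetical cluster assignments with uneven sizes.
clusters = ["a"] * 500 + ["b"] * 150 + ["c"] * 60
report = check_cluster_sizes(clusters)
for label, info in report.items():
    print(label, info)
```

In this made-up example, cluster "b" would trigger the warning but still meets the 100-cell guideline, whereas cluster "c" is the kind of cluster worth merging or inspecting.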

Note that clustering prior to computeSumFactors should be done in a way that is insensitive to the size factors. Otherwise, in extreme cases, you will cluster cells that have similar library sizes rather than those with similar expression profiles. We suggest computing ranks (e.g., with quickCluster and get.ranks=TRUE) and running a clustering algorithm on those instead. You can use k-means, or you can try quickCluster with method="igraph" (parallelizable via BiocParallel). This uses a community detection algorithm for clustering, which avoids constructing the distance matrix for large numbers of cells.
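As a toy illustration of why ranks help, the sketch below (plain Python, not the quickCluster implementation) ranks the genes within each cell. It assumes an idealised case in which one cell is an exact fourfold scaling of another; real depth differences are noisier, and the function name and counts are made up.

```python
def average_ranks(values):
    """Rank values within one cell, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Hypothetical counts for five genes in one cell, and the same profile
# sequenced four times as deeply.
cell = [0, 3, 10, 3, 25]
deeper_cell = [4 * x for x in cell]

raw_distance = sum((a - b) ** 2 for a, b in zip(cell, deeper_cell)) ** 0.5
rank_distance = sum(
    (a - b) ** 2
    for a, b in zip(average_ranks(cell), average_ranks(deeper_cell))
) ** 0.5

print(average_ranks(cell))
print(raw_distance, rank_distance)
```

The raw counts of the two cells are far apart, while their rank vectors are identical, so a clustering run on the ranks would group them by expression profile rather than by library size.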

Also, you misspelt the tag, which is why I didn't see this post until now.


Thank you very much. I think I have an older version of scran (scran_1.2.2), because quickCluster does not have the method and get.ranks options when I try to use it. It may sound silly, but I wonder if there is any way to work around this. If not, I will probably have to upgrade both R and Bioconductor.

Reply by sfpacman (written 7 days ago, modified 7 days ago)

Yes, upgrading R and Bioconductor would be wise. The single-cell field moves quickly so you really want to get the latest versions of all packages. I personally switch to new versions of R as soon as they are available.

Reply by Aaron Lun (written 7 days ago)

Thanks, I have installed the latest version. I am wondering how to use BiocParallel for quickCluster with the igraph option. Also, is it normal to consume 60 to 80 GB of memory for a matrix of this size?

Reply by sfpacman (written 2 days ago, modified 2 days ago)

Ah, the parallelization is only supported in the devel version at the moment; got my wires crossed. Currently we're transitioning to the SingleCellExperiment class, which is pretty hectic; so until the next release (next month, I think?) the current version of scran will not receive any new features.

As for the memory consumption, that is somewhat unusual, though not impossible, as the current version of scran (due to the limitation of ExpressionSet objects) represents all data inefficiently as dense matrices. The next version will provide proper support for sparse and file-backed matrices, so this should cease to be a major issue.
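For a sense of scale, here is a back-of-the-envelope estimate in plain Python. It assumes 20,000 genes by 68,000 cells stored as 8-byte doubles, and, for the sparse case, a purely hypothetical 10% of non-zero entries stored in compressed column form (one 8-byte value and one 4-byte row index per non-zero entry, plus a 4-byte pointer per column). None of these figures were measured on the data in question.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory for one dense copy of the matrix."""
    return n_rows * n_cols * bytes_per_value

def sparse_csc_bytes(n_rows, n_cols, fraction_nonzero,
                     value_bytes=8, index_bytes=4):
    """Approximate memory for a compressed sparse column layout."""
    nnz = round(n_rows * n_cols * fraction_nonzero)
    return nnz * (value_bytes + index_bytes) + (n_cols + 1) * index_bytes

# Assumption: 20K genes (rows) by 68K cells (columns).
n_genes, n_cells = 20_000, 68_000

dense = dense_bytes(n_genes, n_cells)
# fraction_nonzero=0.1 is a hypothetical density chosen for illustration.
sparse = sparse_csc_bytes(n_genes, n_cells, fraction_nonzero=0.1)

print(f"one dense copy:  {dense / 1e9:.2f} GB")
print(f"one sparse copy: {sparse / 1e9:.2f} GB")
```

Under these assumptions a single dense copy is already close to 11 GB, so a footprint of 60 to 80 GB would correspond to several dense copies being held at once.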

Reply by Aaron Lun (written 2 days ago, modified 2 days ago)