Question: computeSumFactors: warning about cluster size
Asked 9 weeks ago by sfpacman:

I am running computeSumFactors on a rather large expression matrix (about 68K × 20K). Instead of running quickCluster on such a large matrix, I am using k-means clustering results from a previous analysis (PCA of the top 1000 variable genes). However, it returns the following warning on completion: "number of cells in each cluster should be at least twice that of the largest 'sizes'". Will it have any adverse effect on the size factors if the cells are not evenly distributed across the clusters?


Answered 9 weeks ago by Aaron Lun (Cambridge, United Kingdom):

No, it's fine. We threw in these warnings when we were developing the method, but later on we found that it didn't really matter, as long as you have enough cells in each cluster (usually at least 100), which ensures you get precise estimates. As of the next release, this particular warning will be removed; the function will also be more tolerant of clusters containing fewer cells than the values in sizes, producing warnings rather than errors.
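A quick way to check the "at least 100 cells per cluster" rule of thumb on existing cluster labels is sketched below. This is a language-neutral illustration in plain Python, not part of scran; the function name small_clusters, the min_cells default, and the example labels are all made up for the illustration.

```python
from collections import Counter

def small_clusters(cluster_labels, min_cells=100):
    """Return {cluster: n_cells} for clusters with fewer than min_cells cells.

    min_cells=100 mirrors the rule of thumb quoted above; it is not a
    threshold enforced by any library.
    """
    counts = Counter(cluster_labels)
    return {cluster: n for cluster, n in counts.items() if n < min_cells}

# Hypothetical k-means labels for 450 cells in three clusters.
labels = ["a"] * 300 + ["b"] * 120 + ["c"] * 30
print(small_clusters(labels))  # {'c': 30}
```

Any cluster reported here could be merged with a neighbouring cluster before normalization.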

Note that clustering prior to computeSumFactors should be done in a way that is insensitive to the size factors. Otherwise, in extreme cases, you end up clustering cells that have similar library sizes rather than cells with similar expression profiles. We suggest computing ranks (e.g., with quickCluster and get.ranks=TRUE) and running a clustering algorithm on those instead. You can use k-means, or you can try quickCluster with method="igraph" (parallelizable via BiocParallel). This uses a community detection algorithm for clustering, which avoids constructing the distance matrix for large numbers of cells.
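To see why ranks are insensitive to size factors, here is a minimal sketch in plain Python. It assumes ranks are computed across genes within each cell, with ties given their average rank; it is only an illustration of the idea, not what quickCluster does internally, and average_ranks is a made-up helper name.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run [start, end] over all values tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

# Hypothetical counts for five genes in one cell, and the same cell
# sequenced five times deeper (every count scaled by the same factor).
cell = [0, 3, 3, 10, 1]
deeper_cell = [5 * x for x in cell]

print(average_ranks(cell))  # [1.0, 3.5, 3.5, 5.0, 2.0]
print(average_ranks(cell) == average_ranks(deeper_cell))  # True
```

Because scaling a cell by a positive constant leaves its ranks unchanged, any clustering run on the ranks cannot group cells by library size alone.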

Also, you misspelt the tag, which is why I didn't see this post until now.


Thank you very much. I think I have an older version of scran (scran_1.2.2), because quickCluster does not have the method and get.ranks options when I try to use it. It may sound silly, but I wonder if there is any way to work around this. If not, I will probably have to upgrade both R and Bioconductor.

Reply written 9 weeks ago by sfpacman

Yes, upgrading R and Bioconductor would be wise. The single-cell field moves quickly so you really want to get the latest versions of all packages. I personally switch to new versions of R as soon as they are available.

Reply written 9 weeks ago by Aaron Lun

Thanks, I have installed the latest version. I am wondering how to use BiocParallel for quickCluster with the igraph option. Also, is it normal to consume 60 to 80 GB of memory for a matrix of this size?

Reply written 9 weeks ago by sfpacman

Ah, the parallelization is only supported in the devel version at the moment; got my wires crossed. Currently we're transitioning to the SingleCellExperiment class, which is pretty hectic; so until the next release (next month, I think?) the current version of scran will not receive any new features.

As for the memory consumption, that is somewhat unusual, though not impossible, as the current version of scran (due to the limitations of ExpressionSet objects) represents all data inefficiently as dense matrices. The next version will provide proper support for sparse and file-backed matrices, so this should cease to be a major issue.
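For a rough sense of scale, the back-of-the-envelope arithmetic can be sketched as follows. It assumes the matrix is about 68,000 × 20,000 (as in the question) and that each entry is stored as an 8-byte double, which is how R stores numeric matrices; dense_matrix_gb is a made-up helper name, and the figure ignores any overhead beyond the raw values.

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_value=8):
    """Size in GB (10^9 bytes) of a dense matrix holding only its raw values."""
    return n_rows * n_cols * bytes_per_value / 1e9

# Dimensions taken from the question; 8 bytes per value is an assumption.
one_copy = dense_matrix_gb(68_000, 20_000)
print(round(one_copy, 2))  # 10.88
```

Under these assumptions a single dense copy is about 11 GB, so usage in the 60 to 80 GB range would correspond to several full copies of the matrix being held in memory at once.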

Reply written 9 weeks ago by Aaron Lun