
I am having issues trying to properly normalize single-cell Drop-seq data. My Drop-seq data consists of single cells (columns) with very sparse gene transcript counts (rows). A summary of the number of non-zero genes (rows) per cell (column) is posted below:

```
> summary(colSums(data != 0))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1969    3388    3956    4256    4828   11280
```

Going off this, on average roughly 75% of genes have zero counts in a given cell (~17k genes total). It has already been noted that methods such as DESeq and limma/voom are not well suited to zero-dominated matrices (A: Dependence of rlog transformed value range on number of samples and A: modeling zero-dominated RNA-seq with voom/limma and hurdle models (pscl)). I have tried a fairly naive normalization, but I'm not sure whether it is correct practice for single-cell data. My current method looks something like this:

```r
############--------------------------------------------############
# filter lowly expressed genes: keep genes with a count above
# `express_threshold` in at least `n_cells_express` cells
filter_genes <- function(x, n_cells_express = 5, express_threshold = 1) {
  x[rowSums(x > express_threshold) >= n_cells_express, ]
}
data <- filter_genes(data, n_cells_express = 2, express_threshold = 1)
```

```r
############--------------------------------------------############
# normalize by per-cell transcript count, i.e. log2(transcripts-per-10,000 + 1)
size_factor <- colSums(data)  # equivalent to apply(data, 2, function(x) sum(x))
data <- log2(sweep(data, 2, size_factor, "/") * 10000 + 1)
```
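One aside on the choice of log base: log2(TP10K + 1) and the natural-log convention ln(TP10K + 1) (used by, e.g., Seurat's default LogNormalize) differ only by the constant factor ln 2, so the choice mostly affects interpretability, not the shape of the data. A minimal sketch on a made-up per-cell expression vector (already scaled to transcripts-per-10,000):

```r
# Base choice is a constant rescaling: log2(x + 1) = log1p(x) / log(2).
# Toy TP10K-scaled expression values for one cell (hypothetical numbers):
tp10k <- c(0, 12.5, 430, 5000)

log2_version <- log2(tp10k + 1)
ln_version   <- log1p(tp10k)  # ln(TP10K + 1); log1p is precise for small x

# the two transforms differ only by the constant factor log(2)
stopifnot(isTRUE(all.equal(log2_version, ln_version / log(2))))
```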

Here I am trying to normalize for cell depth (total transcripts per cell, i.e. the column sum), then log2-transform. I should mention that I perform this normalization step after filtering out very rare and lowly expressed genes (removing approximately 5,000 of 20,000 genes).

I am posting to get thoughts on whether this is a correct way to normalize sparse single-cell data. Perhaps there are better or more principled approaches? I would greatly appreciate any help. Thank you.