How to properly normalize a single-cell sparse matrix?
Entering edit mode
ssabri ▴ 20
Last seen 2.1 years ago

I am having issues trying to properly normalize single-cell Drop-seq data. My Drop-seq data consists of single-cells (columns) with very sparse transcript gene counts (rows). A summary of the number of non-zero genes (rows) per cell (column) is posted below: 

> summary(colSums(data != 0))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1969    3388    3956    4256    4828   11280 ​

Going off this, there are on average about 75% of zero transcript genes per cell (17k genes total). It has already been noted that methods such as DEseq and limma/voom are not well suited to handle zero-dominated matrices (A: Dependence of rlog transformed value range on number of samples and A: modeling zero-dominated RNA-seq with voom/limma and hurdle models (pscl)). I have tried a very naive method of normalizing but I'm not sure if this is correct in  single-cell practice. My current method of normalizing looks something like this: 

# filter low expressed genes
filter_genes <- function(x, n_cells_express=5, express_threshold=1){
  temp = x[rowSums(x>express_threshold) >= n_cells_express,]
data = filter_genes(data, n_cells_express=2, express_threshold=1)
# normalize by transcript number (i.e. ln(transcripts-per-10,000 + 1))
size_factor = apply(data, 2, function(x) sum(x))
data = log2((sweep(data, 2, size_factor, "/")*10000)+1)

Here I am trying to normalize for cell depth (# of transcripts or column sum), then log2 scale. I should mention that I preform this normalization step after filtering out very rare and lowly expressed genes (approximately 5000/20000 genes).

I am posting to get thoughts on if this is the correct way to normalize sparse, single-cell data? Perhaps there are better or more elaborate ways of normalizing? I would greatly appreciate any help. Thank you.  


Dropseq single-cell normalization sparse • 2.4k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 1 hour ago
The city by the bay

Oddly enough, we have developed a method for normalizing single-cell RNA-seq data that seems applicable to your situation. It is implemented in the scran package, and some use cases are described in a recent workflow.

Entering edit mode

Hmm I'll check out the paper! Looks interesting but we do not work with ERCC spike-in controls. Can your method still be applicable?  

Entering edit mode

Yes, the normalization method can be applied in situations without spike-in controls.

Entering edit mode
Simon Anders ★ 3.7k
Last seen 22 months ago
Zentrum für Molekularbiologie, Universi…

Two suggestions:

1. MA plots are your friend. Pick to cells at random, plot all genes, always  log2 fold change against the log geometric mean and see if you can pick by eye where you would put a horizontal line to make zero log fold change. If so, you know what you try to find an automatic and objective procedure for. If you can't even say by eye, things will be very difficult.

2. The DESeq normalization procedure breaks down because the "virtual reference" we construct is made up of the geometric means of the genes over all samples. This will be zero for any gene that has a zero in at least one cell. This is likely to happen not only because your cells have many zeroes but also because you have very many cells

If you just compare two cells at a time, there might be enough genes with non-zero values in both cells to get a factor to compare these two cells: With k_ij the counts for gene i in cell j, a good factor to normalize between cells j and j' might be:

s_jj' = median_i k_ij / k_ij',

where we let the median only run over those genes i for which k_ij and k_ij' are both not zero.

Problem is that this will give us a matrix of size factors s_jj' only suitable for pairwise comparisons, i.e. comparisons between only two cells at a time. We want a vector with one factor s_j per cell. Maybe one could average:

s_j = mean_j' s_jj'       (j' != j)

Of course we want these averages still to be consistent with the pairwise values, i.e., we want: s_j / s_j' s_jj'. That's easy to check, though.

Not sure whether a simple mean will do the trick but it's worth a try. Otherwise, an iterative procedure might do the trick.

In any case, I think that first making pairs of cells comparable and only then summerising the size factors should be the way to go.

What do the others think?

Entering edit mode

So, I've sampled two cells at random and made a few MA plots as you've suggested and have linked to the results below. 

y = raw_data[, c("RandCellA","RandCellB")]
plot(x=rowMeans(y), y=y[,1]-y[,2]) # (plot A, linked below)
plot(x=rowMeans(log2(y)), y=log2(y[,1])-log2(y[,2])) # (plot B, linked below)

# Now with the normalized data - looks similar to plot B
y = normalized_data[, c("RandCellA","RandCellB")]
x=rowMeans(y), y=y[,1]-y[,2] # (plot C, linked below)

Plots are linked here.

Does this mean that my normalization is okay? 




Entering edit mode

Don't think so. In your last plot, the bulk of genes with medium to high expression is clearly below the zero line.

Maybe try Aaron's method.


Login before adding your answer.

Traffic: 145 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6