Question: How to properly normalize a single-cell sparse matrix?
gravatar for ssabri
16 months ago by
ssabri10 wrote:

I am having issues trying to properly normalize single-cell Drop-seq data. My Drop-seq data consists of single-cells (columns) with very sparse transcript gene counts (rows). A summary of the number of non-zero genes (rows) per cell (column) is posted below: 

> summary(colSums(data != 0))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1969    3388    3956    4256    4828   11280 ​

Going off this, there are on average about 75% of zero transcript genes per cell (17k genes total). It has already been noted that methods such as DEseq and limma/voom are not well suited to handle zero-dominated matrices (A: Dependence of rlog transformed value range on number of samples and A: modeling zero-dominated RNA-seq with voom/limma and hurdle models (pscl)). I have tried a very naive method of normalizing but I'm not sure if this is correct in  single-cell practice. My current method of normalizing looks something like this: 

# filter low expressed genes
filter_genes <- function(x, n_cells_express=5, express_threshold=1){
  temp = x[rowSums(x>express_threshold) >= n_cells_express,]
data = filter_genes(data, n_cells_express=2, express_threshold=1)
# normalize by transcript number (i.e. ln(transcripts-per-10,000 + 1))
size_factor = apply(data, 2, function(x) sum(x))
data = log2((sweep(data, 2, size_factor, "/")*10000)+1)

Here I am trying to normalize for cell depth (# of transcripts or column sum), then log2 scale. I should mention that I preform this normalization step after filtering out very rare and lowly expressed genes (approximately 5000/20000 genes).

I am posting to get thoughts on if this is the correct way to normalize sparse, single-cell data? Perhaps there are better or more elaborate ways of normalizing? I would greatly appreciate any help. Thank you.  


ADD COMMENTlink modified 16 months ago by Aaron Lun18k • written 16 months ago by ssabri10
gravatar for Aaron Lun
16 months ago by
Aaron Lun18k
Cambridge, United Kingdom
Aaron Lun18k wrote:

Oddly enough, we have developed a method for normalizing single-cell RNA-seq data that seems applicable to your situation. It is implemented in the scran package, and some use cases are described in a recent workflow.

ADD COMMENTlink written 16 months ago by Aaron Lun18k

Hmm I'll check out the paper! Looks interesting but we do not work with ERCC spike-in controls. Can your method still be applicable?  

ADD REPLYlink written 16 months ago by ssabri10

Yes, the normalization method can be applied in situations without spike-in controls.

ADD REPLYlink written 16 months ago by Aaron Lun18k
gravatar for Simon Anders
16 months ago by
Simon Anders3.4k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.4k wrote:

Two suggestions:

1. MA plots are your friend. Pick to cells at random, plot all genes, always  log2 fold change against the log geometric mean and see if you can pick by eye where you would put a horizontal line to make zero log fold change. If so, you know what you try to find an automatic and objective procedure for. If you can't even say by eye, things will be very difficult.

2. The DESeq normalization procedure breaks down because the "virtual reference" we construct is made up of the geometric means of the genes over all samples. This will be zero for any gene that has a zero in at least one cell. This is likely to happen not only because your cells have many zeroes but also because you have very many cells

If you just compare two cells at a time, there might be enough genes with non-zero values in both cells to get a factor to compare these two cells: With k_ij the counts for gene i in cell j, a good factor to normalize between cells j and j' might be:

s_jj' = median_i k_ij / k_ij',

where we let the median only run over those genes i for which k_ij and k_ij' are both not zero.

Problem is that this will give us a matrix of size factors s_jj' only suitable for pairwise comparisons, i.e. comparisons between only two cells at a time. We want a vector with one factor s_j per cell. Maybe one could average:

s_j = mean_j' s_jj'       (j' != j)

Of course we want these averages still to be consistent with the pairwise values, i.e., we want: s_j / s_j' s_jj'. That's easy to check, though.

Not sure whether a simple mean will do the trick but it's worth a try. Otherwise, an iterative procedure might do the trick.

In any case, I think that first making pairs of cells comparable and only then summerising the size factors should be the way to go.

What do the others think?

ADD COMMENTlink modified 16 months ago • written 16 months ago by Simon Anders3.4k

So, I've sampled two cells at random and made a few MA plots as you've suggested and have linked to the results below. 

y = raw_data[, c("RandCellA","RandCellB")]
plot(x=rowMeans(y), y=y[,1]-y[,2]) # (plot A, linked below)
plot(x=rowMeans(log2(y)), y=log2(y[,1])-log2(y[,2])) # (plot B, linked below)

# Now with the normalized data - looks similar to plot B
y = normalized_data[, c("RandCellA","RandCellB")]
x=rowMeans(y), y=y[,1]-y[,2] # (plot C, linked below)

Plots are linked here.

Does this mean that my normalization is okay? 




ADD REPLYlink written 16 months ago by ssabri10

Don't think so. In your last plot, the bulk of genes with medium to high expression is clearly below the zero line.

Maybe try Aaron's method.

ADD REPLYlink written 16 months ago by Simon Anders3.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 157 users visited in the last hour