Question

Does the normalization pipeline for scran include filtering out genes with low counts?

amckenz • 0

@amckenz-11264

I am asking a question about scran, described in this paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7

In the GitHub code, low-expression genes were removed prior to running the normalization pipeline in the Zeisel et al. analysis. However, as I read it, this was not emphasized in the manuscript, so it may have been done only to allow a comparison with the other normalization methods applied to that data set.

So my question: Is it necessary to remove genes with zero counts prior to using the computeSumFactors function in scran to normalize data? I would prefer not to do this if possible, so that I can compare the same genes across data sets.

filtering scran • 1.0k views

updated 7.7 years ago by Aaron Lun • written 7.7 years ago by amckenz

score 0 · Answer 1 · 2016-08-09

You should be removing lowly or non-expressed genes prior to normalization (or any analysis, really). This is because the discreteness of the counts interferes with most downstream analyses. In the case of normalization, having lots of genes with low counts will break the median-based estimate of the cell-specific bias. In the most extreme case, if I have more than 50% of genes with a count of zero in a particular cell, then the median will be zero. This would be nonsensical to use as a size factor for that cell.
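To make the extreme case concrete, here is a minimal sketch in plain Python. It is only an illustration of a median-based estimate, not scran's actual implementation; the counts, the per-gene averages, and the filtering threshold of 1 are all made-up values.

```python
from statistics import median

# Hypothetical counts for one cell across 10 genes; 6 of the 10 are zero.
cell_counts = [0, 0, 0, 0, 0, 0, 3, 8, 20, 45]
# Hypothetical average count of each gene across all cells (the reference).
gene_means = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 2.0, 5.0, 12.0, 30.0]


def median_ratio(counts, means):
    """Median of the per-gene count/mean ratios (a simple bias estimate)."""
    return median(c / m for c, m in zip(counts, means))


# Using every gene: more than half of the ratios are zero.
unfiltered = median_ratio(cell_counts, gene_means)

# Keep only genes whose average count reaches the assumed threshold of 1.
keep = [m >= 1.0 for m in gene_means]
filtered = median_ratio(
    [c for c, k in zip(cell_counts, keep) if k],
    [m for m, k in zip(gene_means, keep) if k],
)

print("median ratio, all genes:     ", unfiltered)
print("median ratio, filtered genes:", filtered)
```

With all genes, the estimate collapses to zero because most ratios are zero; after dropping the low-abundance genes, the median lands among the informative ratios.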

More generally, it's difficult to get a good measure of the average bias when you have lots of low counts. With many low counts, the distribution looks more exponential-like than unimodal. For such distributions, the median won't perform well for normalization, as it doesn't match up with the expected value of the cell-specific bias. A related issue is that the ratio of counts to the mean is more variable at low means than at high means; so if you flood the algorithm with lots of low-abundance genes, it may reduce the precision of the size factor estimates.
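Both points can be checked numerically under a simplifying assumption that counts are Poisson-distributed (real single-cell counts are usually overdispersed, so this is an optimistic case). The sketch below, in plain Python with hypothetical helper names, computes the median of a Poisson count and the standard deviation of the count-to-mean ratio at a low and a high mean.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of count k at mean mu, computed on the log scale."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def poisson_median(mu):
    """Smallest count whose cumulative probability reaches 0.5."""
    cumulative, k = 0.0, 0
    while True:
        cumulative += poisson_pmf(k, mu)
        if cumulative >= 0.5:
            return k
        k += 1


def ratio_sd(mu):
    """Standard deviation of count/mean for a Poisson count with mean mu."""
    upper = int(mu + 20 * math.sqrt(mu) + 20)  # far enough into the tail
    var = sum(poisson_pmf(k, mu) * (k / mu - 1.0) ** 2 for k in range(upper + 1))
    return math.sqrt(var)


for mu in (0.5, 100.0):
    print(f"mean={mu}: median count={poisson_median(mu)}, "
          f"sd of count/mean={ratio_sd(mu):.3f}")
```

Under this assumption, the median count at a mean of 0.5 is zero even though the expected value is not, and the spread of the ratio shrinks as roughly one over the square root of the mean, so low-abundance genes contribute much noisier ratios.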

In short, you should filter before you normalize - or really, before you do anything else. However, there's nothing stopping you from filtering first, normalizing, and then applying the size factors back to the unfiltered data with sizeFactors<-. In fact, in the devel version of scran, there is a subset.row argument in computeSumFactors with which you can select the genes to use for normalization. This should facilitate the procedure that I described previously, as it will return all genes in the output SCESet object.
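The filter, normalize, apply-back procedure can be sketched as follows. This is plain Python with made-up data, and the `size_factors` function is a hypothetical stand-in (a per-cell median of count-to-mean ratios), not the pooling approach that computeSumFactors uses. In R, the estimation step would be done with computeSumFactors and the result assigned with sizeFactors<-, as described above.

```python
from statistics import median

# Hypothetical count matrix: one row per gene, one column per cell.
counts = {
    "geneA": [0, 1, 0],
    "geneB": [0, 0, 1],
    "geneC": [10, 20, 40],
    "geneD": [5, 10, 20],
    "geneE": [20, 40, 80],
}
n_cells = 3


def size_factors(matrix, genes):
    """Stand-in estimator (NOT scran's method): median count/mean ratio per cell."""
    means = {g: sum(matrix[g]) / len(matrix[g]) for g in genes}
    factors = [median(matrix[g][j] / means[g] for g in genes) for j in range(n_cells)]
    centre = sum(factors) / len(factors)
    return [f / centre for f in factors]  # scaled so the factors average to 1


# 1. Filter: pick the genes used for normalization (assumed threshold: mean >= 1).
subset = [g for g, row in counts.items() if sum(row) / len(row) >= 1.0]

# 2. Normalize: estimate size factors from the filtered genes only.
sf = size_factors(counts, subset)

# 3. Apply back: scale every gene, including the ones that were filtered out.
normalized = {g: [c / f for c, f in zip(row, sf)] for g, row in counts.items()}

print("genes used for normalization:", subset)
print("size factors:", sf)
print("genes in normalized output:", list(normalized))
```

The low-abundance genes stay in the output, so the same gene set can be compared across data sets; they are simply excluded from the estimation of the size factors.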