2.1 years ago by
Cambridge, United Kingdom
You should be removing lowly or non-expressed genes prior to normalization (or any analysis, really). This is because the discreteness of the counts interferes with most downstream analyses. In the case of normalization, having lots of genes with low counts will break the median-based estimate of the cell-specific bias. In the most extreme case, if I have more than 50% of genes with a count of zero in a particular cell, then the median will be zero. This would be nonsensical to use as a size factor for that cell.
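To make the zero-median case concrete, here is a minimal Python sketch. It assumes a plain median-of-ratios estimator, which is a simplification and not what scran actually does internally; the function name, the counts and the filtering threshold of 1 are all made up for illustration.

```python
from statistics import median


def median_ratio_size_factor(cell_counts, reference_means):
    """Median across genes of (count in this cell) / (reference mean).

    Hypothetical helper for illustration only; genes with a reference
    mean of zero are skipped to avoid dividing by zero.
    """
    ratios = [c / m for c, m in zip(cell_counts, reference_means) if m > 0]
    return median(ratios)


# Made-up reference profile: average count of each gene across cells.
reference = [0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 50.0, 80.0, 120.0, 200.0]
# Made-up cell in which six of the ten genes have a count of zero.
cell = [0, 0, 0, 0, 0, 0, 45, 90, 110, 210]

# Using every gene: more than half of the ratios are zero, so the median is zero.
sf_all = median_ratio_size_factor(cell, reference)

# Keeping only genes with a reference mean of at least 1 (arbitrary threshold).
keep = [m >= 1 for m in reference]
sf_filtered = median_ratio_size_factor(
    [c for c, k in zip(cell, keep) if k],
    [m for m, k in zip(reference, keep) if k],
)

print("size factor, all genes:     ", sf_all)
print("size factor, filtered genes:", sf_filtered)
```

With all genes the estimate collapses to zero, whereas after filtering it sits close to 1, which is what you would expect for a cell whose high-abundance counts are close to the reference.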
More generally, it's difficult to get a good measure of the average bias when you have lots of low counts. With many low counts, the distribution of count ratios across genes looks more exponential-like than unimodal. For such distributions, the median won't perform well for normalization as it doesn't match up with the expected value of the cell-specific bias. A related issue is that the ratio of counts to the mean is more variable at low means than at high means, so if you flood the algorithm with lots of low-abundance genes, you may reduce the precision of the size factor estimates.
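Both points can be checked with a small calculation. The sketch below assumes, purely for illustration, that counts are Poisson-distributed and that the true bias is 1, so the expected value of count/mean is 1 for every gene. It then compares the median of that ratio, and its coefficient of variation, at a few made-up means.

```python
import math


def poisson_median(mu):
    """Smallest k such that P(X <= k) >= 0.5 for X ~ Poisson(mu)."""
    k = 0
    p = math.exp(-mu)  # P(X = 0)
    cumulative = p
    while cumulative < 0.5:
        k += 1
        p *= mu / k  # P(X = k) from P(X = k - 1)
        cumulative += p
    return k


summary = {}
for mu in (0.5, 1.5, 100.0):
    median_ratio = poisson_median(mu) / mu  # median of count / mean
    cv_ratio = 1 / math.sqrt(mu)            # sd of count / mean for a Poisson
    summary[mu] = (median_ratio, cv_ratio)
    print(f"mean={mu:6.1f}  median ratio={median_ratio:.3f}  CV={cv_ratio:.3f}")
```

Under these assumptions the median ratio only agrees with the expected value of 1 at the high mean, and the ratio is far noisier at the low means.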
In short, you should filter before you normalize - or really, before you do anything else. However, there's nothing stopping you from filtering first, normalizing, and then applying the size factors back to the unfiltered data with `sizeFactors<-`. In fact, in the devel version of scran, there is a `subset.row` argument in `computeSumFactors` with which you can select the genes to use for normalization. This should facilitate the procedure that I described previously, as it will return all genes in the output.
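The filter, normalize, then apply-back procedure can be sketched in a few lines of Python. This is not scran: `size_factors` and its `subset_rows` parameter are hypothetical stand-ins that only mimic the idea of selecting genes for estimation, the estimator is a simple median of ratios, and the counts are made up.

```python
from statistics import median


def size_factors(counts, subset_rows=None):
    """One size factor per cell from a genes-by-cells list of lists.

    Hypothetical helper: only the genes in `subset_rows` (all genes if
    None) are used to estimate the factors.
    """
    rows = list(range(len(counts))) if subset_rows is None else list(subset_rows)
    n_cells = len(counts[0])
    # Reference profile: mean count of each selected gene across cells.
    ref = {g: sum(counts[g]) / n_cells for g in rows}
    factors = []
    for j in range(n_cells):
        ratios = [counts[g][j] / ref[g] for g in rows if ref[g] > 0]
        factors.append(median(ratios))
    return factors


# Made-up data: five low-abundance genes, then three high-abundance genes.
counts = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
    [50, 100, 150],
    [20, 40, 60],
    [10, 20, 30],
]

# 1. Filter: pick genes with a mean count of at least 1 (arbitrary threshold).
keep = [g for g, row in enumerate(counts) if sum(row) / len(row) >= 1]

# 2. Normalize: estimate size factors from the retained genes only.
factors = size_factors(counts, subset_rows=keep)

# 3. Apply the factors back to the unfiltered data, so every gene is kept.
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]

print("genes used:", keep)
print("size factors:", factors)
print("genes in output:", len(normalized))
```

The low-abundance genes never influence the estimates, but they still appear in the normalized output.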
modified 2.1 years ago
2.1 years ago by
Aaron Lun • 21k