Question: Does the normalization pipeline for scran include filtering out genes with low counts?
18 months ago, amckenz wrote:

I am asking a question about scran, described in this paper:

In the GitHub code for the Zeisel et al. analysis, low-expression genes were removed before running the normalization pipeline. However, in my reading this was not emphasized in the manuscript, so the filtering may have been done only to allow comparison with the other normalization methods applied to that data set.

So my question: is it necessary to remove genes with zero counts prior to using the computeSumFactors function in scran to normalize data? I would prefer not to do this if possible, so that I can compare genes across data sets.

18 months ago, Aaron Lun (Cambridge, United Kingdom) wrote:

You should be removing lowly or non-expressed genes prior to normalization (or any analysis, really). This is because the discreteness of the counts interferes with most downstream analyses. In the case of normalization, having lots of genes with low counts will break the median-based estimate of the cell-specific bias. In the most extreme case, if I have more than 50% of genes with a count of zero in a particular cell, then the median will be zero. This would be nonsensical to use as a size factor for that cell.
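To make the extreme case concrete, here is a minimal sketch in plain Python. It is not scran's code (scran pools cells and deconvolves the pooled estimates); it assumes a simplified median-of-ratios estimator, and the function name median_ratio_size_factor, the toy counts, and the filtering threshold are all made up for illustration.

```python
from statistics import median

def median_ratio_size_factor(cell_counts, reference_means):
    """Median of per-gene count/reference ratios for one cell (simplified)."""
    ratios = [c / m for c, m in zip(cell_counts, reference_means) if m > 0]
    return median(ratios)

# Hypothetical reference profile: average count of each gene across cells.
reference_means = [0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 50.0, 80.0, 120.0]

# One cell in which six of nine genes (more than 50%) have a count of zero.
cell_counts = [0, 0, 0, 0, 0, 0, 45, 90, 110]

# Using every gene: the median ratio is zero, which is useless as a size factor.
all_genes = median_ratio_size_factor(cell_counts, reference_means)

# Keep only genes whose reference mean is at least 1 (an arbitrary threshold).
keep = [m >= 1 for m in reference_means]
filtered = median_ratio_size_factor(
    [c for c, k in zip(cell_counts, keep) if k],
    [m for m, k in zip(reference_means, keep) if k],
)

print(all_genes)  # 0.0
print(filtered)   # roughly 0.92, a usable size factor
```

With the low-abundance genes removed, the median is taken over ratios that actually carry information about the cell's bias.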

More generally, it's difficult to get a good measure of the average bias when you have lots of low counts. Having many low counts makes the distribution look more exponential-like rather than unimodal. For such distributions, the median won't perform well for normalization as it doesn't match up with the expected value of the cell-specific bias. A related issue is that the ratio of counts to the mean is more variable at low means than at high means, so if you flood the algorithm with lots of low-abundance genes, it may reduce the precision of the size factor estimates.
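The point about variability at low means can be checked with a small simulation. This sketch assumes Poisson-distributed counts, which is my assumption for illustration and not something stated above; the helper names poisson and ratio_cv are hypothetical. Under that assumption the coefficient of variation of count/mean should be close to 1/sqrt(mean).

```python
import random
from statistics import mean, stdev

def poisson(lam, rng):
    """Draw one Poisson variate by counting unit-rate arrivals before time lam."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def ratio_cv(true_mean, n_draws, rng):
    """Coefficient of variation of count/true_mean over simulated draws."""
    ratios = [poisson(true_mean, rng) / true_mean for _ in range(n_draws)]
    return stdev(ratios) / mean(ratios)

rng = random.Random(0)  # fixed seed so the run is reproducible
low_cv = ratio_cv(0.5, 5000, rng)     # low-abundance gene
high_cv = ratio_cv(100.0, 5000, rng)  # high-abundance gene

print(f"CV of count/mean at mean 0.5: {low_cv:.2f}")
print(f"CV of count/mean at mean 100: {high_cv:.2f}")
```

The ratio for the low-abundance gene is far noisier, so a size factor estimate dominated by such genes inherits that noise.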

In short, you should filter before you normalize - or really, before you do anything else. However, there's nothing stopping you from filtering first, normalizing, and then applying the size factors back to the unfiltered data with sizeFactors<-. In fact, in the devel version of scran, there is a subset.row argument in computeSumFactors with which you can select the genes to use for normalization. This should facilitate the procedure that I described previously, as it will return all genes in the output SCESet object.
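The filter, normalize, then apply-back workflow can be sketched as follows in plain Python. This is only an illustration of the order of operations: the size factor step is a simplified median-of-ratios stand-in, not scran's computeSumFactors, and the gene names, counts, and cutoff are invented.

```python
from statistics import mean, median

# Toy count matrix: rows are genes, columns are cells (all values made up).
counts = {
    "geneA": [0, 1, 0, 0],
    "geneB": [0, 0, 0, 1],
    "geneC": [20, 40, 10, 30],
    "geneD": [50, 100, 25, 75],
    "geneE": [8, 16, 4, 12],
}
n_cells = 4

# Step 1: choose genes for normalization (mean count >= 1 is an arbitrary cutoff).
keep = [g for g, row in counts.items() if mean(row) >= 1]

# Step 2: estimate one size factor per cell from the kept genes only.
# Simplified median-of-ratios stand-in, NOT scran's deconvolution method.
size_factors = [
    median(counts[g][j] / mean(counts[g]) for g in keep)
    for j in range(n_cells)
]

# Step 3: apply those size factors to every gene, including the ones
# that were excluded from the estimation step.
normalized = {
    g: [c / sf for c, sf in zip(row, size_factors)]
    for g, row in counts.items()
}

print(keep)
print(size_factors)
print(normalized["geneA"])
```

The low-count genes (geneA and geneB here) never influence the size factors, yet they still appear in the normalized output, so they remain available for comparison across data sets.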
