scran: excluding genes with extremely high expression during normalization?
@danielborshagovski-17879

Hi everyone,

I would like to know whether excluding genes with extremely high expression is suitable for scran normalization in single-cell RNA-seq analysis. This type of exclusion is used in the library-size normalization of the SPRING tool (only genes making up <5% of total counts in every cell are used for normalization).

This type of exclusion might help with datasets where some cell types have a few dominant genes making up most of the counts, like secretory cells such as pancreatic beta cells or Paneth cells of the small intestinal epithelium.
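As a rough illustration of the exclusion rule described above, here is a minimal sketch in plain Python. It is only one reading of "genes making up <5% of total counts in every cell" and is not taken from SPRING's code; the function name `genes_below_fraction`, the `max_frac` parameter, and the toy counts are all made up for the example.

```python
def genes_below_fraction(counts, max_frac=0.05):
    """Return indices of genes whose share of a cell's total counts
    is below max_frac in every cell.

    counts[c][g] is the count of gene g in cell c (hypothetical layout).
    """
    totals = [sum(cell) for cell in counts]
    n_genes = len(counts[0])
    keep = []
    for g in range(n_genes):
        if all(cell[g] / total < max_frac for cell, total in zip(counts, totals)):
            keep.append(g)
    return keep


# Toy data: 30 background genes at 10 counts each, plus one gene (index 30)
# that is low in the first cell and dominant in the second.
cell_1 = [10] * 30 + [5]
cell_2 = [10] * 30 + [700]
counts = [cell_1, cell_2]

keep = genes_below_fraction(counts)

# Library sizes from all genes versus only the retained genes.
full_lib = [sum(cell) for cell in counts]
restricted_lib = [sum(cell[g] for g in keep) for cell in counts]

print(keep == list(range(30)))   # the dominant gene (index 30) is dropped
print(full_lib, restricted_lib)
```

In this toy case the two cells have identical counts for the retained genes, so the restricted library sizes are equal even though the full library sizes differ more than threefold.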

An example is the Haber et al. 2017 study of the small intestinal epithelium (doi:10.1038/nature24489). In the full-length atlas dataset (count matrix downloaded from ebi.ac.uk/gxa/sc), over 50% of counts in Paneth cells come from just 10 genes (judging by the scater QC function "plotHighestExprs"). In this dataset, the CPM values for most genes encoding early secretory pathway components (e.g. Golgi and secretory vesicle proteins) are lower in Paneth cells than in intestinal stem cells. This seems unrealistic, as the Paneth cell is a large secretory cell with a larger Golgi apparatus and many more secretory vesicles than the stem cells. If the 10 most highly expressed genes in Paneth cells are simply removed from the count matrix, the CPM values for most genes encoding early secretory pathway components become higher in the Paneth cells than in the stem cells.
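The arithmetic behind this reversal can be shown with a toy calculation. The gene names and counts below are invented for illustration and are not values from the Haber et al. dataset; the helper `cpm` is likewise hypothetical.

```python
def cpm(cell, gene, exclude=()):
    """Counts per million for one gene, with the library size computed
    after dropping any genes listed in exclude."""
    total = sum(count for name, count in cell.items() if name not in exclude)
    return cell[gene] / total * 1e6


# Made-up counts: the secretory gene is twice as abundant in the Paneth cell,
# but two dominant genes take up most of that cell's counts.
stem = {"dominant_1": 0, "dominant_2": 0, "secretory": 100, "hk_1": 450, "hk_2": 450}
paneth = {"dominant_1": 5000, "dominant_2": 3000, "secretory": 200, "hk_1": 450, "hk_2": 450}

dominant = ("dominant_1", "dominant_2")

cpm_all = {"stem": cpm(stem, "secretory"), "paneth": cpm(paneth, "secretory")}
cpm_filtered = {
    "stem": cpm(stem, "secretory", exclude=dominant),
    "paneth": cpm(paneth, "secretory", exclude=dominant),
}

print(cpm_all)       # secretory gene looks lower in the Paneth cell
print(cpm_filtered)  # and higher once the dominant genes are excluded
```

With all genes included, the dominant genes inflate the Paneth library size and push the secretory gene's CPM below the stem cell value; with them excluded, the ordering flips.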

Could the exclusion of genes during normalization be performed by providing a whitelist of genes with the "subset.row" argument for the function "computeSumFactors"?

Best regards, Daniel

single-cell RNA-seq scran normalization
Aaron Lun ★ 28k
@alun

I'll address the broad premise of your question first. A small number of genes with large counts is not a problem for normalization. In every bulk or single-cell data set, you will see the "usual suspects" show up among the most highly expressed genes, e.g., ribosomal protein genes, MALAT1, actin B, histone components. It's not a big deal, because each of them just counts as one gene in median-based normalization strategies, regardless of how highly it is expressed; they won't overly influence the results.

The real problem is whether the highly expressed genes are strongly DE between cell types, states, or whatever. The situation that you've described here is exactly the intended use case of computeSumFactors(), i.e., a small number of highly DE genes that introduce composition biases in all other (non-DE) genes. This is the raison d'être of robust normalization strategies compared to simpler approaches based on library size. With computeSumFactors(), there is no need to explicitly remove the offending genes, because they are considered to be outliers and are ignored when computing the relative biases between cells. This is especially important when you don't know which genes are the problematic ones ahead of time.
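To illustrate the robustness argument, here is a toy comparison of a library-size ratio against a median-of-ratios estimate for two cells. This is not the pooling and deconvolution procedure that computeSumFactors() actually uses; it is a much simpler stand-in for the same idea (each gene gets one vote, so a few strongly DE genes are effectively ignored), and all counts are made up.

```python
from statistics import median

# 20 non-DE genes: cell B has exactly twice the counts of cell A,
# so the true scaling bias between the two cells is 2.
non_de_a = list(range(10, 210, 10))
non_de_b = [2 * x for x in non_de_a]

# 2 strongly DE genes: barely detected in A, dominant in B.
de_a = [5, 5]
de_b = [5000, 5000]

cell_a = non_de_a + de_a
cell_b = non_de_b + de_b

# Library-size ratio: dominated by the two DE genes.
lib_factor = sum(cell_b) / sum(cell_a)

# Median of per-gene ratios: the DE genes are just two votes out of 22.
ratios = [b / a for a, b in zip(cell_a, cell_b)]
median_factor = median(ratios)

print(round(lib_factor, 2), median_factor)
```

The library-size ratio comes out well above the true bias of 2 because the DE genes inflate cell B's total, while the median of the per-gene ratios recovers 2 without needing to know which genes are DE.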

Of course, if you do know which genes are the problematic ones, then you can just get rid of them, and make life easier for everyone. Even computeSumFactors() will be slightly more accurate if you were to do so; compared to robustifying a procedure against a problem, it's always better to just not have the problem in the first place. However, the identity of the problematic genes is not usually known in advance, e.g., if you know that much about the DE in the system, why do you need to do another experiment? And it's also a pain to have to refilter out those genes every time you apply your analysis pipeline to a new system; it's just logistically easier to have a single robust normalization step that you can throw at anything.

tl;dr You can try it with and without those genes, but I don't think it'll make a major difference.