Question

Feature selection across batches in simpleSingleCell "Correcting batch effects" vignette

Angelos Armen • 0

@angelos-armen-21507

United Kingdom

In the "Feature selection across batches" section of the simpleSingleCell "Correcting batch effects" vignette, genes with a positive average biological variance (averaged across batches) are selected. What is the reasoning behind that? Why not instead select genes with a positive biological variance in any batch?

simpleSingleCell • 1.1k views

updated 6.5 years ago by Aaron Lun ★ 29k • written 6.5 years ago by Angelos Armen • 0

score 1 · Accepted Answer · 2019-08-15

Consider a gene for which the null hypothesis is true, i.e., there is no biological variability, so the total variance is equal to the technical component determined by the mean-variance trend. The estimate of the variance, however, will fluctuate around the true value, meaning that this gene will have a positive biological component ~50% of the time.

For an analysis of a single batch, that's fine - retaining some of these uninteresting genes is part of the cost we have to pay for retaining as much biological signal as possible. However, this adds up prohibitively for multiple batches. For example, if we had 3 independent batches, a null gene would get a positive biological component in at least one batch ~88% of the time (1 - 0.5^3). With enough batches, nearly every gene would get a positive biological component in some batch just by chance and be retained.
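
To make the arithmetic concrete, here is a minimal Python sketch of how the "positive in at least one batch" probability grows with the number of batches. It assumes the batches are independent and that a null gene has a 50% chance of a positive estimate in each one; the function name `prob_positive_in_any_batch` is made up for illustration and is not part of scran or the vignette.

```python
def prob_positive_in_any_batch(n_batches, p_single=0.5):
    """Chance that a null gene gets a positive biological component
    in at least one of n_batches batches.

    Assumes independent batches and a per-batch chance of p_single
    (0.5 for a null gene whose estimate is symmetric around zero).
    """
    # Complement of "negative in every batch".
    return 1.0 - (1.0 - p_single) ** n_batches


for n in (1, 2, 3, 5, 10):
    print(f"{n:2d} batches: {prob_positive_in_any_batch(n):.3f}")
```

Under these assumptions the value for 3 batches is 1 - 0.5^3 = 0.875, and it approaches 1 as more batches are added.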

Taking the average biological component aims to mitigate this effect. If a gene is genuinely highly variable in at least one batch, it will have a high biological component in that batch. Then, the chances are good that the average biological component will be positive and the gene will be retained. However, if a gene is null in all batches, the average component is still positive only ~50% of the time, so no harm done.
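
A toy simulation can illustrate the difference between the two selection rules. This is only a sketch under simplifying assumptions: the estimated biological component in each batch is modelled as the true value plus independent Gaussian noise, which is not how the vignette actually estimates variances. The function `simulate_selection` and its parameters (`true_bio`, `noise_sd`) are hypothetical names chosen for this example.

```python
import random


def simulate_selection(n_batches, true_bio, n_genes=20000, noise_sd=1.0, seed=0):
    """Return the fraction of simulated genes kept by each rule.

    Each gene gets one estimated biological component per batch:
    Gaussian noise around zero (an assumed noise model), plus
    true_bio added in the first batch only.
    """
    rng = random.Random(seed)
    kept_any = 0
    kept_mean = 0
    for _ in range(n_genes):
        comps = [rng.gauss(0.0, noise_sd) for _ in range(n_batches)]
        comps[0] += true_bio  # genuinely variable in one batch only
        if max(comps) > 0:  # rule 1: positive in any batch
            kept_any += 1
        if sum(comps) / n_batches > 0:  # rule 2: positive on average
            kept_mean += 1
    return kept_any / n_genes, kept_mean / n_genes


# Null genes: no biological variability in any batch.
null_any, null_mean = simulate_selection(n_batches=3, true_bio=0.0)
# Genes that are strongly variable in one of the three batches.
hvg_any, hvg_mean = simulate_selection(n_batches=3, true_bio=5.0)

print(f"null genes kept: any-batch rule {null_any:.3f}, average rule {null_mean:.3f}")
print(f"variable genes kept: any-batch rule {hvg_any:.3f}, average rule {hvg_mean:.3f}")
```

In this toy setting the average rule keeps null genes about half the time regardless of the number of batches, while still keeping almost all genes with a strong signal in a single batch. How well that carries over to real data depends on how large the true biological component is relative to the estimation noise.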