Question: Feature selection across batches in simpleSingleCell "Correcting batch effects" vignette
4 weeks ago, Angelos Armen wrote:

In the "Feature selection across batches" section of the simpleSingleCell "Correcting batch effects" vignette, genes with a positive average (across batches) biological variance are selected. What is the reasoning behind that? Why not select genes with a positive biological variance in any batch instead?

Answer: Feature selection across batches in simpleSingleCell "Correcting batch effects"
4 weeks ago, Aaron Lun (Cambridge, United Kingdom) wrote:

Consider a gene for which the null hypothesis is true, i.e., there is no biological variability, so the total variance is equal to the technical component determined by the mean-variance trend. The estimate of the variance, however, will fluctuate around the true value, meaning that this gene will have a positive biological component ~50% of the time.

For an analysis of a single batch, that's fine: retaining some of these uninteresting genes is the price we pay for keeping as much biological signal as possible. However, this adds up prohibitively across multiple batches. For example, if we had 3 batches, a null gene would get a positive biological component in at least one batch ~88% of the time (1 - 0.5^3, assuming the batches are independent). With enough batches, almost every gene would get a positive biological component in at least one batch just by chance and be retained.
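
To make the arithmetic concrete, here is a minimal Python sketch of how quickly the "positive in any batch" rule admits null genes. It assumes that each batch independently gives a null gene a positive biological component with probability 0.5; the function name `prob_positive_in_any_batch` is made up for illustration and is not part of any package.

```python
def prob_positive_in_any_batch(n_batches, p_single=0.5):
    """Probability that a null gene has a positive biological component
    in at least one of n_batches batches.

    Assumption: batches are independent and each one yields a positive
    component with probability p_single (0.5 for a null gene).
    """
    # Complement of "non-positive in every batch".
    return 1.0 - (1.0 - p_single) ** n_batches


# How the false-retention rate grows with the number of batches.
for n in (1, 2, 3, 5, 10):
    print(f"{n} batches: {prob_positive_in_any_batch(n):.4f}")
```

With 3 batches this gives 0.875, and the value approaches 1 as more batches are added.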

Taking the average biological component aims to mitigate this effect. If a gene is genuinely highly variable in at least one batch, it will have a large biological component in that batch, so the chances are good that the average biological component will be positive and the gene will be retained. However, if a gene is null in all batches, the average component is still likely to be positive only ~50% of the time, so no harm done.
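
The contrast between the two selection rules can also be checked with a small simulation. This is only a rough sketch and not the vignette's code: it assumes the estimated biological component of a null gene is symmetric noise around zero (standard normal here, purely for convenience) and independent across batches. The function name `simulate_null_gene` and its parameters are hypothetical.

```python
import random


def simulate_null_gene(n_batches, n_genes=20000, seed=0):
    """Fraction of null genes retained under two selection rules.

    Assumption: each batch's estimated biological component for a null
    gene is drawn independently from a standard normal centred on zero.
    Returns (fraction positive in any batch, fraction with positive mean).
    """
    rng = random.Random(seed)
    any_positive = 0
    mean_positive = 0
    for _ in range(n_genes):
        bio = [rng.gauss(0.0, 1.0) for _ in range(n_batches)]
        if max(bio) > 0:               # rule 1: positive in any batch
            any_positive += 1
        if sum(bio) / n_batches > 0:   # rule 2: positive average
            mean_positive += 1
    return any_positive / n_genes, mean_positive / n_genes


any_frac, mean_frac = simulate_null_gene(n_batches=3)
print(f"positive in any batch: {any_frac:.3f}")
print(f"positive on average:   {mean_frac:.3f}")
```

Under these assumptions the "any batch" rule retains close to 88% of null genes with 3 batches, while the average-based rule stays near 50% regardless of the number of batches.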