My PI is interested in comparing groups of cells on a cell-to-cell basis using pseudobulking, and our groups have differing numbers of cells. For that reason, I suggested factoring cell numbers into the normalization process so that we can also generate differential expression (DE) results at the tissue level. To illustrate, I made the following table to simulate some of our data. The example has two groups of cells, A and B, with 3 cells in group A and 6 cells in group B. The values are raw counts per cell. Groups A and B are the same cell type (let's say hepatocytes), but come from different samples corresponding to different experimental conditions.
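| Gene  | Group A (3 cells) | Group B (6 cells) |
|-------|-------------------|-------------------|
| Glul  | 9, 9, 9           | 0, 0, 0, 1, 0, 0  |
| Airn  | 6, 9, 10          | 1, 1, 2, 1, 3, 1  |
| Lgr5  | 7, 7, 8           | 4, 5, 5, 5, 4, 3  |
| Gapdh | 5, 5, 5           | 4, 5, 5, 5, 4, 4  |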
When these cells are pseudobulked by sample, the following table is generated.
| Gene  | Group A | Group B |
|-------|---------|---------|
| Glul  | 27      | 1       |
| Airn  | 25      | 9       |
| Lgr5  | 22      | 26      |
| Gapdh | 15      | 27      |
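For anyone who wants to reproduce the pseudobulk table, here is a minimal Python sketch that sums the per-cell counts within each group. The `counts` dictionary and the `pseudobulk` helper are names made up for this illustration, not part of any package.

```python
# Per-cell raw counts from the simulated example (3 cells in A, 6 cells in B).
counts = {
    "Glul":  {"A": [9, 9, 9],  "B": [0, 0, 0, 1, 0, 0]},
    "Airn":  {"A": [6, 9, 10], "B": [1, 1, 2, 1, 3, 1]},
    "Lgr5":  {"A": [7, 7, 8],  "B": [4, 5, 5, 5, 4, 3]},
    "Gapdh": {"A": [5, 5, 5],  "B": [4, 5, 5, 5, 4, 4]},
}


def pseudobulk(per_cell_counts):
    """Sum raw counts over all cells of each group, gene by gene."""
    return {
        gene: {group: sum(cells) for group, cells in groups.items()}
        for gene, groups in per_cell_counts.items()
    }


pb = pseudobulk(counts)
for gene, sums in pb.items():
    print(gene, sums["A"], sums["B"])
```

Note that the sums carry no information about how many cells went into each group, which is the gap the proposed adjustment is meant to address.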
Group A has about half as many cells as group B, while the total number of cells is approximately the same in both experimental conditions, which matches our lab data at the tissue level. I therefore proposed dividing the default normalization factors of group B by the value Z defined below to obtain tissue-level differential expression results.
X = ratio of group A hepatocytes to total cells in condition 1 = 30/300 = 0.10
Y = ratio of group B hepatocytes to total cells in condition 2 = 61/300 ≈ 0.20
Z = Y/X ≈ 2.0
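As a quick check, the ratios can be evaluated directly from the quoted cell counts. This is only a sketch; `proportion_ratio` is a hypothetical helper name used for illustration.

```python
def proportion_ratio(cells_a, total_a, cells_b, total_b):
    """Return (X, Y, Z): the two cell-type proportions and their ratio Y/X."""
    x = cells_a / total_a
    y = cells_b / total_b
    return x, y, y / x


# Cell counts quoted above: 30 of 300 cells in condition 1, 61 of 300 in condition 2.
x, y, z = proportion_ratio(30, 300, 61, 300)
print(f"X = {x:.3f}, Y = {y:.3f}, Z = {z:.3f}")
```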
I believe the default normalization factors allow for a cell-to-cell comparison between the pseudobulked samples. For comparisons at the tissue level, I think dividing group B's default normalization factor by Z should accurately reflect the strength of gene expression differences. My reasoning is that halving a group's normalization factor doubles that group's normalized expression, and when the proportion of a given cell type among all cells in a sample doubles, the total expression contributed by those cells should scale linearly.
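To make the mechanics concrete, here is a small sketch of how dividing the normalization factor rescales normalized expression. It assumes an edgeR-style convention in which the effective library size is the library size multiplied by the normalization factor. The default factor of 1.0 and Z = 2.0 are placeholder values, and `normalized_expression` is a hypothetical helper, not a library function.

```python
def normalized_expression(count, library_size, norm_factor):
    """Scale a count by the effective library size (library_size * norm_factor)."""
    return count / (library_size * norm_factor)


# Pseudobulk counts for group B from the example.
group_b = {"Glul": 1, "Airn": 9, "Lgr5": 26, "Gapdh": 27}
library_size_b = sum(group_b.values())

default_factor = 1.0  # placeholder (assumption), not an estimated factor
z = 2.0               # illustrative value of Z (assumption)
adjusted_factor = default_factor / z

before = normalized_expression(group_b["Lgr5"], library_size_b, default_factor)
after = normalized_expression(group_b["Lgr5"], library_size_b, adjusted_factor)

# Dividing the factor by Z multiplies the normalized value by Z.
print(after / before)
```

The same factor of Z is applied to every gene in group B, so the adjustment shifts all of that sample's log fold changes by the same amount.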
In other words, suppose a B cell expresses a particular gene twice as strongly as an A cell, and B cells are also twice as abundant in their sample as A cells are in theirs. Then the total fold change for that gene between all A cells and all B cells should be 4 (2x per cell times 2x as many B cells as A cells).
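The arithmetic in this example can be written out as a short sketch. The function name `tissue_level_fold_change` is made up for illustration, and the calculation assumes the two effects combine multiplicatively, as proposed above.

```python
import math


def tissue_level_fold_change(per_cell_fc, cell_proportion_ratio):
    """Combine per-cell fold change with the ratio of cell-type proportions."""
    return per_cell_fc * cell_proportion_ratio


per_cell_fc = 2.0   # B cell vs. A cell expression
cell_ratio = 2.0    # proportion of B cells vs. proportion of A cells

fc = tissue_level_fold_change(per_cell_fc, cell_ratio)

# On the log2 scale the two contributions add instead of multiply.
log2_fc = math.log2(per_cell_fc) + math.log2(cell_ratio)

print(fc, log2_fc)
```

On the log2 scale, the adjustment is therefore a constant offset of log2(Z) added to every gene's per-cell log fold change.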
Does this make sense? Please let me know your thoughts and suggestions. Best, Skanda