First of all, thanks for this great GSVA package.
When I ran GSVA (mx.diff=T) on healthy and diseased samples together, the enrichment scores were quite different from those I got when I ran GSVA on the healthy and diseased samples separately.
For example, when I plotted the distribution of enrichment scores (ES) for the diseased samples after running GSVA on them separately from the healthy samples, I got a bimodal distribution (one peak somewhere at ES > 0 and one somewhere at ES < 0).
When the diseased samples were run through GSVA together with the healthy samples, the distribution of ES for the diseased samples was mostly unimodal, for example with ES > 0 for a gene set whose genes should be upregulated in disease vs. healthy.
I looked at Fhat in equation (1) of the GSVA paper and see that it depends on the distribution of the gene's expression across all samples. So I tried to map out the intermediate steps to understand what each output is. Basically, I see that when a gene is highly expressed compared to other genes within a diseased sample, but lowly expressed compared to the healthy samples and similarly expressed to the other diseased samples, its rank within the sample drops. However, when no healthy samples are provided, the rank of the gene within the sample is higher.
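To make this concrete, here is a toy sketch of how I understand the Fhat step for a single gene. It is not GSVA's actual code: the expression values are made up, the function names are my own, and the bandwidth of one quarter of the gene's standard deviation is my reading of the paper. It only covers the per-gene kernel CDF estimate, not the subsequent ranking within each sample or the ES calculation.

```python
import math
import statistics

def gaussian_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kernel_cdf(value, cohort):
    """Gaussian-kernel estimate of one gene's CDF at `value`,
    built from that gene's expression in every sample of `cohort`.
    Assumption: bandwidth = (sample standard deviation of the gene) / 4."""
    h = statistics.stdev(cohort) / 4.0
    return sum(gaussian_cdf((value - x) / h) for x in cohort) / len(cohort)

# Hypothetical log-expression values of ONE gene (made up for illustration).
diseased = [5.0, 5.2, 4.8, 5.1, 4.9]
healthy = [8.0, 8.3, 7.9, 8.1, 8.2]

x = diseased[1]  # the diseased sample with the highest value of this gene

alone = kernel_cdf(x, diseased)               # cohort = diseased samples only
together = kernel_cdf(x, diseased + healthy)  # cohort = diseased + healthy

print(f"diseased only:      {alone:.3f}")
print(f"diseased + healthy: {together:.3f}")
```

With these made-up numbers, the statistic for the same sample is clearly lower when the healthy samples are part of the cohort than when the diseased samples are analysed alone, which matches the drop in rank I described above.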
So, when only diseased samples are included, does GSVA "exaggerate" the differences between diseased samples in the expression of the gene-set genes, and does that produce the bimodal distribution of ES?
When diseased and healthy samples are run together, the expression of the gene-set genes in the healthy samples should be quite different from that in the diseased samples, so the genes that differ between the two groups get more weight, and the resulting ES separates diseased from healthy.
Do I have the right picture, or am I misunderstanding something?