I would like to know exactly which genes should go into the expression matrix given to GSVA (the expression matrix, not the gene sets). I am working with single-cell data and have marker genes for two groups, and I would like to find the pathways that differ between these two groups. At first I thought that, since I have already identified marker genes (about 100 for each group), it would make sense to build the expression matrix from the union of the two marker gene sets and use that as the GSVA input. But after reading more about how GSVA/GSEA work, I now think the entire raw set of genes (about 14,000) should be used to form the expression matrix, so that the enrichment results are stronger and are computed against more appropriate background distributions.
Is my understanding correct?
It would be great if somebody could explain exactly which genes should go into the GSVA expression matrix. Is it better to give a restricted list or the entire list? Does "the more the merrier" apply here?
The question of having "more appropriate background distributions" has more to do with having a sufficient sample size for the first step, in which GSVA evaluates whether a gene is highly or lowly expressed in a sample. But I understand your question is rather about the number of input genes. As described in the GSVA paper, "GSVA calculates sample-wise gene set enrichment scores as a function of genes inside and outside the gene set, analogously to a competitive gene set test". This means that the GSVA scores depend on which genes fall inside *and* outside the gene set. To take an extreme example, if you had only genes from one gene set, there would be no way for the GSVA method to evaluate the difference between being inside and outside the gene set. On the other hand, if you have all possible genes, and a substantial fraction of them are not expressed, gene sets including those unexpressed genes will have scores close to 0. In general, I recommend filtering out genes in much the same way as you would in a differential expression analysis.
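To make the filtering suggestion concrete, here is a minimal sketch of one common way to drop unexpressed genes before scoring. It is written in Python with NumPy only to illustrate the idea (GSVA itself is an R/Bioconductor package), and the function name `filter_expressed_genes`, the toy matrix, and the thresholds `min_expr` and `min_fraction` are all assumptions made for this example, not values taken from the GSVA paper or package. Suitable cutoffs depend on your data.

```python
import numpy as np


def filter_expressed_genes(expr, gene_names, min_expr=1.0, min_fraction=0.05):
    """Keep genes expressed at >= min_expr in at least min_fraction of cells.

    expr is a genes x cells matrix. Both thresholds are illustrative
    assumptions and should be tuned to the dataset at hand.
    """
    expr = np.asarray(expr, dtype=float)
    if expr.shape[0] != len(gene_names):
        raise ValueError("expr must have one row per entry in gene_names")

    # Fraction of cells in which each gene reaches the expression threshold.
    frac_expressing = (expr >= min_expr).mean(axis=1)
    keep = frac_expressing >= min_fraction

    kept_names = [g for g, k in zip(gene_names, keep) if k]
    return expr[keep], kept_names


# Toy genes x cells matrix (hypothetical values).
expr = np.array([
    [5, 0, 3, 7],  # geneA: expressed in 3 of 4 cells
    [0, 0, 0, 0],  # geneB: never expressed
    [0, 1, 0, 0],  # geneC: expressed in 1 of 4 cells
    [2, 2, 2, 2],  # geneD: expressed in all cells
])
genes = ["geneA", "geneB", "geneC", "geneD"]

filtered, kept = filter_expressed_genes(expr, genes, min_expr=1.0, min_fraction=0.5)
print(kept)
print(filtered.shape)
```

The point is that the filter depends only on whether a gene is detectably expressed, not on whether it is a marker for either group, so the genes outside each gene set still provide the background that a competitive-style score needs.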