Regarding the approach of classifying tumor samples into subtypes using gene expression data and different gene sets, my question is: how meaningful is the approach when not all the genes in a gene set are found in the samples' expression data? I believe this may significantly affect the results, yet it is usually not reported when results are presented, for example when doing GSVA. This happens to me quite often, in particular with data from TCGA or ICGC. So far, I have only done this with the GSVA package, using method 'gsva'. Has anyone tested this issue?
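To make the problem concrete, here is a minimal sketch of the kind of overlap check I have in mind, so that the fraction of each gene set actually present in the data could be reported alongside the scores. It is plain Python and not part of the GSVA package; the function name `gene_set_coverage`, the gene set names, and the toy gene lists are all made up for illustration.

```python
def gene_set_coverage(gene_sets, measured_genes):
    """For each gene set, count how many of its genes are present in the data.

    gene_sets: dict mapping a gene set name to a list of gene symbols.
    measured_genes: iterable of gene symbols found in the expression matrix.
    """
    measured = set(measured_genes)
    report = {}
    for name, genes in gene_sets.items():
        unique = set(genes)          # ignore duplicated symbols within a set
        found = unique & measured    # genes of the set that are in the data
        report[name] = {
            "n_total": len(unique),
            "n_found": len(found),
            "fraction_found": len(found) / len(unique) if unique else 0.0,
            "missing": sorted(unique - measured),
        }
    return report


# Hypothetical toy inputs: two subtype signatures and the genes in a dataset.
gene_sets = {
    "subtype_A": ["TP53", "EGFR", "MYC", "KRAS"],
    "subtype_B": ["BRCA1", "BRCA2", "ESR1"],
}
measured_genes = ["TP53", "EGFR", "BRCA1", "ESR1", "BRCA2", "GAPDH"]

coverage = gene_set_coverage(gene_sets, measured_genes)
for name, stats in coverage.items():
    print(name, stats["n_found"], "of", stats["n_total"], "missing:", stats["missing"])
```

In practice the row names of the expression matrix would take the place of `measured_genes`, after making sure both sides use the same identifier type (for example, gene symbols versus Ensembl IDs), since mismatched identifiers are one possible cause of genes appearing to be missing.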
Note: cross-posted to Biostars.