I'm planning to run ssGSEA on TPM expression data from an RNA-seq analysis. My dataset includes around 60,000 genes, encompassing not just protein-coding genes but also other biotypes like miRNA, lncRNA, pseudogenes, etc. Since the hallmark gene sets from MSigDB only consist of protein-coding genes, should I filter my expression data to include only protein-coding genes before running the ssGSEA analysis? Or does the GSVA package automatically handle the exclusion of non-protein-coding genes when calculating enrichment scores? What would be the best practice here?
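
To make the question concrete, here is a minimal sketch of the filtering step I have in mind. It is written in plain Python purely for illustration, not in R, and does not call GSVA. The TPM values and biotype labels are made up, and `filter_protein_coding` is a hypothetical helper, not a function from any package.

```python
# Hypothetical toy data: gene symbol -> TPM values for two samples (made-up numbers).
tpm = {
    "TP53": [12.5, 30.1],
    "MYC": [8.0, 15.2],
    "MALAT1": [250.0, 310.4],  # lncRNA
    "MIR21": [5.5, 2.1],       # miRNA
}

# Hypothetical biotype annotation, e.g. as it might be read from a GTF file.
biotype = {
    "TP53": "protein_coding",
    "MYC": "protein_coding",
    "MALAT1": "lncRNA",
    "MIR21": "miRNA",
}


def filter_protein_coding(expr, annotation):
    """Return only the genes annotated as protein_coding; others are dropped."""
    return {
        gene: values
        for gene, values in expr.items()
        if annotation.get(gene) == "protein_coding"
    }


filtered = filter_protein_coding(tpm, biotype)
print(sorted(filtered))
```

The filtered matrix is what I would pass to ssGSEA in the first scenario; in the second scenario I would pass the full matrix unchanged.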