I'm planning to run ssGSEA (via the GSVA package) on TPM expression data from an RNA-seq analysis. My dataset includes around 60,000 genes, covering not just protein-coding genes but also other biotypes such as miRNA, lncRNA, and pseudogenes. Since the hallmark gene sets from MSigDB consist only of protein-coding genes, should I filter my expression data down to protein-coding genes before running ssGSEA? Or does the GSVA package automatically exclude non-protein-coding genes when calculating enrichment scores? What would be the best practice here?
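
To make the first option concrete, the filtering I have in mind would look roughly like the sketch below. It is plain Python with made-up values: `filter_by_biotype`, the toy `tpm` table, and the `biotypes` mapping are all placeholders for illustration. In practice the biotype annotation would come from the GTF or Ensembl, and the filtered matrix would then be passed to GSVA in R.

```python
# Hypothetical sketch: keep only genes annotated as protein-coding
# before handing the expression matrix to an enrichment method.

def filter_by_biotype(tpm, biotypes, keep="protein_coding"):
    """Return the rows of `tpm` whose gene has biotype `keep`.

    tpm      -- dict mapping gene symbol -> list of per-sample TPM values
    biotypes -- dict mapping gene symbol -> biotype string
    Genes with no annotation in `biotypes` are dropped.
    """
    return {
        gene: values
        for gene, values in tpm.items()
        if biotypes.get(gene) == keep
    }


# Toy data (made-up values, two samples per gene).
tpm = {
    "TP53": [12.1, 8.4],
    "MALAT1": [250.0, 310.5],
    "MIR21": [3.2, 0.0],
    "GAPDH": [900.3, 870.9],
}
biotypes = {
    "TP53": "protein_coding",
    "MALAT1": "lncRNA",
    "MIR21": "miRNA",
    "GAPDH": "protein_coding",
}

filtered = filter_by_biotype(tpm, biotypes)
print(sorted(filtered))  # ['GAPDH', 'TP53']
```

The alternative would be to skip this step and pass the full ~60,000-gene matrix as it is.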
