Question

Should the expression data be filtered for protein-coding genes before running ssGSEA?

Bioinformagician • 0

I'm planning to run ssGSEA on TPM expression data from an RNA-seq analysis. My dataset includes around 60,000 genes, encompassing not just protein-coding genes but also other biotypes like miRNA, lncRNA, pseudogenes, etc. Since the hallmark gene sets from MSigDB only consist of protein-coding genes, should I filter my expression data to include only protein-coding genes before running the ssGSEA analysis? Or does the GSVA package automatically handle the exclusion of non-protein-coding genes when calculating enrichment scores? What would be the best practice here?

ssgsea GSVA gsea • 1.3k views

updated 17 months ago by Axel Klenk ★ 1.1k • written 17 months ago by Bioinformagician • 0

score 0 · Answer 1 · 2024-08-08

Hi,

The GSVA package has no notion of protein-coding genes. The only genes it removes automatically are those with zero variance across all samples (or across all non-zero values for sparse data). The ssGSEA method is an exception: genes with constant expression trigger a warning but are not removed automatically, so it is up to the user to remove them if they wish.
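
To illustrate what removing constant-expression genes yourself before ssGSEA could look like, here is a minimal sketch in plain Python rather than R. It is not GSVA's own code; the toy matrix, the gene names, and the helper `drop_constant_genes` are all made up for the example.

```python
# Toy TPM matrix: gene ID -> expression values across samples (made-up numbers).
tpm = {
    "GENE_A": [12.0, 15.5, 9.8],
    "GENE_B": [0.0, 0.0, 0.0],   # constant: zero in every sample
    "GENE_C": [3.2, 3.2, 3.2],   # constant: same non-zero value everywhere
    "GENE_D": [0.4, 7.1, 2.6],
}


def drop_constant_genes(expr):
    """Return a copy of expr without genes that have identical values in all samples.

    Hypothetical helper for illustration; a gene is constant (zero variance)
    exactly when its minimum and maximum across samples coincide.
    """
    return {
        gene: values
        for gene, values in expr.items()
        if max(values) != min(values)
    }


filtered = drop_constant_genes(tpm)
print(sorted(filtered))
```

Only the genes that actually vary across samples remain, which is the filtering ssGSEA leaves to the user.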

We usually recommend that, in addition, users filter out genes with very low expression values to reduce noise as well as memory footprint and computation time.
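
As a rough sketch of such a low-expression filter (again in plain Python, not R), one option is to keep a gene only if it reaches a minimum TPM in a minimum fraction of samples. The cutoffs `min_tpm=1.0` and `min_fraction=0.5`, the helper name, and the data are assumptions chosen for the example, not thresholds recommended by GSVA.

```python
def filter_low_expression(expr, min_tpm=1.0, min_fraction=0.5):
    """Keep genes with TPM >= min_tpm in at least min_fraction of samples.

    Hypothetical helper; both thresholds are illustrative assumptions and
    should be tuned to the dataset at hand.
    """
    kept = {}
    for gene, values in expr.items():
        n_expressed = sum(v >= min_tpm for v in values)
        if n_expressed >= min_fraction * len(values):
            kept[gene] = values
    return kept


# Toy TPM matrix: gene ID -> expression values across four samples (made up).
tpm = {
    "GENE_A": [12.0, 15.5, 9.8, 11.1],  # expressed in 4 of 4 samples
    "GENE_B": [0.0, 0.2, 0.0, 0.1],     # expressed in 0 of 4 samples
    "GENE_C": [0.5, 2.4, 1.3, 0.0],     # expressed in 2 of 4 samples
    "GENE_D": [0.0, 0.0, 3.0, 0.0],     # expressed in 1 of 4 samples
}

kept = filter_low_expression(tpm)
print(sorted(kept))
```

Requiring expression in a fraction of samples, rather than a high mean, avoids keeping a gene because of a single outlier sample.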

If exploratory analysis shows that certain biotypes have consistently higher (or lower) expression values than the protein-coding genes in the gene sets, I'd be inclined to remove those as well to avoid biasing the results, at least for methods ssGSEA and GSVA.
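
One possible way to run that exploratory check is to summarise expression per biotype and compare the summaries. The sketch below (plain Python, not R) assumes a gene-to-biotype mapping is available, for instance from the annotation used for quantification. The mapping, the gene names, the values, and the choice of median of per-gene mean log2(TPM + 1) are all assumptions made for illustration.

```python
from collections import defaultdict
from math import log2
from statistics import mean, median

# Toy TPM matrix and a hypothetical gene -> biotype annotation (made-up data).
tpm = {
    "PC_1": [7.0, 7.0, 7.0],
    "PC_2": [15.0, 15.0, 15.0],
    "PC_3": [3.0, 3.0, 3.0],
    "LNC_1": [0.0, 0.0, 0.0],
    "LNC_2": [1.0, 1.0, 1.0],
}
biotype = {
    "PC_1": "protein_coding",
    "PC_2": "protein_coding",
    "PC_3": "protein_coding",
    "LNC_1": "lncRNA",
    "LNC_2": "lncRNA",
}


def median_expression_by_biotype(expr, annotation):
    """Median over genes of the per-gene mean log2(TPM + 1), grouped by biotype."""
    per_biotype = defaultdict(list)
    for gene, values in expr.items():
        gene_level = mean(log2(v + 1.0) for v in values)
        # Genes missing from the annotation are grouped under "unknown".
        per_biotype[annotation.get(gene, "unknown")].append(gene_level)
    return {name: median(levels) for name, levels in per_biotype.items()}


summary = median_expression_by_biotype(tpm, biotype)
for name, level in sorted(summary.items()):
    print(name, round(level, 2))
```

A large, consistent gap between a biotype and the protein-coding genes would be the kind of signal that argues for dropping that biotype before scoring.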