Question

How should I run ssGSEA Hallmark Enrichment Analysis?

2

Entering edit mode

snijesh ▴ 200

@snijesh-20358

Last seen 7 months ago

India

Hello members,

I am currently working with TPM expression data obtained from RNA-seq analysis, and my dataset includes a diverse range of biotypes such as miRNA, lncRNA, pseudogenes, etc., resulting in a total of around 60,000 genes. As I intend to perform enrichment analysis (ssGSEA) using the hallmark gene list from MSigDB, I am faced with a crucial decision regarding whether to filter the data based on biotype='protein coding'.

Given the diverse nature of the genes in my dataset, I am uncertain about the potential impact of including non-protein coding biotypes on the enrichment analysis. Filtering by biotype='protein coding' seems like a logical step to focus on protein-coding genes relevant to the hallmark pathways, but I would like to seek the community's advice and experiences on this matter.

Here are some specific questions to guide the discussion:

In the context of hallmark pathway enrichment analysis, what are the potential advantages and disadvantages of including non-protein coding genes in the dataset?
Has anyone encountered similar scenarios with a diverse set of biotypes in RNA-seq data, and if so, what criteria did you use for gene filtering, especially concerning biotypes?
Are there specific biotypes, such as miRNA, lncRNA, or pseudogenes, that are known to significantly impact or contribute to hallmark pathway enrichment analysis?
How does the choice of gene filtering criteria, specifically regarding biotype, affect the biological interpretation of enrichment analysis results using hallmark gene sets?

I appreciate any insights, experiences, or recommendations the community can provide to help me make an informed decision on whether to filter my RNA-seq data by biotype='protein coding' for hallmark pathway enrichment analysis.

Thank you in advance for your assistance!

ssgsea_results = gsva(expr = as.matrix(expression_matrix),
                        gset.idx.list = gene_sets,
                        method = "ssgsea",
                        # kcdf = "Gaussian" if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, kcdf = "Poisson" on integer counts. 
                        kcdf = "Poisson",
                        # Minimum gene set size
                        min.sz = 15,
                        # Maximum gene set size
                        max.sz = 500,
                        # Compute Gaussian-distributed scores
                        mx.diff = TRUE,
                        # Don't print out the progress bar
                        verbose = T)

ssgsea GSVA biotype proteincoding msigdb • 2.1k views

ADD COMMENT • link updated 7 weeks ago by Ginny • 0 • written 11 months ago by snijesh ▴ 200

0

Entering edit mode

Hi, a follow up for your question, have you tried using fgsea library for the ssGSEA analysis or did you directly use gsva? I am particularly using the fgseaMultiLevel function and I imported the gene sets from msigdb. My question is: how do you preprocess your data before applying ssGSEA function to it? Do you perform normalization, log transformation or just input raw counts matrix after filtering?

ADD REPLY • link 5 months ago Tanvi • 0

0

Entering edit mode

Hi, the fgsea package does not implement the ssGSEA algorithm, but the GSVA package does, concretely, the original version by Barbie et al. (2009), described in the subsection "Signature Projection Method" from the Online Methods. I'd say probably ssGSEA works best with normalized logCPM or logTPM units of expression, but we did not develop ssGSEA, so you may get a more authorative answer in the official support site for ssGSEA, which I believe is this Google Group.

ADD REPLY • link 4 months ago Robert Castelo ★ 3.4k

score 2 · Answer 1 · 2023-11-29

Hi,

While you are not naming it explicitly in your question, from your tags and the line of code you've pasted, I assume you are interested in using the implementation of the ssGSEA method available in the GSVA Bioconductor package. In this package, the function gsva() is going to filter out rows in your expression_matrix object for which there are no corresponding identifiers in the gene_sets object. Therefore, unless I'm misunderstanding your question, there is no much to worry about non-protein-coding genes from the perspective of using the software. If there is an expression profile in your RNA-seq data set for a non-protein-coding gene, which also forms part of a gene set in the MSigDB Hallmark collection, then 'gsva()' will use it and I do not see a reason to exclude it. By the way, in the line of code you write, there is no need to specify the kcdf and mx.diff parameters when method="ssgsea", since kcdf and mx.diff only apply when method="gsva".

This has been a common misunderstanding throughout the years and in the last release of GSVA we have deprecated this interface, in favor of an object-oriented one that does not allow the user to do this anymore. For instance, in the new interface (available in GSVA 1.50, Bioc release 3.18), to use the ssGSEA method with the inputs you show in your question, you should write the following:

ssgseapar <- ssgseaParam(as.matrix(expression_matrix), gene_sets)
ssgsea_es <- gsva(ssgseapar)

You will find full details on how to use the new interface in the help page of gsva() and in the vignette of the package.

cheers,

robert.