Hello members,
I am currently working with TPM expression data obtained from RNA-seq analysis, and my dataset includes a diverse range of biotypes such as miRNA, lncRNA, pseudogenes, etc., resulting in a total of around 60,000 genes. As I intend to perform enrichment analysis (ssGSEA) using the hallmark gene list from MSigDB, I am faced with a crucial decision regarding whether to filter the data based on biotype='protein coding'.
Given the diverse nature of the genes in my dataset, I am uncertain about the potential impact of including non-protein coding biotypes on the enrichment analysis. Filtering by biotype='protein coding' seems like a logical step to focus on protein-coding genes relevant to the hallmark pathways, but I would like to seek the community's advice and experiences on this matter.
Here are some specific questions to guide the discussion:
In the context of hallmark pathway enrichment analysis, what are the potential advantages and disadvantages of including non-protein coding genes in the dataset?
Has anyone encountered similar scenarios with a diverse set of biotypes in RNA-seq data, and if so, what criteria did you use for gene filtering, especially concerning biotypes?
Are there specific biotypes, such as miRNA, lncRNA, or pseudogenes, that are known to significantly impact or contribute to hallmark pathway enrichment analysis?
How does the choice of gene filtering criteria, specifically regarding biotype, affect the biological interpretation of enrichment analysis results using hallmark gene sets?
I appreciate any insights, experiences, or recommendations the community can provide to help me make an informed decision on whether to filter my RNA-seq data by biotype='protein coding' for hallmark pathway enrichment analysis.
Thank you in advance for your assistance!
library(GSVA)

ssgsea_results = gsva(expr = as.matrix(expression_matrix),
                      gset.idx.list = gene_sets,
                      method = "ssgsea",
                      # kcdf is only used when method = "gsva": "Gaussian" for continuous values such as log-CPMs, log-RPKMs or log-TPMs, "Poisson" for integer counts.
                      kcdf = "Gaussian",
                      # Minimum gene set size
                      min.sz = 15,
                      # Maximum gene set size
                      max.sz = 500,
                      # mx.diff = TRUE gives approximately Gaussian-distributed scores, but it also only applies to method = "gsva"
                      mx.diff = TRUE,
                      # Print progress messages
                      verbose = TRUE)
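For anyone weighing the filter, here is a minimal base-R sketch of what subsetting to protein-coding genes could look like before the gsva call above. The annotation table, its column names (gene_name, gene_biotype), the toy values and the TOY_SET gene set are all made up for illustration; a real table would typically come from the annotation used for quantification. Since ssGSEA ranks the genes in the matrix within each sample, the rows kept here form the background each gene set is scored against.

```r
# Toy TPM matrix: rows are genes, columns are samples (values are made up).
expression_matrix <- matrix(
  c(10.5, 0.0, 3.2, 250.1,   # sample1
    12.1, 0.4, 2.8, 198.7),  # sample2
  nrow = 4,
  dimnames = list(c("TP53", "MIR21", "MALAT1", "GAPDH"),
                  c("sample1", "sample2"))
)

# Hypothetical annotation table; the column names are assumptions.
gene_annotation <- data.frame(
  gene_name    = c("TP53", "MIR21", "MALAT1", "GAPDH"),
  gene_biotype = c("protein_coding", "miRNA", "lncRNA", "protein_coding"),
  stringsAsFactors = FALSE
)

# Names of the genes annotated as protein coding.
coding_genes <- gene_annotation$gene_name[
  gene_annotation$gene_biotype == "protein_coding"
]

# Keep only those rows; drop = FALSE preserves the matrix shape.
expression_coding <- expression_matrix[
  rownames(expression_matrix) %in% coding_genes, , drop = FALSE
]

# Toy gene set (not a real Hallmark set): count how many members
# are still present in the matrix after filtering.
gene_sets <- list(TOY_SET = c("TP53", "GAPDH", "MYC"))
genes_found <- sapply(gene_sets,
                      function(gs) sum(gs %in% rownames(expression_coding)))
print(genes_found)
```

Running the same membership count on the unfiltered matrix shows how many gene set members the filter would remove, which is one way to judge whether filtering changes anything for the gene sets of interest.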
Hi, a follow-up to your question: have you tried the fgsea library for the ssGSEA analysis, or did you use gsva directly? I am using the fgseaMultiLevel function with gene sets imported from MSigDB. My question is: how do you preprocess your data before applying the ssGSEA function? Do you normalize, log-transform, or just input the raw count matrix after filtering?
Hi, the fgsea package does not implement the ssGSEA algorithm, but the GSVA package does, concretely, the original version by Barbie et al. (2009), described in the subsection "Signature Projection Method" from the Online Methods. I'd say ssGSEA probably works best with normalized logCPM or logTPM units of expression, but we did not develop ssGSEA, so you may get a more authoritative answer in the official support site for ssGSEA, which I believe is this Google Group.
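As a rough illustration of the logTPM units mentioned above, a base-R sketch of the transformation might look like the following. The pseudocount of 1 and the toy values are assumptions chosen for illustration, not a recommendation from the ssGSEA authors. A log transform is monotone, so it leaves the order of genes within each sample unchanged.

```r
# Toy TPM matrix: rows are genes, columns are samples (values are made up).
tpm <- matrix(
  c(0, 1, 3,       # sample1
    7, 15, 1023),  # sample2
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("sample1", "sample2"))
)

# log2(TPM + 1); the pseudocount of 1 is an assumption that keeps zeros at zero.
log_tpm <- log2(tpm + 1)
print(log_tpm)

# The within-sample ordering of genes is the same before and after.
same_order <- identical(rank(tpm[, "sample1"]), rank(log_tpm[, "sample1"]))
print(same_order)
```

The resulting log_tpm matrix could then be passed as the expr argument in place of the raw TPM matrix.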