Hello members,
I am working with TPM expression data from an RNA-seq analysis. The dataset covers a wide range of biotypes (miRNA, lncRNA, pseudogenes, etc.), about 60,000 genes in total. I plan to run single-sample enrichment analysis (ssGSEA) with the hallmark gene sets from MSigDB, and I need to decide whether to first filter the data to biotype='protein coding'.
I am unsure how including non-protein-coding biotypes would affect the enrichment results. Filtering to biotype='protein coding' seems like a logical way to focus on the protein-coding genes relevant to the hallmark pathways, but I would like to hear the community's advice and experience before committing to it.
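To make the filtering step concrete, here is a minimal sketch in base R of what I have in mind. The toy matrix and the annotation table `gene_annotation` (with columns `gene_id` and `gene_biotype`) are made-up placeholders, not my real data, and I am assuming Ensembl-style biotype labels such as `protein_coding`.

```r
# Toy TPM matrix: rows are genes, columns are samples (placeholder values).
expression_matrix <- matrix(
  c(10.5, 0.0, 3.2, 7.1,
    12.1, 0.4, 2.9, 6.5),
  nrow = 4,
  dimnames = list(c("TP53", "MIR21", "MALAT1", "MYC"), c("sample1", "sample2"))
)

# Hypothetical annotation table mapping each gene to its biotype.
gene_annotation <- data.frame(
  gene_id = c("TP53", "MIR21", "MALAT1", "MYC"),
  gene_biotype = c("protein_coding", "miRNA", "lncRNA", "protein_coding"),
  stringsAsFactors = FALSE
)

# Keep only the rows annotated as protein coding.
coding_ids <- gene_annotation$gene_id[gene_annotation$gene_biotype == "protein_coding"]
keep <- rownames(expression_matrix) %in% coding_ids
expression_coding <- expression_matrix[keep, , drop = FALSE]

# Tabulate the biotypes present in the matrix before filtering.
row_biotypes <- gene_annotation$gene_biotype[
  match(rownames(expression_matrix), gene_annotation$gene_id)
]
biotype_counts <- table(row_biotypes)
```

With real data, `biotype_counts` would show how much of the roughly 60,000 rows each biotype accounts for, which may help in judging how much the filter changes the input.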
Here are some specific questions to guide the discussion:
1. In the context of hallmark pathway enrichment analysis, what are the potential advantages and disadvantages of including non-protein-coding genes in the dataset?
2. Has anyone encountered similar scenarios with a diverse set of biotypes in RNA-seq data, and if so, what criteria did you use for gene filtering, especially concerning biotypes?
3. Are there specific biotypes, such as miRNA, lncRNA, or pseudogenes, that are known to significantly impact or contribute to hallmark pathway enrichment analysis?
4. How does the choice of gene filtering criteria, specifically regarding biotype, affect the biological interpretation of enrichment analysis results using hallmark gene sets?
I appreciate any insights, experiences, or recommendations the community can provide to help me make an informed decision on whether to filter my RNA-seq data by biotype='protein coding' for hallmark pathway enrichment analysis.
Thank you in advance for your assistance!
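For reference, this is the call I am currently using:

library(GSVA)

ssgsea_results <- gsva(
  expr = as.matrix(expression_matrix),
  gset.idx.list = gene_sets,
  method = "ssgsea",
  # kcdf: "Gaussian" for continuous values (log-CPM, log-RPKM, log-TPM),
  # "Poisson" for integer counts. My input is TPM, so "Gaussian" applies.
  # As far as I can tell, kcdf is only used when method = "gsva".
  kcdf = "Gaussian",
  # Minimum gene set size
  min.sz = 15,
  # Maximum gene set size
  max.sz = 500,
  # Use the difference of the maximum deviations as the enrichment statistic
  # (as I understand it, this also only applies when method = "gsva")
  mx.diff = TRUE,
  # Print progress messages
  verbose = TRUE
)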
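
Related to the questions above, one check I can run without GSVA is how many genes from each set are present in the matrix before and after filtering. A rough sketch in base R follows. Here `gene_sets` is a tiny made-up list standing in for the hallmark collection, and the gene vectors are placeholders for the row names of my full and filtered matrices. As far as I understand, ssGSEA ranks all genes within each sample, so the non-coding rows would still influence the ranks even when this overlap does not change.

```r
# Made-up stand-in for the hallmark gene sets (named list of gene symbols).
gene_sets <- list(
  SET_A = c("TP53", "MYC", "CDKN1A"),
  SET_B = c("MYC", "MALAT1")
)

# Placeholder row names of the full matrix and of the protein-coding subset.
all_genes <- c("TP53", "MIR21", "MALAT1", "MYC")
coding_genes <- c("TP53", "MYC")

# Count, per gene set, how many members are present in a given gene universe.
count_overlap <- function(sets, universe) {
  vapply(sets, function(genes) sum(genes %in% universe), integer(1))
}

# One row per gene set: original size and overlap with each universe.
overlap <- data.frame(
  set_size = lengths(gene_sets),
  in_full = count_overlap(gene_sets, all_genes),
  in_coding = count_overlap(gene_sets, coding_genes)
)
print(overlap)
```

If `in_coding` is close to `in_full` for the real hallmark sets, the filter would mainly change the background of ranked genes rather than set membership. My understanding is that sets whose overlap falls below `min.sz` would be dropped from the results.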