Should 0 values for gene counts be removed prior to ssGSEA/GSVA analysis?
João • 0
Last seen 21 days ago
United Kingdom

Since the ssGSEA/GSVA algorithms work by determining how much more highly expressed the genes in our gene list are compared to all other genes within the sample, should we remove genes with 0 counts from each individual sample before running the algorithm?

Say gene x is present in sample 1 but not sample 2, should we omit it from sample 2's calculations but keep it for sample 1? (i.e. replace all 0 with "NA")

In theory, if we have 2 samples with exactly the same expression of our genes of interest, but sample 1 has 1000 genes with non-zero counts while sample 2 has 500 genes with non-zero counts and 500 genes with zero counts, not removing the 0s would give the same score to both samples, even though sample 2 clearly behaves differently.

Should we remove these 0 count genes?

GSEA GSVA • 185 views
Robert Castelo ★ 2.9k
Last seen 13 days ago
Barcelona/Universitat Pompeu Fabra

Hi, we advise removing lowly expressed genes in the same way you would before a differential expression analysis. This has been discussed previously in this forum, see for instance this post or this other one.
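As a rough illustration of that kind of filtering, here is a minimal Python sketch. It is not the GSVA package's code or any specific Bioconductor function; the helper `filter_low_expression`, the thresholds `min_count=10` and `min_samples=2`, and the toy counts are all assumptions made up for this example. The point it shows is that the filter acts on whole genes across the cohort, so a gene with 0 counts in one sample but high counts elsewhere is kept for every sample.

```python
def filter_low_expression(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    `counts` maps a gene name to its list of per-sample counts.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = {}
    for gene, values in counts.items():
        n_expressed = sum(1 for v in values if v >= min_count)
        if n_expressed >= min_samples:
            kept[gene] = values
    return kept


# Hypothetical toy counts: four genes measured in four samples.
counts = {
    "geneA": [0, 1020, 980, 1005],  # zero in sample 1 only, high elsewhere
    "geneB": [0, 0, 1, 0],          # essentially unexpressed everywhere
    "geneC": [250, 310, 275, 290],
    "geneD": [3, 0, 12, 2],         # passes the threshold in one sample only
}

filtered = filter_low_expression(counts)
print(sorted(filtered))  # ['geneA', 'geneC']
```

With these assumed thresholds, `geneA` survives with its zero in sample 1 intact, while `geneB` and `geneD` are dropped from all samples at once.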


Thank you for your answer, but I am afraid this does not answer my question.

For example, if gene A has 0 counts in sample 1 and around 1000 counts in all other samples, it is not a lowly expressed gene overall, so I cannot simply remove it from all samples.

However, I will be skewing the results for sample 1 if I include that gene in my calculations, whether or not that gene is in my gene set (for this example, let's assume it is not). On the other hand, if I remove gene A from the calculation for sample 1 alone, the result will also be skewed, because a smaller number of "outside" genes is taken into account when calculating the final score.

I understand that removing a single gene has a negligible effect overall but if we apply this reasoning to all genes in our samples, it could really skew our results.
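To make the two options concrete, here is a small Python sketch. It does not implement ssGSEA or GSVA; `toy_score` is a deliberately simplified stand-in (the mean within-sample rank percentile of the gene-set genes), and the gene names and counts are invented for illustration. It only shows that, for one and the same sample, keeping the zero-count genes and dropping them per sample lead to different background sizes and therefore different scores.

```python
def average_ranks(values):
    """Return 1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def toy_score(sample, gene_set, drop_zeros=False):
    """Mean rank percentile of gene-set genes among the genes retained.

    A simplified stand-in for a rank-based enrichment score, NOT ssGSEA/GSVA.
    Returns None if no gene-set gene remains after dropping zeros.
    """
    genes = [g for g, v in sample.items() if not (drop_zeros and v == 0)]
    ranks = average_ranks([sample[g] for g in genes])
    in_set = [r / len(genes) for g, r in zip(genes, ranks) if g in gene_set]
    return sum(in_set) / len(in_set) if in_set else None


# Hypothetical sample: two gene-set genes (s1, s2) and six background genes,
# four of which have zero counts.
sample = {"s1": 50, "s2": 60, "b1": 0, "b2": 0, "b3": 0, "b4": 0,
          "b5": 100, "b6": 200}
gene_set = {"s1", "s2"}

keep = toy_score(sample, gene_set)                   # zeros tie at the bottom
drop = toy_score(sample, gene_set, drop_zeros=True)  # zeros removed per sample
print(keep, drop)  # 0.6875 0.375
```

In this toy setting the kept zeros sit as ties at the bottom of the ranking, whereas dropping them shrinks the ranked list from 8 genes to 4, so scores from samples with different numbers of zeros would be computed against different backgrounds.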

Therefore the question is: which of these approaches gives a more meaningful result and is there any other approach that I am not thinking of to solve this?

