Should 0 values for gene counts be removed prior to ssGSEA/GSVA analysis?
João • 0
@9504eb7d
United Kingdom

Since the ssGSEA/GSVA algorithms work by determining how much more highly expressed the genes in our gene list are compared to all other genes within the sample, should we remove genes with 0 counts in each individual sample before running the algorithm?

Say gene x is present in sample 1 but not sample 2, should we omit it from sample 2's calculations but keep it for sample 1? (i.e. replace all 0 with "NA")

In theory, if we have 2 samples with exactly the same expression of our genes of interest, but sample 1 has 1000 genes with non-zero values while sample 2 has 500 non-zero genes and 500 zero-count genes, then keeping the 0s would give the same score to both samples, even though sample 2 clearly behaves differently.
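To make this scenario concrete, here is a minimal sketch in plain Python. It uses a toy rank-based score (mean rank of the gene-set genes divided by the number of genes ranked), which is an assumption made purely for illustration and is not the ssGSEA or GSVA statistic. The gene names, the counts, and the `toy_rank_score` helper are all hypothetical.

```python
def average_ranks(values):
    """Return 1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def toy_rank_score(sample, gene_set, drop_zeros=False):
    """Toy score (NOT ssGSEA/GSVA): mean rank of set genes / genes ranked."""
    genes = [g for g, v in sample.items() if not (drop_zeros and v == 0)]
    ranks = dict(zip(genes, average_ranks([sample[g] for g in genes])))
    in_set = [ranks[g] for g in gene_set if g in ranks]
    return sum(in_set) / len(in_set) / len(genes)


# Hypothetical data: 1000 background genes plus 3 highly expressed set genes.
gene_set = ["s1", "s2", "s3"]
sample1 = {f"g{i}": i for i in range(1, 1001)}
# Sample 2: the lower half of the background genes has 0 counts.
sample2 = {f"g{i}": (0 if i <= 500 else i) for i in range(1, 1001)}
for s in (sample1, sample2):
    s.update({"s1": 2000, "s2": 3000, "s3": 4000})

kept1 = toy_rank_score(sample1, gene_set)                      # zeros kept
kept2 = toy_rank_score(sample2, gene_set)                      # zeros kept
dropped2 = toy_rank_score(sample2, gene_set, drop_zeros=True)  # zeros removed

print(kept1 == kept2)     # both samples get the same toy score
print(dropped2 != kept2)  # dropping zeros changes sample 2's score
```

In this construction the set genes outrank every background gene, so keeping the zeros gives both samples an identical toy score, while dropping them per sample shrinks the background and shifts the score for sample 2. How far this carries over to the real algorithms is not something the sketch can show.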

Should we remove these 0 count genes?

GSEA GSVA • 178 views
Robert Castelo ★ 2.9k
@rcastelo
Barcelona/Universitat Pompeu Fabra

Hi, we advise removing lowly-expressed genes in the same way you would before a differential expression analysis. This has been previously discussed in this forum, see for instance this post or this other one.
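As a rough illustration of that kind of filter, the sketch below keeps a gene only if its counts-per-million (CPM) value reaches a threshold in a minimum number of samples. It is a simplified stand-in written in plain Python, not the filter of any particular package; the `filter_low_expression` helper, the gene names, the counts, and the thresholds are all hypothetical.

```python
def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM reaches min_cpm in at least min_samples samples.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    The thresholds are illustrative assumptions, not recommended defaults.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts per sample (column sums).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = {}
    for gene, row in counts.items():
        cpm = [1e6 * c / lib for c, lib in zip(row, lib_sizes)]
        if sum(v >= min_cpm for v in cpm) >= min_samples:
            kept[gene] = row
    return kept


# Hypothetical raw counts for 4 samples.
counts = {
    "geneA": [0, 1000, 950, 1100],              # zero in one sample only
    "geneB": [0, 0, 1, 0],                      # lowly expressed everywhere
    "geneC": [500000, 499000, 499049, 498900],  # highly expressed everywhere
}
filtered = filter_low_expression(counts, min_cpm=10.0, min_samples=2)
print(sorted(filtered))
```

Because the filter is applied per gene across all samples, a gene that has 0 counts in a single sample but is well expressed in the others is retained, and every sample is then scored against the same set of genes.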


Thank you for your answer, but I am afraid this does not answer my question.

For example, if gene A has 0 counts in sample 1 and around 1000 counts in all other samples, it is not a lowly-expressed gene in general, so I cannot simply remove it from all samples.

However, I will be skewing the results for sample 1 if I include that gene in my calculations, whether or not that gene is in my gene set (for this example, let's assume it is not). But if I remove gene A from the calculation for sample 1 alone, the result will also be skewed, because a smaller number of total "outside" genes is taken into account when calculating the final score.

I understand that removing a single gene has a negligible effect overall, but if we apply this reasoning to all genes in our samples, it could really skew our results.

Therefore the question is: which of these approaches gives a more meaningful result, and is there any other approach to solving this that I have not thought of?
