Question

pre-filtering gene sets for GSVA/ssGSEA

igor ▴ 50

The expression matrix and gene sets for pathway analysis usually come from different sources. For GSVA/ssGSEA, how reasonable is it to restrict each gene set to only the genes that are present in the expression matrix? If certain genes (or gene symbols) are not in your reference or are not being detected for technical reasons, it makes sense to remove them. It looks like this is done in the original publication:

We further filtered genes with low expression by discarding those with a mean of less than 0.5 counts per million calculated in log2 scale ... After mapping genes from an experiment to the gene set database, we ignore all gene sets with fewer than 10 genes or more than 500 genes.

And it looks like it's done in the code automatically.
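
To make the filtering described in the quote concrete, here is a minimal Python sketch. It is not the GSVA package's own code: the function name `restrict_gene_sets`, the toy gene names, and the dictionary layout are assumptions made for illustration. Only the 10 and 500 size limits come from the quoted passage.

```python
def restrict_gene_sets(gene_sets, expressed_genes, min_size=10, max_size=500):
    """Map gene sets onto the expression matrix, then apply size limits.

    gene_sets: dict of set name -> list of gene identifiers (hypothetical layout).
    expressed_genes: identifiers of the rows of the expression matrix.
    """
    expressed = set(expressed_genes)
    filtered = {}
    for name, genes in gene_sets.items():
        # dict.fromkeys removes duplicates while preserving order
        kept = [g for g in dict.fromkeys(genes) if g in expressed]
        # Size limits are applied AFTER mapping, as in the quoted passage
        if min_size <= len(kept) <= max_size:
            filtered[name] = kept
    return filtered


# Toy data: 20 genes in the matrix, two gene sets with some unmapped symbols
expressed_genes = [f"G{i}" for i in range(1, 21)]
gene_sets = {
    "SET_A": [f"G{i}" for i in range(1, 13)] + ["X1", "X2"],  # 12 of 14 map
    "SET_B": ["G1", "G2", "X3"],                              # only 2 map
}
result = restrict_gene_sets(gene_sets, expressed_genes)
print({name: len(genes) for name, genes in result.items()})
```

In this toy case SET_A survives with its 12 mapped genes, while SET_B is dropped because only 2 of its genes are in the matrix.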

On the other hand, if you remove all the non-expressed genes, wouldn't that automatically make the gene set more enriched? For example, you get a certain score if 10 of 100 genes in a gene set are highly expressed. If you remove the other 90, now all genes in that gene set are highly expressed, which should increase the score. Is it just prioritizing false positives over false negatives?

updated 5.0 years ago by Robert Castelo ★ 3.4k • written 5.0 years ago by igor ▴ 50

score 1 · Accepted Answer · 2020-07-14

hi,

I would say that, just as in differential expression analysis, as long as your filtering criterion is independent of the summarizing statistic, you should be safe. If you are worried that your summarizing statistic is not representative of the gene set, you could further filter out gene sets with insufficient representation of genes, not only in absolute terms (e.g., at least 10 genes) but also in relative terms (e.g., at least 50% of the genes forming the gene set should be expressed in your dataset). This of course assumes that your gene set definition is accurate. What if the gene set is accurate for a particular tissue or experimental condition but is too large for the one you're studying? You could then be wrongly discarding the gene set with that strategy.
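
A rough Python sketch of the combined absolute and relative filter suggested above. The function name `coverage_filter`, the parameter names, and the toy data are hypothetical, and "expressed" is taken here to mean simply "present in the expression matrix", which is an assumption.

```python
def coverage_filter(gene_sets, expressed_genes, min_genes=10, min_fraction=0.5):
    """Keep gene sets that are sufficiently represented in the dataset.

    A set is kept only if at least `min_genes` of its genes are present
    (absolute criterion) AND those genes make up at least `min_fraction`
    of the full set (relative criterion).
    """
    expressed = set(expressed_genes)
    kept_sets = {}
    for name, genes in gene_sets.items():
        unique = set(genes)
        present = unique & expressed
        fraction = len(present) / len(unique) if unique else 0.0
        if len(present) >= min_genes and fraction >= min_fraction:
            kept_sets[name] = sorted(present)
    return kept_sets


# Toy data: 30 genes in the matrix
matrix_genes = [f"P{i}" for i in range(1, 31)]
candidate_sets = {
    # 12 of 16 genes present -> coverage 0.75, kept
    "WELL_COVERED": [f"P{i}" for i in range(1, 13)] + [f"M{i}" for i in range(4)],
    # 10 of 40 genes present -> coverage 0.25, dropped despite having 10 genes
    "POORLY_COVERED": [f"P{i}" for i in range(1, 11)] + [f"M{i}" for i in range(30)],
}
covered = coverage_filter(candidate_sets, matrix_genes)
print(sorted(covered))
```

Neither criterion looks at expression values across samples, so the filter stays independent of the enrichment score itself. The caveat above still applies: a low-coverage set may be poorly defined for the tissue rather than uninformative.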

cheers,

robert.