I'm using GSVA to get an overall activity/expression pattern of various gene sets using RNA-seq data. This is generally OK, but in my case, about half of the gene sets have only two genes, and I definitely don't want to discard them. As the GSVA help documentation recommends a minimum of 5 genes in the set, I was wondering if I'm getting useful/meaningful results at all for these very small gene sets. What do you think? If GSVA is not good in this case, are there any alternative methods that might be useful or should I just go with a summary of normalized TPM values for example?
Thanks for any suggestion!
Hi Robert!
Yes, they do look a bit problematic. Checking the distribution of GSVA scores for the two-gene sets shows that most of them have a value close to -1 or 1, with nothing in between.
I've seen that you had some other suggestions for summarizing expression values in this post.
Do you know of any review papers or benchmarks on this topic? I think we are going to try a few of these methods, besides the various options in GSVA.
We compared z-scores, the first right-singular vector and ssGSEA in the GSVA paper, and you will find them implemented in the GSVA package via the `method` argument. I've seen quite a few papers benchmarking GSEA methods, but I don't think they benchmark against having two genes per gene set, so I'm not sure how useful those papers may be for you.
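To make the z-score idea concrete, here is a minimal standalone sketch in plain Python. It is not the GSVA package's code. It assumes the usual combined z-score formulation: standardize each gene across samples, sum the z-scores of the genes in the set, and divide by the square root of the number of genes used. The function name `combined_zscores`, the gene names and the toy values are all made up for illustration, and the input is assumed to be already log-transformed (for example log2(TPM + 1)).

```python
import math
from statistics import mean, stdev


def combined_zscores(expr, gene_set):
    """Per-sample combined z-score for one gene set.

    expr     : dict mapping gene name -> list of expression values,
               one value per sample (assumed log-scale, e.g. log2(TPM + 1)).
    gene_set : iterable of gene names.
    Returns a list with one score per sample.
    """
    genes = [g for g in gene_set if g in expr]
    if not genes:
        raise ValueError("no genes from the set are present in the data")

    n_samples = len(expr[genes[0]])
    totals = [0.0] * n_samples
    used = 0
    for gene in genes:
        values = expr[gene]
        mu = mean(values)
        sd = stdev(values)  # sample standard deviation across samples
        if sd == 0:
            continue  # a constant gene carries no information, skip it
        used += 1
        for j, value in enumerate(values):
            totals[j] += (value - mu) / sd

    if used == 0:
        raise ValueError("all genes in the set are constant across samples")

    # Dividing by sqrt(k) keeps scores comparable across set sizes.
    return [total / math.sqrt(used) for total in totals]


# Hypothetical toy data: three samples, a two-gene set.
expr = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.0],
    "geneC": [5.0, 5.0, 5.0],
}
scores = combined_zscores(expr, ["geneA", "geneB"])
print(scores)
```

Because this works on the expression values themselves rather than on ranks, a two-gene set still gives a continuous score instead of piling up at the extremes, which may make it a reasonable thing to compare against the GSVA scores for the small sets.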