Question

Discrepancy in GSVA for Nanostring when using normalized counts vs rlog as input

0

hanyin.wang88 ▴ 10

@user-24750

Hey community members,

I have a question about using GSVA for Nanostring data (789 targeted genes per sample). I know there was a previous discussion here about using GSVA for Nanostring, where the main point was choosing an appropriate value for the kcdf argument (https://support.bioconductor.org/p/111096/).

My questions are: 1) In the literature, I have seen people use both GSVA and ssGSEA for Nanostring data, and I am not sure why. Is there a preferred method for targeted gene panels in general? 2) I used DESeq2 in my pipeline. Interestingly, I got very different GSVA results when using normalized counts as input (with kcdf = "Poisson") versus rlog-transformed values as input (with kcdf = "Gaussian"). How can this discrepancy be explained, and which input is better?

Many thanks for the kind help.

GSVA • 1.7k views

ADD COMMENT • link 3.2 years ago hanyin.wang88 ▴ 10

score 3 · Accepted Answer · 2021-02-10

3

Robert Castelo ★ 3.3k

@rcastelo

Barcelona/Universitat Pompeu Fabra

hi,

i have no experience with Nanostring data but i'll try to answer your questions about GSVA:

1) because GSVA attempts to calculate an expression statistic that brings genes with a different dynamic range to a common scale, it requires a minimum number of samples to obtain a robust estimate of that statistic. As a result of some statistical power simulations we recommend at least 10 samples (see the end of the first paragraph of the discussion of the GSVA paper). i could imagine that when researchers do not have such a minimum sample size, they rely on ssGSEA, which does not carry out that analysis step and therefore doesn't have that sample size requirement.

2) a very similar question was already discussed in a previous post, where i showed with a simulation that normalized counts and the corresponding normalized log-CPM units of expression were leading to nearly identical GSVA scores (r > 0.97). however, you're talking about the rlog transformation, which leads to something different from log-CPM units of expression. you would have to show us your code to further discuss this issue.
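To illustrate the sample-size point in 1), here is a minimal Python sketch of what "bringing genes with a different dynamic range to a common scale" can look like. It uses a plain empirical CDF across samples as a stand-in; GSVA itself uses a kernel-based estimate, so this is an assumption made for illustration and not GSVA's implementation. The function name and the expression values are hypothetical.

```python
from bisect import bisect_right


def ecdf_across_samples(values):
    """For one gene, return for each sample the fraction of samples
    whose expression is <= that sample's expression."""
    ordered = sorted(values)
    n = len(values)
    return [bisect_right(ordered, v) / n for v in values]


# Hypothetical expression of two genes across the same five samples.
# The genes have very different dynamic ranges but the same ordering.
low_range_gene = [5, 7, 6, 9, 8]
high_range_gene = [500, 900, 700, 1500, 1100]

low_scaled = ecdf_across_samples(low_range_gene)
high_scaled = ecdf_across_samples(high_range_gene)

print(low_scaled)
print(high_scaled)

# With only three samples the statistic can take just three values,
# which is one way to see why few samples give a coarse estimate.
few_samples = ecdf_across_samples([10, 30, 20])
print(few_samples)
```

Both genes end up on the same 0 to 1 scale, and the resolution of that scale depends on how many samples are available. A method that ranks genes within each sample separately does not need this across-sample step.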

you might be interested in reading this article (preprint version) on performing quality control and normalization of Nanostring data. also, if you search the Bioconductor software page, you'll find a number of packages for the analysis of Nanostring data. recently, the package NanoStringNCTools has been added to the devel version of Bioconductor, which will become the release version in the coming months.

cheers,

robert.

ADD COMMENT • link 3.2 years ago Robert Castelo ★ 3.3k

0

Thank you for the kind reply Robert. This is tremendously helpful.

As you sharply pointed out, I think the issue may be that I have a small sample size (n=10). In this case, I will look into ssGSEA, which may give more robust results.

Thanks for sharing the previous post. I will try to calculate the correlation (r) between the two sets of scores myself first before I bother you.
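One way to run that check is to compute a Pearson correlation per gene set between the scores obtained from the two inputs. The Python sketch below assumes the two score matrices have already been exported, with one list of per-sample scores per gene set; the gene set names and all numbers are made up for illustration and are not results from any real analysis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical enrichment scores: gene set -> one score per sample,
# with samples in the same order in both dictionaries.
scores_counts = {
    "SET_A": [0.10, -0.30, 0.45, 0.20],
    "SET_B": [-0.50, 0.25, 0.05, 0.40],
}
scores_rlog = {
    "SET_A": [0.12, -0.28, 0.40, 0.25],
    "SET_B": [0.30, -0.10, 0.20, -0.45],
}

# One correlation per gene set, computed across samples.
per_set_r = {
    name: pearson(scores_counts[name], scores_rlog[name])
    for name in scores_counts
}

for name, r in per_set_r.items():
    print(f"{name}: r = {r:.2f}")
```

Gene sets with a low or negative r are the ones worth inspecting first, since they show where the two inputs disagree.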

Thanks for pointing out the latest preprint article on Nanostring. I actually used the method described in this article for my analysis, and have communicated closely with the author. Their approach worked beautifully for the gene DE analysis. I am now wondering how to proceed with pathway analysis, which is why I looked into GSVA.

May I ask a separate question? In gsva(), the argument min.size is documented as "Minimum size of the resulting gene sets". However, from my testing, it appears that min.size defines the minimum number of genes in a gene set that overlap with the input data. For example, if I set min.size = 3, only gene sets with at least 3 genes overlapping with my input data will be included. Is this understanding correct?

Deeply appreciate the kind assistance.

ADD REPLY • link 3.2 years ago hanyin.wang88 ▴ 10

2

Yes, your understanding is correct. First, genes in gene sets are mapped to genes in the expression data, which implies that some gene sets may lose genes for which there are no expression profiles; some of those gene sets might even become empty. Second, gene sets are filtered by the given minimum and maximum sizes, which by default are set to 1 and infinity, respectively. If you are satisfied with the answer, it's good practice to upvote it and accept it. This not only shows appreciation for the work of developers giving support to their software, but also helps others to more easily identify questions that have already been answered. Thanks!
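The two steps described above (map, then filter by size) can be sketched in a few lines of Python. This is an illustration of the logic only, not GSVA's code: the function name `filter_gene_sets`, its parameters, and the small gene panel are hypothetical, and the defaults of 1 and infinity are taken from the description above.

```python
def filter_gene_sets(gene_sets, measured_genes, min_size=1, max_size=float("inf")):
    """Map each gene set to the measured genes, then keep only the
    sets whose mapped size lies within [min_size, max_size]."""
    measured = set(measured_genes)
    kept = {}
    for name, genes in gene_sets.items():
        # Step 1: drop genes that have no expression profile.
        mapped = [g for g in genes if g in measured]
        # Step 2: filter on the size AFTER mapping, not the original size.
        if min_size <= len(mapped) <= max_size:
            kept[name] = mapped
    return kept


# Hypothetical targeted panel and gene sets.
panel_genes = ["CD3E", "CD8A", "GZMB", "IFNG", "PDCD1"]
gene_sets = {
    "T_CELL": ["CD3E", "CD8A", "CD4", "CD2"],
    "CYTOTOXIC": ["GZMB", "PRF1", "IFNG", "CD8A", "NKG7"],
    "UNRELATED": ["ALB", "APOA1"],
}

kept_default = filter_gene_sets(gene_sets, panel_genes)
kept_min3 = filter_gene_sets(gene_sets, panel_genes, min_size=3)

print(sorted(kept_default))
print(sorted(kept_min3))
```

With the defaults, only the set that became empty after mapping is dropped. With `min_size=3`, a set that originally had four genes is also dropped because only two of them are on the panel, which matches the behaviour described in the question.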

ADD REPLY • link 3.2 years ago Robert Castelo ★ 3.3k

1

Thank you so much Robert! This is my first question in the forum and I deeply appreciate your guidance. Just learnt how to upvote and accept! Thanks for all the answers!

ADD REPLY • link 3.2 years ago hanyin.wang88 ▴ 10