Hi,
First of all, thanks for this great GSVA package.
When I ran GSVA (mx.diff=T) on healthy and diseased samples together, the enrichment scores are quite different from when I ran GSVA on healthy and diseased samples separately.
For example, when I plotted the distribution of enrichment scores for diseased samples when the samples were ran using GSVA separately from the healthy samples, i get a bimodal (a peak at somewhere ES > 0 and one at somewhere ES < 0 ).
When the diseased samples were ran using GSVA together with the healthy samples, the distribution of ES for diseased is mostly unimodal, for example ES > 0 (for genes in gene set that should be upregulated in disease vs. in healthy).
I looked at Fhat GSVA paper equation (1) and see that it is dependent on the distribution of the gene in all samples. So i tried to map out the intermediate steps to understand what each output is-- basically I see that when a gene that is expressed highly when compared to other genes within a diseased sample itself, but lowly expressed as compared to healthy samples and similarly expressed to other disease samples, the ranking of the gene dropped within the sample. However, when no healthy samples were provided, the rank of the gene is higher within the sample.
So, when only diseased samples are included, does it "exaggerate" the difference of the gene expression present in the gene set between diseased samples, hence resulting in the bimodal distribution of ES?
When diseased samples vs. healthy are ran together, technically healthy gene expression in the gene set should be quite different from diseased, hence it gives some weight to the genes in the diseased /healthy that results in the ES separating out disease from healthy.
Do i have the right picture or am I misunderstanding something?
Thanks,
Jenn
Hi Robert,
Thanks for the insights. To answer your questions,
I was referring to the distribution of scores across samples. https://ibb.co/7pT6b31 Lowly expressed have not been filtered out, i.e. I included all genes regardless of their expression. I have 83 tissue biopsy samples ( 36 healthy and 47 disease). They have been normalized by using neqc in limma because these are illumina bead chips microarrays data. 182 genes in Gene set A.
Thanks again!
Hi Jenn,
I would recommend to filter out lowly-expressed genes. Imagine a gene set for which a substantial fraction of its genes are not expressed above the detection threshold of the instrument throughout all the samples. In such a case, we are trying to get an estimate of relative expression for that gene set using a fraction of the data that is, in the case of a microarray, essentially fluorescence noise. Subsection 17.3 from the limma User's Guide illustrates how to process Illumina bead chip microarray data, including how to filter for lowly expressed genes after normalization with
neqc()
. In your case, I would keep probes that are expressed at least in 36 samples, which is the size of the smallest group.Your last message ends with the sentence "Distribution of ES scores". If this was intended to be the caption of some plot, the plot did not make it through, I can't see any plot. Please provide as well the exact call that you made to
gsva()
with its arguments and corresponding values and any warning or relevant message of thegsva()
function, if any.Sorry! The plot image is attached here https://ibb.co/7pT6b31
I actually did use the p value detection level in limma to filter out genes detected in at least 36 samples.
I have the expression data in a matrix form. The genes included in the gene set are in genes$V1
The titles of the plot are a bit confusing, I understand that the plots show the non-parametric density estimation of the GSVA enrichment scores for one gene set (the one you mentioned about with 182 genes, I guess), but which one of the plots is produced with the calculations made from all the data and which one is made from the subset of the data with disease samples only (corresponding to the
gsva()
function called you give before). Could you also post the whole set of instructions? (the two calls to thegsva()
function and the calls that produce the plot).Hi Robert,
Please see below:
Thanks!
Thanks, now I understand what you mean. However, the code above has an error in the call to the
grep()
function, concretely here:because
grep()
needs two arguments, I guess the actual code that should run is:since the code you gave is not reproducible makes me think that in that very step something may be going differently than you think.
could you check that the dimensions of the object
diseasedata
are the expected ones, i.e., have the number of samples corresponding to the subset of data that you are supposed to select?I'm saying this because I've made a simulation trying to recreate your data, which I'll add to my answer above, and I cannot see the effect you describe. By now, my only explanation is that you made a mistake in subsetting the data and your top plot corresponds to calculations made from the whole dataset. Please see my edited answer.
Hi Robert,
That was a typo, I was trying to replace the colnames with a more generic name, "disease", accidentally added a parenthesis there.
Thanks for the example given. Based on that, the "Subset data" ES scores distribution is unimodal at peak 0 whereas the "Subset results" ES scores distribution is unimodal peak at ~0.5.
Does this mean that gsva should be run on all the samples together instead of doing healthy and diseased separately? If I did diseased data only, the ES scores would be around zeros.
Although the method does not take into account the phenotypes info, the distribution of gene expression across samples would infer the phenotype info in calculating the enrichment scores, thus adjusting the scores? Thanks!
Hi Robert,
Just to add on top of my previous question. I think Illumina or Affy data are not normally distributed after normalization and removing lowly expressed genes, so I'll get a bimodal ES scores if I subset the healthy or diseased samples out before running GSVA. Agilent normalized data, however, are normally distributed and I can get the unimodal plot that you have up there if only I did it on agilent data.
hi Jenn,
yes, expression data is in general not normally distributed. What you select as a subset of disease samples may be formed by a very heterogeneous collection of cells. This means that you may have easily genes that are expressed in some of those disease samples and not expressed in other ones, leading to a non-unimodal distribution for your disease samples. I'm not surprised that different microarray technologies with distinct probe designs may capture those changes in different ways. Even probe annotations for a single technology change over time.
hi Jenn,
What you see in "Subset results" it's calculated exactly in the same way (check it on the code) as from "All data", but by subsetting the results you are simply zooming into the second mode.
When you give only a fraction of the data, then that's the dynamic range of expression that the algorithm is seeing with, in this case, very little variability and that's the reason why the distribution is symmetric and centered around zero.
Whether you should give as input all the samples or only a fraction of them depends on the question you want to answer. If you give to the algorithm a fraction of the data with no variability, the algorithm it's not going to capture any variability. If you give variability, it'll attempt to capture it. If that variability is important for what you're investigating, then better give it and give all the data. But I'd say this is in general for any kind of analysis, differential expression, etc.
Regarding your last question, think in terms of a single gene. When a single gene is upregulated in disease samples and downregulated in healthy samples, would you say that the gene "infers" the phenotype? I'd simply say that the gene is associated with the phenotype, but I might not have understood the question.
Hi Robert's
That's been very helpful. Thank you so much for the feedbacks and insights!