First, this is the task I need to finish:
Download two GMT files from MSigDB: the Hallmark gene sets, and the Canonical Pathways collection under C2 (curated gene sets).
Prepare two data inputs: one is the raw counts gathered from the STAR alignment results; the other is the transformed data produced from the raw counts with the DESeq2 package. The transformation is $\log_2(n+1)$.
Get enrichment scores by calling the gsva() function on both inputs and both gene-set collections, once with the kernel CDF set to Poisson and once with it set to Gaussian.
Then feed the GSVA scores to the limma package and fit a linear model with the lmFit() function. Then select pathways with an adjusted p-value smaller than 0.01 as the significant gene sets. This step follows what the GSVA vignette does in section 6.2.
Do a comparison: compare the significant sets that overlap between the two inputs and those that do not. Then choose the 5 sets with the smallest adjusted p-values that are significant for only one of the two data inputs.
Plot heatmaps of the raw counts and the transformed data on those sets to see whether any patterns can explain the differences in p-value and significance.
Questions and Queries
Overall, I want to find out why the transformation can lead to differences in p-value.
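As far as I understand it, part of the answer must be what $\log_2(n+1)$ does to the data: it preserves the ordering of counts within a sample, but it compresses differences between large counts. A small illustration with made-up counts:

```python
import math

counts = [0, 1, 10, 20, 1000, 2000]
logged = [math.log2(n + 1) for n in counts]

# The transform is strictly monotone, so within-sample ordering is preserved.
assert logged == sorted(logged)

# But distances are not preserved: a gap of 10 at low counts and a gap of
# 1000 at high counts become almost the same size on the log scale.
gap_low  = logged[3] - logged[2]   # log2(21) - log2(11), roughly 0.93
gap_high = logged[5] - logged[4]   # log2(2001) - log2(1001), roughly 1.0
print(gap_low, gap_high)
```

So the raw and transformed inputs have very different spreads even though they rank the samples the same way, which already hints at why a distribution-sensitive method can give different p-values on the two inputs.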
When I change the kernel from Poisson to Gaussian in the gsva() function, I can see the enrichment scores change. Why do the scores change, and why do some of them even flip from positive to negative?
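To convince myself the kernel itself matters, I wrote a minimal pure-Python sketch of the two per-gene CDF estimates described in the GSVA paper (Gaussian kernel with bandwidth $s_i/4$, Poisson kernel shifted by $r = 0.5$); the toy counts are made up, and this is only my re-implementation of the formulas, not a call into the package:

```python
import math
from statistics import stdev

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pois_cdf(x, lam):
    # P(X <= x) for X ~ Poisson(lam), x a non-negative integer
    term, total = math.exp(-lam), 0.0
    for k in range(int(x) + 1):
        total += term
        term *= lam / (k + 1)
    return total

def gsva_cdf_gaussian(x, sample):
    # Gaussian kernel estimate with bandwidth s/4, as in the GSVA paper
    h = stdev(sample) / 4.0
    return sum(norm_cdf((x - xk) / h) for xk in sample) / len(sample)

def gsva_cdf_poisson(x, sample, r=0.5):
    # Poisson kernel estimate with continuity shift r, as in the GSVA paper
    return sum(pois_cdf(x, xk + r) for xk in sample) / len(sample)

gene_lo = [1, 2, 3, 4]       # low counts
gene_hi = [10, 20, 30, 40]   # the same shape, 10x the scale

# The Gaussian estimate adapts its bandwidth to the gene, so pure rescaling
# changes nothing:
g_lo = gsva_cdf_gaussian(3, gene_lo)
g_hi = gsva_cdf_gaussian(30, gene_hi)

# The Poisson estimate depends on absolute counts, so the values differ:
p_lo = gsva_cdf_poisson(3, gene_lo)
p_hi = gsva_cdf_poisson(30, gene_hi)
print(g_lo, g_hi, p_lo, p_hi)
```

Because these per-gene values are then compared across genes within each sample, even small per-gene discrepancies between the kernels can reorder genes, and a reordered gene list can turn a positive enrichment score negative downstream.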
How should I interpret the linear models in the limma package? The coefficients are explained as fold changes, but how do I explain a fold change of GSVA scores (which can be positive or negative)? What is the meaning of the coefficients, and how do I interpret the p-values? This is not like an ordinary linear model, where the coefficients are used for prediction and the t-statistics and p-values show whether the coefficients could be 0. How does that carry over to this case, where the predictors are factors? This also leads to my next question.
In applying the limma functions there are two possible design matrices, one with an intercept and one without. How should I understand the differences between these two settings in terms of coefficients and p-values? In particular, what is the p-value for the intercept term?
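For my own intuition I reduced this to a toy two-group example with made-up GSVA scores (this ignores limma's empirical-Bayes moderation entirely and only shows what the coefficients of the two designs estimate; for a saturated two-group model the least-squares coefficients are exactly the group means):

```python
# Toy data: GSVA scores for one gene set, 3 control + 3 treated samples
scores = [-0.2, -0.1, 0.0, 0.3, 0.4, 0.5]
group  = [0, 0, 0, 1, 1, 1]   # 0 = control, 1 = treated

mean_ctrl = sum(s for s, g in zip(scores, group) if g == 0) / 3
mean_trt  = sum(s for s, g in zip(scores, group) if g == 1) / 3

# Design WITH intercept (~ group): the coefficients are
#   (intercept) = control-group mean, (group) = treated mean - control mean,
# so the intercept's p-value tests "is the control mean 0", which is rarely
# the question of interest.
beta_with = (mean_ctrl, mean_trt - mean_ctrl)

# Design WITHOUT intercept (~ 0 + group): the coefficients are the two group
# means themselves, and the between-group difference is obtained afterwards
# as a contrast of the two coefficients.
beta_without = (mean_ctrl, mean_trt)

print(beta_with)      # approximately (-0.1, 0.5)
print(beta_without)   # approximately (-0.1, 0.4)
```

In both parameterisations the same between-group difference (here 0.5) is recoverable, so the p-value for that difference is the same; only the meaning of each individual coefficient changes.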
If, overall, this way of choosing significant sets doesn't work theoretically, then how should I choose significant gene sets? Which test and which package should I use? And why does the GSVA tutorial select interesting gene sets this way?
Previously I made heatmaps of the sets that performed differently, but I couldn't see any clear patterns from which to draw a conclusion; that is why I have all these questions. Another question: what makes a good heatmap, and what should we expect to see in one?
My guess is that what leads to the changes in the GSVA scores is the empirical cumulative distribution function step in GSVA. However, I can't find a way to prove this guess, because the ECDF calculation happens inside gsva(), so I can't call it myself; for the different kernels it uses different formulas, not just ecdf(). Different values will lead to different rankings, so the KS statistic will certainly change. But does the transformation make sense? Does this calculation make sense?
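To illustrate the mechanism I suspect, here is a simplified, unweighted KS-style running-sum score (a sketch of the idea only, not GSVA's exact weighted statistic): moving the set genes from the top of a per-sample ranking to the bottom flips the sign of the score, which is exactly the kind of flip a changed ECDF could produce.

```python
def ks_like_score(ranking, gene_set):
    # Walk down the ranked gene list, stepping up for set members and down
    # otherwise; return max positive deviation plus max negative deviation
    # (GSVA's default scoring combines both tails in a similar spirit).
    n, m = len(ranking), len(gene_set)
    up, down = 1.0 / m, 1.0 / (n - m)
    running, max_pos, max_neg = 0.0, 0.0, 0.0
    for g in ranking:
        running += up if g in gene_set else -down
        max_pos = max(max_pos, running)
        max_neg = min(max_neg, running)
    return max_pos + max_neg

gene_set = {"g1", "g2"}

# Set genes near the top of the ranking -> positive score
s_top = ks_like_score(["g1", "g2", "g3", "g4", "g5", "g6"], gene_set)
# Same genes near the bottom (e.g. after a different kernel or a
# transformation reshuffles the ranking) -> negative score
s_bottom = ks_like_score(["g3", "g4", "g5", "g6", "g1", "g2"], gene_set)
print(s_top)      # 1.0
print(s_bottom)   # -1.0
```

So a sign flip in the enrichment score does not require any dramatic change in the data, only enough reordering of genes around the middle of the ranking.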
I know from posts on the Bioconductor support forum that the interpretation of logFC for GSVA scores is tricky, and that people use p-values instead because they have a precise interpretation. But does this p-value cutoff method make sense at all?
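For concreteness, the cutoff step itself is just Benjamini-Hochberg adjustment followed by thresholding; a minimal re-implementation (intended to match R's p.adjust(p, method = "BH") on this made-up toy input) shows there is nothing mysterious in that part:

```python
def bh_adjust(pvals):
    # Benjamini-Hochberg adjustment: sort p-values, scale p_(k) by n/k,
    # then take a running minimum from the largest p-value downwards,
    # mirroring R's p.adjust(p, method = "BH").
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for idx_from_top, i in enumerate(order):
        rank = n - idx_from_top          # rank in ascending order of p
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.0001, 0.004, 0.03, 0.5]     # made-up per-pathway p-values
adj = bh_adjust(raw)
significant = [i for i, p in enumerate(adj) if p < 0.01]
print(adj)
print(significant)   # [0, 1]
```

The questionable part, then, is not the multiple-testing correction but whether the moderated t-test on GSVA scores is the right test in the first place.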
Since these are the gene sets with the smallest p-values, i.e. the most significant for our data, we should expect to see some differences in the heatmaps. I can't find those clear differences, which is why I am confused.
I suspect this is not the right way to choose significant sets, but I can't figure out why the tutorial uses it, or why it doesn't work in my case.
I can provide any code if necessary. I have read the GSVA paper, but I just can't completely understand the whole thing; is that down to my lack of patience or intelligence? I hope I have expressed my questions clearly, and if anything is confusing, I will try my best to explain further.