Hi all,
I have a question I can't figure out and hope you can help. I use the GSVA (Gene Set Variation Analysis) package to calculate pathway scores, then compare the scores between 3 groups using limma. When I use a few thousand pathways, the raw p-value of pathway A is different (bigger) than when I use only a few pathways. I thought it should be the same because each pathway is independent (the adjusted p-value changes because the number of comparisons changes). Basically, it is an ANOVA-like F-test. I can't use DESeq2 because it applies to count data only. I also ran aov() manually for each pathway, and the weird thing is that in some cases the p-value is almost the same, in some cases larger, and in some cases smaller. Do you have an explanation, and which practice should I follow in this case? Thank you so much!
Thanks ATpoint for your advice. I just don't understand the math behind the change in the raw p-value in this case. I think it is important to understand because it helps me decide which way to go. It is like testing whether saltiness differs between 3 types of pizza: I get a raw p-value. Why would additionally testing whether saltiness differs between 3 types of chicken change the raw p-value for pizza? Naively, I would think pizza and chicken are independent.
It seems you are assuming that the limma results for a particular gene (or pathway, if you want) are derived using only the data from that very gene or pathway, and that's not true. limma uses an empirical Bayes procedure in which data from the whole dataset are used to get more robust estimates of the variability of each gene, or of each pathway when you feed limma with GSVA scores. For that reason, the results may change depending on what data you feed into limma. ATpoint already pointed you to a previous post discussing this matter. Ultimately, if you want to understand the math, you'll have to read the papers.
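To make the shrinkage idea concrete, here is a minimal Python sketch; it is not limma code and makes several simplifying assumptions. In limma's empirical Bayes step, the variance of each feature is replaced by a moderated variance, (d0 * s0^2 + d * s^2) / (d0 + d), where s^2 and d are the feature's own residual variance and degrees of freedom, and the prior values s0^2 and d0 are estimated from all features together. In the sketch, the simulated data, the fixed prior degrees of freedom `D0`, and the use of a plain mean for the prior variance are all assumptions for illustration; limma estimates both prior quantities with its own procedure.

```python
import random
import statistics

def residual_variance(groups):
    """Pooled within-group variance (the one-way ANOVA mean square error)."""
    n_total = sum(len(g) for g in groups)
    ss = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    return ss / (n_total - len(groups))

def moderated_variance(s2, d, s0_sq, d0):
    """Shrink a per-pathway variance s2 toward the prior variance s0_sq."""
    return (d0 * s0_sq + d * s2) / (d0 + d)

def prior_variance(pathways):
    # Simplification for illustration: plain mean of the residual variances.
    # This is NOT limma's estimator of the prior variance.
    return statistics.fmean([residual_variance(p) for p in pathways])

random.seed(1)
n_groups, n_per_group = 3, 5
d = n_groups * n_per_group - n_groups  # residual degrees of freedom per pathway

def simulate_pathway(sd):
    """Hypothetical pathway scores: 3 groups of 5 samples, no group effect."""
    return [[random.gauss(0.0, sd) for _ in range(n_per_group)]
            for _ in range(n_groups)]

pathway_a = simulate_pathway(0.2)
few = [pathway_a] + [simulate_pathway(0.1) for _ in range(4)]      # 5 pathways
many = [pathway_a] + [simulate_pathway(0.4) for _ in range(2000)]  # 2001 pathways

D0 = 4.0  # hypothetical prior degrees of freedom (limma estimates this from the data)

s2_a = residual_variance(pathway_a)
mod_few = moderated_variance(s2_a, d, prior_variance(few), D0)
mod_many = moderated_variance(s2_a, d, prior_variance(many), D0)

print("pathway A, own variance:          ", s2_a)
print("moderated, fitted with 5 pathways:", mod_few)
print("moderated, fitted with 2001:      ", mod_many)
```

The data for pathway A are identical in both runs, yet its moderated variance differs because the prior is learned from whichever pathways are in the input. The moderated F-statistic divides by this variance, so the raw p-value moves with it, while aov() uses only the pathway's own variance and therefore does not change.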
Thanks Robert! Please correct me if I am wrong: so using more pathways (feeding more data into limma) gives us a more accurate raw p-value?
Yes, it is important that you feed all the pathways into limma so that it can robustly estimate the variances, even if you are only going to use the results for a few selected pathways.