Hi all,
I use limma to compare pathway score (from GSVA tool) between 3 conditions (normal control, early state, late state), I got no difference. Each condition have multiple samples. But when I remove early state observation from the data matrix, there are pathways significant different between normal control and late state. Would you please explain why? I thought when limma compare 3 conditions, it will pairwise compare them so if 2 conditions have different, 3 conditions also should have. I think apply limma is better than using anova and also can work on other data than microarray and RNA-seq. Thank you so much!
Thank you for reply early on Saturday morning. Sorry if the question is not clear to you. Here is my code when compare 3 conditions:
GSVA is a tool that will give me a matrix with rows instead of genes, it is pathways name. The value instead of gene expression, it is pathway score.
Regarding your original question, limma results will change if you remove one of the conditions because the residual standard deviations will change. However you should not be removing any of the groups. You should instead explore why the early-state might be more variable than the other conditions using
plotMDS()
. If there are differences in variability between conditions, you might want to usearrayWeights()
to account for that.The limma analysis of GSVA pathway scores can be improved by allowing limma to take set size into account when modelling the variance. To do that, use
trend = sqrt(setsize)
when runningeBayes()
, wheresetsize
is the number of genes in the pathway.The analysis of pathway scores isn't terribly powerful unless you have large sample sizes. Have you considered analysing the pathways directly in limma using
camera()
?Thank you so much for a detail reply! I analyze thousand of pathways so the number of genes vary a lot, so how can I use
setsize
in this case?setsize
is a vector, of length equal to the number of rows of data, i.e., the number of pathways. It gives the size of each individual pathway.If you type
?eBayes
the help page says: "trend can be a row-wise numeric vector, which will be used as the covariate for the prior variance"Thanks Gordon! I takes the pathways from MSigDB and I think they don't mention how many genes in a pathway. At the beginning, I want to compare if pathway score different between groups using anova test. Then I think limma maybe better even though people use it for microarray and RNA-seq but not pathway score.
Yes, limma is much better than univariate anova and it is used to analyse pathway scores in the GSVA vignette. I am just giving you suggestions that would do even better.
Pathways are just lists of genes in R and there are various ways to count how many genes are in each set. For example:
If you don't know how to do this with GSVA, you could ask the GSVA authors.
I see. I ran different expression at pathway level without notice that limma being used. Thank you. Will try
camera()
. Do we have anything explain a little bit easier than this? https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/cameraI adjusted to follow reply by email but it didn't work, so I have to manually check the post if someone reply.
Examples of camera workflows:
By the way, have you considered simply typing
?camera
at the R prompt to see the help page for the function? Instead of doing a Google search and reading the help page from a decade ago.Hi Gordon,
I don't have rds file but gmt file https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5/c2.all.v7.5.entrez.gmt. So may I know the equivalent function with readRDS to count the number of gene? Thanks Gordon.
Hi Chris, here the GSVA maintainer, getting the gene set lengths that match the gene sets employed with GSVA may not be obvious for many users, so here is a way to do it from the GMT file you're referring to:
now you should be able to use
setsize
as argument to thetrend
parameter in the call to theeBayes()
function from the limma-trend pipeline.still, the final gene set sizes employed by
gsva()
might be slightly different to the input ones, due to internal filtering, I've opened an issue at the GSVA GitHub repo to give a solution to this problem.Thanks Robert! I successful ran limma with
setsize
parameter and the result slightly change. So the difference in pathway score between groups also depend on the number of genes in each pathway? Gordon explained above about the difference when comparing 3 vs 2 conditions but I am still not clear. Like when I compare how salty 3 types of pizza, I got no difference but comparing the first and third type of pizza, I got difference. When comparing 3 types, it also compare the first and the third one, right?Chris, the process of comparing three conditions is exactly the same for limma as it is for anova, and the same for pizza as it is for pathway scores. It is well known that a comparison that is significant for a two-group t-test can become nonsignificant in an anova context.
You need to show the limma code you are using for testing so that we can see what you are doing. I asked you to do that 4 days ago. The way you are asking the question is so vague that we can't tell which comparison you are testing.
I also advised you to make an MDS plot. Can you please make that plot and show the results.
Hi Gordon, I have more than 2000 samples (around 1500 samples are normal control), so MDS plot is very messy.
Thanks Gordon! We can change color to distinguish between blue and black. For RNA-seq, we normalize by gene length and sequencing depth. In pathway score, I am not sure what we normalize when using:
This is not helpful. I asked you to make an MDS plot of the same pathway scores that you input to limma, but this is something different.
The plot you've posted today is showing different data than the MDS plot yesterday. I had thought that you would just change the labels into coloured dots without changing the data.
You say that you have two groups, but the plot shows only two groups.
You say that you have 2000 samples but the plot shows only around 100 points. Perhaps you have not plotted the control samples at all.
The plot shows absolutely no evidence of any differences between the two groups shown, whatever they are.
When you ask a question here on the Bioconductor forum, you are always asked to show the code that is causing you a problem. For some reason, you didn't want to do that here, even after being asked for it. I finally remembered though that you did show code 18 days ago on Biostars for what seems like the same analysis: https://www.biostars.org/p/9590349/
Seeing the code makes your problem clear, so I have edited my answer above to give the solution.
Sorry for not provide full code! I thought it was not necessary.
compare_paths
is a dataframe with around 2,000 rows, the first column isstate
(with values NC, ES, LS), other columns are pathway name.Convert value to numeric
normalized_paths <- normalizeBetweenArrays(numeric_paths, method = "quantile")
So as you suggested, I should not remove the early state groups, so when using
topTable()
, what I should put in coeff? Normally, I see people compare between 2 conditions (disease vs control, treated vs control) so I am not sure what statistic should use in 3 conditions.Hi Robert, If I don't want to use all pathway in c2 gene set but only a few pathways of interested, how can I do that? So we try to get a gmt file of only some pathways? In my case, the result when apply
trend =sqrt(setsize)
or not, doesn't change the result but want to try the good practice. Thank you so much!Hi, your
c2
object storing all the gene sets is probably alist
object, where element names are the names of the pathways given by MSigDb. Thesetsize
object is a vector of integer values storing the sizes of gene sets, in parallel toc2
, this means the first value ofsetsize
correspond to the number of genes in the first element ofc2
, and so on. Therefore, to work with a few out of those pathways in thec2
list object, you only need to _subset_ both, thec2
object and thesetsize
object, to those few pathways of interest. For that purpose you need to know either the names given by MSigDb of those few pathways or the position where they are inc2
andsetsize
. If you are unsure about how to do this, probably you need to take a course on the basics of using R.Thanks Robert!
When using
arrayWeights()
, you mean the variability within the same group, is that correct? I alway runarrayWeights()
because I naive think it would be helpful most of the time.