I'm semi new to LIMMA. I've already run an analysis on some proteomic data with a 2x2 factorial design, but I am trying to change it to account for the paired nature of my design. I am trying to identify proteins under the control of a particular translational mechanism, so we are using a drug that turns on the mechanism and a mutation that turns on the mechanism. Proteins that are commonly effected by the drug and mutation are candidates for further experiments.
So I have four groups of in eight vitro cell lines, Wild-type untreated, Wild-type drug treated, Over Expresser untreated, and Over Expresser drug treated. The wild type cell line was split in half for the treated and untreated, and so was the mutant treated and untreated. So I believe any tests that compare WTu to WTt should be paired (same for mutant). I have already done the 2x2 factorial design found in section 9.5 of the LIMMA manual, but combining section 9.4 and 9.5 for the paired design has not worked for me. I've read other posts on here, and I'm pretty sure that it's not working because there's some sort of linear dependency. I can't figure out how to solve the problem. Though I feel like the solution is going to be related to how I'm making my matrix or expressionsheet. Is there a way I can format my expression sheet so that it's all included in one model, or should I split up my code into doing two different models (one for drug effect and one for mutant effect)?
Here is the code that I'm trying to run:
#load and assign necessary data
M002_E <- read_excel("expresion sheet.xlsx", sheet = "Expression Data", col_names = TRUE, col_types = "numeric")
M002_P <- read_excel("expresion sheet.xlsx", sheet = "Phenotype Data", col_names = TRUE, col_types = "text")
M002_F <- read_excel("expresion sheet.xlsx", sheet = "Feature Data", col_names = TRUE, col_types = "text")
PN <- unlist(read_excel("expresion sheet.xlsx", sheet = "Sample Names", col_names = FALSE, col_types = "text"))
FN <- unlist(read_excel("expresion sheet.xlsx", sheet = "Feature Names", col_names = FALSE, col_types = "text"))
# LIMMA groundwork: create Expression Set Object
mat_M002_E <- data.matrix(M002_E, rownames.force = FALSE)
rownames(mat_M002_E) <- FN
rownames(M002_P) <- PN
rownames(M002_F) <- FN
WLPset <- ExpressionSet(assayData = mat_M002_E,
phenoData = AnnotatedDataFrame(M002_P),
featureData = AnnotatedDataFrame(M002_F))
#Quantile Normalization: check initial distribution (optional)
exprs(WLPset) <-log2(exprs(WLPset))
WLPset_norm <- WLPset
exprs(WLPset_norm) <- normalizeBetweenArrays(exprs(WLPset_norm))
study_design_drug_mut_com<- model.matrix(~0 + Group +Pairs, data = pData(WLPset_norm))
fit_drug_mut_com <- lmFit(WLPset_norm, study_design_drug_mut_com)
and I get this error
Coefficients not estimable: Pairs9
Warning message:
Partial NA coefficients for 3735 probe(s)
I wish I could make a table to show what my expression sheet and matrix look like, but I don't see an option to include that. So I'll describe it.
Expression sheet: one column labeled Group, and one column labeled Pairs. Group column has labels for which group each sample belongs to: WTu, WTt, OEu, OEt. Pairs contains a number corresponding to which pair is being specified. So a cell line sample in WTu has a "1", and the same cell line in WTt has a "1" as well. This pairs labeling continues for all paired samples, so that there are 16 pairs (32 samples total).
Model Matrix: As you'd expect there are four columns for each of my groups, with zeros in most of the rows except for those samples that belong to that group. There should be 16 Pairs columns, with zeros in all of the rows except for a one in two rows that denotes the samples that belong to that pair. However, Pair1 gets removed , so that there are 15 Pairs.
I'm beginning to think that what I'm trying to do might not be possible, and I'm going to have to split my model up into smaller models. For example, just do a paired design across my drug treated, and then do a paired design for my mutants separately. Let me know if you see a way I can do this in one go though.
Thank you.
Hello Gordon,
Thank you for the advice. I tried your suggestion using this code:
It appears to work and I no longer have error messages, but now I don't know how to specify my contrasts. My original code that ignored the pairedness of the data used these contrasts:
Without all of my groups being specified, is there a way to call on these same contrasts?
Thank you.
The Treated vs Untreated effects in WT and OE are already coefficients in the linear model. The coefficient
WT.treated
is already the comparison of WTt vs WTu (adjusted for cell line) and the coefficientOE.treated
is already the comparison of OEt vs OEu (adjusted for cell line).If you want the Interaction, then you simply take the contrast
OE.Treated - WT.Treated
.The contrast that you have called MutantEffectInWT is impossible to estimate in a paired comparison. You can't adjust for differences between cell lines and also make baseline comparisons between the genotypes.
BTW, an experimental design similar to yours is covered in Section 3.5 of the edgeR User's Guide. Advice about design matrices and contrasts is the same for edgeR as for limma.
Hello Dr. Smyth,
Thank you for the explanation. I read section 3.5 of the edgeR user guide as you suggested and that did help with clarity.
Much appreciated.
Hello again Dr. Smyth. Sorry to bring up this issue back up, but I can't figure out why my Volcano plot looks so strange. I used your suggested code as follows:
but my volcano plot does not look like a volcano. When I extract the data I want using topTable (shown above), it all looks fine. There are no logfold changes above 1 or 2, and the pvalues aren't that low either. I also tried topTable with the OE.treated and there were no unusually high fold changes or low pvalues. I have tried but can't find a reason why the new paired design would cause the volcano plot to be so strange.
The
volcanoplot
function has a 'coef' argument, just liketopTable
, and you are accepting the default, which is the first coefficient. You instead need to choose one of the other coefficients such as 'WT.treated`.Thank you Mr. MacDonald, using the coef argument worked for me. I'm a little embarrassed I didn't figure out a simple fix like that, but it's all part of the learning process I suppose.
That's the best way to learn - make little mistakes that stick in your mind later.
I didn't see earlier that you are in Ann Arbor as well. GO BLUE!
Hello Mr. MacDonald,
Sorry to bring this issue back up again, but I've been given a hard task and I need help with a bit of conceptual understanding that has to do with something Dr. Smyth said previously:
"The contrast that you have called MutantEffectInWT is impossible to estimate in a paired comparison. You can't adjust for differences between cell lines and also make baseline comparisons between the genotypes."
For context: I'm doing proteomics in LIMMA (missing data handled with other methods) on some data that was generated by a previous scientist, and I am now analyzing. I have two cell lines (mutant and WT), and each cell line was split in half and given the same drug. So I have four groups: Wild-type untreated (WT.U), wild-type treated (WT.t), Over-expresser untreated (OE.u) and Over-expresser treated (OE.t). Where WT.u and WT.t are paired, and OE.u and OE.t are paired.
The way I solved my original problem was to do two separate analysis. One with a paired analysis in which I extracted (WT.t - WT.u) in order to find proteins that respond to drug. Then one with an unpaired design in which I extracted (OE.u - WT.u) in order to find proteins that responded to change in genotype. I then compared the hits from these two analyses to find proteins that respond to drug and mutant separately. That's the goal, the drug and mutant should upregulate a certain cellular process and we want to know what proteins are being effected by that process. I don't believe the combination of the drug and mutant are very interesting for us. However, now my PI wants me to find a way to do the analysis that is more similar to a two way ANOVA design, in which it considers all the data at once, and gives overall Mutant effect and drug effect, and an interaction term, while still considering the pairedness of the data.
I've read section 3.5 of the EdgeR manual as Dr. Smyth suggested, and I don't believe that design will work for the biological question we are interested in. It's similar, but if I apply their code to my data then I believe the biological interpretation would be, "which proteins change in response to drug across different genotypes". Whereas, the question we are interested in is "which proteins change in response to drug and also genotype". I believe the section 3.5 suggestion focuses on the drug and it does not look for proteins that are effected by genotype separately from the drug. So this achieves the drug effect and interaction term that a two way anova would, but I don't see a way to extract a mutant effect.
My questions are as follows: is my interpretation of section 3.5 correct? Is there a way to do a two way ANOVA with paired data in LIMMA? Is there no possible way to "adjust for differences between cell lines and also make baseline comparisons between the genotypes"? If there isn't a way to do the analysis I'm looking for, then is that a quirk of LIMMA, or is it just statistically not possible?
Any help would be greatly appreciated. Thank you.
What Gordon is implying is that you don't have the degrees of freedom to fit a paired analysis AND fit a conventional two-way ANOVA with an interaction term. You can't even do a paired analysis and the two main effects, which is why Gordon shows in section 3.5 how to construct a design matrix for this study that will allow you to compute the treatment differences within genotype plus the interaction.
Saying you don't have sufficient df to make comparisons is essentially the same thing we (I) learned in basic algebra back whenever they taught that (like middle school? Who knows). But anyway, recall that you need N equations to solve for N variables. So you cannot solve
2x + 3y = 2
alone, as it's unidentifiable. There are an infinite number of solutions for x and y that will work. You need another equation like5x - 2y = 10
that you can solve simultaneously to get the correct values for x and y.Fitting a linear model is the same - you use linear algebra, which is just a fancy way of solving multiple equations simultaneously - and the same rules apply. If you add coefficients for each pair, you are adding enough unknowns that you can no longer identify the other coefficients that you care about.
I don't think you want to fit the model your PI wants you to fit, if the goal is to identify proteins that react similarly to the mutation as to the treatment. The interaction tests for proteins that react differently due to the treatment given the genotype, which is not the same thing. You probably just want to stratify on genotype and then compare the list of proteins, as I believe you have already done. But this post has gone way past the purpose of this site (help with technical aspects of using the software rather than statistical help). If you need statistical support, you might contact a local statistician to help.
Apologies for going past the purposes of this site, but I have a question that I believe will get us back on topic. I realized that a two way anova is not the model that I should be looking to replicate, but a mixed model would make more sense. I looked at the LIMMA manual under section 11.3 where it describes a way to do a mixed model.
Now my pairs are included as random effect, instead of being explicitly included as a separate term in the model. I believe this should be a way to skirt around the issue of being over parameterized? I'm still trying to modify the code since my results for Mutant Effect were outrageous. It found every protein in the data set to be significant. Drug effect looked reasonable though. But I wanted to ask if I'm on the right track by changing the way the model is using the pairs?
You can certainly use a GLS model instead of a conventional model with blocking, but they are really meant to do separate things.
Blocking is meant primarily to account for differences in protein abundance in the various cell types. Consider a hypothetical where a given protein is expressed at a 50% higher level in one cell type as compared to the other cells, but the difference between treated and untreated is about the same for all the cell types. By blocking on cell type, you adjust out the differences in average abundance due to cell type, and can now more accurately estimate the between-treatment differences.
There are no differences between the coefficients for a conventional linear model and a GLS model. The only difference is that the GLS is capable of more accurately estimating the standard errors because it incorporates the within-block correlation structure. In other words, a conventional linear model assumes the observations are iid (independently and identically distributed), in which case the between-sample variance is assumed to accurately estimate the population variance. But if you have samples that are not independent (e.g., aliquots from the same cell suspension) then the variability between those samples is likely to be reduced, and will tend to bias your statistic towards the alternative. A GLS incorporates the estimate of the within-block correlations in order to more accurately estimate the variance.
But remember how I said there are no differences between the coefficients for the conventional and GLS regression? Blocking will adjust for any cell-specific changes in the mean expression of a protein, but a GLS will not, and if you have large cell-specific changes, then you will tend to bias towards the null because you will incorporate those cell-specific changes into the sample variance estimate, rather than adjusting them out by blocking on cell.
Fitting the GLS makes it seem like it's better because you don't have the issues you had with blocking. But you are making assumptions with either model, and if your assumptions are not correct, then your results are likely to suffer.
Thank you for the explanation. If I'm understanding you correctly, then you are comparing two possible models where one uses blocking, and the other uses the corfit function to become a GLS. However, I thought I was including both blocking and GLS by including the "block" argument and "correlation" argument in the lmfit function. Am I misunderstanding how lmfit works?
If you need to make baseline comparisons between the genotypes (between blocks) as well as treated vs untreated (within blocks) then you need to fit a multilevel model as in Section 9.7 of the limma User's Guide.
This type of analysis takes the pairing into account but is no longer called a "paired analysis". It is equivalent to a repeated measures ANOVA. A multilevel analysis makes more assumptions than a paired analysis but allows you to make all the contrasts in your original question.