Hey,
what are your opinions on using limma::arrayWeights() for a LC-MS/MS proteomics data-set on biological replicates ? I performed stress treatment on six independent biological replicates of Arabidopsis leaves but the treatment was definitely not 100% homogeneous due to fluctuating in stress intensities, so i want to weight the biological replicates, but i don't want to completely exclude single replicates. Samples are isobaric labeled Highlight / Low light (in a SILAC-manner) and transformed and median normalized to log2 values.
I tried lmfit(method = "robust") which gave me much more significant hits than the standard least square fit. But judging on the raw values of the six replicates it provides to much false positives and i don't trust the given p-values.Gordon Smyth mentioned before in some posts he also doesn't really like the lmfit robust method at all and his lab instead prefers robustness (if necessary) on eBayes level.
Well, using eBayes (robust = T and/or trend = T) has very little effect on the p-value significant observations of my dataset.
But arrayWeights appears to be a quite good compromise for my data-set, but I'd appreciate some opinions how you would judge it used on a proteomics study since i haven't found a single publication that is actually using it for this case.
Additional question for experts on arrayWeights():: I'm tuning the "prior.n" parameter for arrayWeights(). Generic is "10" , but i got best results with "40" (as far as i understand this squeezes the array weights more strongly towards equality). However, documentation on this is scarce. Do you think its "legit" to tune on prior.n? or should i confirm any assumptions first?
Thanks for any help Steve
Thank you very much for your clarifications!
So in simple terms, does a small prior.n value result in stronger arrayWeighting and a large value result in less ("squeezed") weighting? So a large value would be more conservative? What would happen in the most extreme case of setting prior.n = 1? For my dataset, the adj.p-value is quite sensitive for different prior.n in the range of 1-100, but there's no clear trend. For instance, i get :
Compared to transcriptomics experiments, the number of datapoints in (my) proteomics experiment is rather small (n ~ 1200 quantified proteins) . Could that have an influence? Can i just choose prior.n = 1 to keep the prior likelihood as small as possible and maximize the weighting effect? Here's an example of the weighting output:
For replicate 5&6 the weightings don't seem to be more squeezed towards equality for larger prior.n
I know, the differences seem very small, but i haven't found any proteomics publication using arrayWeights yet, so a statisticians opinion would be great before submitting the data. Of course, I'm going to keep a constant prior.n parameter consistently for the entire data-set and I won't do "cherry picking" for any time point or sub-experiment.
As an alternative, could i just use method ="reml" for arrayWeights? If i get Ritchie et al, 2006 (Empirical array quality weights in the analysis of microarray data) correctly, the "reml" algorithm is superior to "gene-by-gene" when it comes to accuracy, but slower on computation (well, on 2021 desktop PCs this seems negligible, both algorithms finish within seconds on my data).
Thanks!
I think you're over thinking this.
When I set a default parameter in limma, I do that because I think it will be a good setting for general use, and you can take that as a recommendation. If the parameter needed to be tuned to particular datasets then the documentation would tell you that. Using the function is not supposed to be hard -- you just run it as shown in the examples. As I said, I use the default settings myself in my own work, and I use arrayWeight regularly for all sorts of data including proteomics. I don't see any reason to change my recommendations, especially as the default seems to work just fine on your data.
You seem to be concerned that you can get a tiny increase in the number of DE genes by changing prior.n on one dataset, but I don't think that provides good evidence to overturn the default. There is no guarantee that the increased number of DE genes will be repeated on other datasets or that the extra DE genes are true discoveries.
Yes that's right. The prior pulls the weights toward equality. It is as if the data contained an extra prior.n proteins for which the samples are equally precise.
Yes.
When I look at your results, the DE results seem relatively insensitive to prior.n and the trend to less DE at larger prior.n values is quite clear.
The default value of prior.n=10 was chosen because it will give good results regardless of the number of genes/proteins. One reason why I chose prior.n=10 instead of prior.n=1 was to ensure stable results even when only a handle of genes are in the dataset (less than 10) so 1200 is no trouble at all.
In general, larger prior.n does squeeze the weights towards equality.
Unless your data has NAs, you are already using reml because the method is being chosen for you automatically, unless you specifically choose another option. See the documentation ?arrayWeights. Again, using the defaults will give the best result.
The choice between reml and gene-by-gene is different now from in 2006. If your data has no NAs or prior weights then reml is now faster than gene-by-gene as well as better. REML only becomes slower in the presence of prior weights (for example from voom). If your data has NAs then gene-by-gene is better because reml isn't implemented for NAs and hence rows with NAs will simply be removed (with a warning). By default, arrayWeights makes this choice for you automatically.
Steve, I've edited this reply just now. Amongst other things, I realized I should have agreed with your first question quoted above. I had read it the wrong way around.
Thanks for your detailed answers Gordon, you convinced me to just run arrayWeights with your proven default parameters and straightforwardly cite your 2006 paper, reviewers should be satisfied. Even if tuning prior.n could give me some more positives (true or false..).
It's probably one of my hardest lessons to learn during PhD (and i think for many others) to not focus too much on "interesting" hits that slightly fail given significant thresholds and don't waste time on them or blame statistics for it, but to rather focus on the big picture.
Just a final question: I understand the "reml" algorithm discards observations containing any NA while the "Gene-by-Gene" algorithm can somehow deal with them. But wouldn't it make sense to discard any NA observations at all before applying arrayWeights, because you can't make a statement or weight on how good an Array correlates with the others based on an observation (datapoint) that contains an NA?! Isn't it more reliable to weight on (for my dataset) the 800 "core/abundant/reliable" proteins that contain no missing values using the "rembl" instead of the 1200 proteins that include NAs for some replicates (and may therefore not be so reliable) using the "gene-by-gene".
Best S
OK, until now I wasn't clear that your data had NAs, because there is more than one way to treat non-detected proteins from mass spec and they don't all lead to NAs. Believe me, the way you decide to process non-detected proteins in a proteomics experiment will make vastly more difference to your results than the choice of prior.n.
Proven is too strong a word. I've just chosen values that make sense on general principles and have given sensible results on a range of datasets.
There's nothing magic about the 0.05 FDR cutoff. You are free to use FDR < 0.1 (for example) and to interpret the extra hits as long as you clearly state in your paper the cutoff you used.
Not in general. What if you have 100 samples and every single protein contains one NA? Would you then throw every single protein out? One could argue that the proteins that contain NAs are exactly the ones that show the sort of variability that you want arrayWeights to downweight.
Anyway, there is no correct answer. It would be reasonable for you to choose to use "reml" but I can't advise you on it. You have to take responsibility for the choice yourself. In my opinion, introducing NAs into proteomics data is intrinsically satisfactory. There's no way to make perfect downstream decisions about NAs when the data is ambiguous to start with.