Limma weights, non-specific interactions
@andrewbirnberg-16677
Last seen 2.6 years ago

Hi all,

I am developing an analysis of protein microarray data and have found that limma performs very well at identifying the positive controls in the assay. Unfortunately, there are a large number of known, non-specific interactors that have been identified by others in earlier experiments. For the pipeline, I would like to down-weight, but not exclude, these proteins from the analysis. I have an idea, but I'm not sure if it's a good one:

The thought is to add a value to the intensity of a probe that depends on its average across past experiments. This would essentially maintain the variance while reducing the fold changes between groups for that probe, placing a higher burden of proof on that interaction before it is called real.

The problem is that I won't be seeing true fold changes anymore when I look at my output and that seems pretty bad. Does anyone out there have any other solutions to this sort of problem or can think of a way to moderate the t values by incorporating old data? Or is there some feature of limma/eBayes that could handle probe weights when calculating p-values?

Any thoughts would be appreciated!

Thanks, Andrew

limma nonspecific filtering weight

It is unclear to me what you mean by a "non-specific interactor". Interacting with what? Non-specific to what? Are you saying that there are proteins on your array that always come up at DE even if they have no relationship to the treatment conditions?


Sorry for my delay in responding. The array is a functional protein array: interactions with the "prey" (the proteins printed on the array) are detected by an antibody to the "bait" (the protein applied to the array). A non-specific interaction would be caused by "stickiness" of the prey. In general, we would expect this to be roughly independent of treatment conditions, which, as I believe you're hinting, should prevent it from coming up as DE in the analysis. Unfortunately, I am told by the people doing the assay that it doesn't work this cleanly: some of the sticky proteins are somewhat unstable and do not react identically across arrays, so they might appear to be specific interactions when they are merely behaving badly on a particular array.

I guess one way to say this is that sticky proteins have a higher-than-expected variance compared to other proteins at the same signal intensity. I'm not sure if this makes sense to people with more experience in this area of statistics, but it's how I'm thinking about it at the moment.

Based on this idea, another way to handle this situation might be to use all available prior arrays when calling lmFit by adding a "background" condition to the design matrix, which is only there to provide more information when calculating and shrinking variances. Then we would get the benefit of hindsight but still only calculate the contrasts we care about based on the current experiment's arrays.
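For what it's worth, here is a minimal sketch of what I'm imagining (the object `y`, the group labels, and the number of arrays are all made up for illustration):

```r
library(limma)

## Sketch only: 'y' is a normalized log-expression matrix whose columns are the
## current arrays plus historical "background" arrays (hypothetical layout).
group <- factor(c("ctrl", "ctrl", "trt", "trt", "bg", "bg", "bg"))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

fit <- lmFit(y, design)

## Only test the contrast we actually care about; the background arrays just
## contribute extra residual degrees of freedom to the variance estimates.
cm <- makeContrasts(trt - ctrl, levels = design)
fit2 <- eBayes(contrasts.fit(fit, cm))
topTable(fit2)
```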

Is this reasonable?

Thanks!


Yes, you understood what I was hinting at. Adding background arrays can be useful in some circumstances but is probably unnecessary here. I'll write a short answer.

Aaron Lun ★ 27k
@alun
Last seen 18 hours ago
The city by the bay

The main problem here is that it's hard to convert a vague sense of "I don't like these proteins" into a concrete, quantitative weight. You say that you want to downweight the problematic interactors, but by how much? What does it mean to be twice as problematic? How many angels dance on the head of a pin?

Now let's consider your proposed approach. It seems that you're proposing to add a constant value - dependent on the average intensity of the probe from past experiments - to all observations for a probe. If this is done on the log-scale, then it will have no effect on anything (trend-based shrinkage aside), as the added value will cancel out when computing variances or log-fold changes. If this is done on the raw scale, then you're adding an arbitrary pseudo-count that will affect both the log-fold changes and variances of the log-values - not good for empirical Bayes shrinkage. More generally, there is no sound rationale for adding the average intensity. Why not half the average? Or the square root of the average? Or any number I might pull out of thin air?

The better approach is to make use of the definition of these problematic interactors. You say that these were identified as being non-specific in previous experiments. If this is non-specific DE, presumably you have log-fold changes for these proteins from previous experiments as well. This means you can test for changes in expression beyond this non-specific log-fold change using the treat function. Some testing suggests that lfc can support numeric vector inputs, so you can give a different lfc per gene. If the log-fold changes are larger than what would be expected due to non-specific activity, then there's probably a genuine change in expression.
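Something along these lines, where `sticky.ids` and `historical.lfc` are hypothetical objects holding the identities of the known non-specific probes and their historical absolute log-fold changes:

```r
library(limma)

fit <- lmFit(y, design)        # 'y' and 'design' as in a standard analysis
fit <- contrasts.fit(fit, cm)  # 'cm' is your contrast matrix

## Per-probe thresholds: the historical non-specific |logFC| for the known
## sticky probes, and a small default everywhere else.
lfc.vec <- rep(log2(1.1), nrow(fit))
names(lfc.vec) <- rownames(fit)
lfc.vec[sticky.ids] <- historical.lfc[sticky.ids]

tfit <- treat(fit, lfc = lfc.vec)
topTreat(tfit, coef = 1)
```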

Or you could take the safe route and just discard the problematic interactors. This would be my preferred approach if there aren't too many of them. Why take the risk of reporting them when they're known to be non-specific? I know my collaborators would be pretty skeptical.


I agree, I would be making some very arbitrary choices here that could affect the analysis in unpredictable ways. I really like the idea of using a vector argument to lfc and will play around with it. Removing the problematic interactors wouldn't work in our case, since there may still be some biological rationale for non-specific interactions having importance. My favorite idea, though I'm not really sure how to build it, is a probabilistic graphical model that outputs a probability of differential interaction and can take advantage of prior data. There are a couple of papers out there on hierarchical Bayesian models for DE, but nothing I've found that uses informative priors.

@gordon-smyth
Last seen 30 minutes ago
WEHI, Melbourne, Australia

From your description, it sounds like you are simply worried about probes with large variances. limma already downweights probes with large variances. Belinda Phipson and I have even written a paper specifically about how to deal with probes that have particularly large variances (which we call "hypervariable" genes):

Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression

and have implemented a method for dealing with them efficiently through the limma empirical Bayes procedure. To activate this strategy, specify robust=TRUE when running eBayes() or treat().
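In code, this is just one extra argument (assuming `y` and `design` from a standard limma analysis):

```r
library(limma)

fit <- lmFit(y, design)
fit <- eBayes(fit, robust = TRUE)  # robust hyperparameter estimation
topTable(fit, coef = 2)

## Or, with a fold-change threshold as well:
tfit <- treat(lmFit(y, design), lfc = log2(1.5), robust = TRUE)
topTreat(tfit, coef = 2)
```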

So limma already does what you want. It will generally deal with hypervariable probes very sensibly, requiring them to show a much larger fold change than other genes before they are counted as DE.

If you have prior data that truly represents the variability of the different probes, then you could consider adding it to your experimental design and analysing it at the same time as your data of interest. That assumes, however, that the probes will have the same underlying variances in the new experiment as in the old, which I think is a very strong assumption. This will usually be unnecessary and a straightforward analysis will be enough. Generally speaking, attempts to micro-manage the process will just get in the way of limma doing its job.