Question: Filtering is not recommended with LIMMA?
4.3 years ago, Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:
Dear Wolfgang,

With all respect, I meant exactly what I said. You have taken the discussion out of context, and some of your claims are wrong in my opinion.

On Sun, 26 May 2013, Wolfgang Huber wrote:

> Dear Gordon
>
>> The literature tends to say that the reason for filtering is to reduce
>> the amount of multiple testing, but in truth the increase in power from
>> this is only slight. The more important reason for filtering in most
>> applications is to remove highly variable genes at low intensities.
>> The importance of filtering is highly dependent on how you
>> pre-processed your data. Filtering is less important if you (i) use a
>> good background correction or normalising method that damps down
>> variability at low intensities and (ii) use eBayes(trend=TRUE) which
>> accommodates a mean-variance trend.

You have taken out of context one paragraph from my reply to Miriam: .html

I was answering a specific question about the limma package, but you have lost that context. You don't even include the date of the post you are replying to.

> With all respect, I think this paragraph mixes up two separate issues
> and can benefit from clarification.
>
> 1. While literature can probably be found to support any statement, the
> above-cited reason is indeed bogus when multiple testing is performed
> with an FDR objective.

Not bogus. Just less important than some other considerations.

> The paper by Bourgon et al. motivates filtering differently, namely by
> using a filter criterion that is independent of the test statistic under
> the null (thus does not affect type-I error; some subtlety is discussed
> in that paper) but dependent under the alternative (thus improves
> power).

This is a good time to recall that the question was about filtering with the limma package, not about filtering in conjunction with t-tests or permutation tests. Your paper (Bourgon et al) provides no motivation for filtering in conjunction with limma.
Quite the opposite, your paper concludes (incorrectly IMO) on its final page that limma needs to be used unfiltered.

In reality, filtering low intensity probes (not low variance probes) is usually of benefit to limma, and we do this routinely for nearly all analyses in my lab. This is for a number of reasons.

First there is the generic (not specific to limma) reason that probes that are not detecting real signal to any worthwhile degree for any sample cannot be detecting DE to any worthwhile degree. Therefore there is a positive correlation between mean log intensity and true DE.

Second there is the limma-specific reason that probes that are not detecting signal above background levels in any sample tend to have atypical variances, both in absolute size and in terms of mean-variance relationship, compared to probes that are responding to genuine biological signal. In other words, non-expressed or dead probes have variances that cannot be considered to be sampled from the same population as the variances for regular expressed probes. It is desirable to get rid of these atypical probes so that limma can concentrate on the behaviour of probes of genuine interest.

Filtering by mean log-intensity does not cause any problems for the limma probabilistic model. Indeed it generally improves concordance with the empirical Bayes assumptions.

> 2. "Highly variable genes at low intensities" are indeed a problem of
> bad preprocessing and are better dealt with at that level, not by
> filtering.

I agree in most cases, but it's not universally true. Pre-processing methods that damp down variability at low intensities also tend to attenuate fold changes. In some applications it can be legitimate to allow higher variability at low intensities in order to maintain dynamic range in the fold changes. voom is one such application where the preprocessed and normalized expression values are deliberately kept more variable at the low end than the high end.
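As a rough illustration of the intensity-based filtering described in this reply, here is a minimal sketch in R. It assumes the limma package is installed. The simulated matrix `y`, the two-group design and the cutoff of 6 are all invented for illustration and are not taken from the original discussion; in practice the cutoff would need to be chosen from the data at hand.

```r
library(limma)

set.seed(1)
n_probes <- 2000
n_samples <- 6

# Hypothetical log2 expression matrix: each probe (row) has its own
# average intensity and its own standard deviation.
avg <- runif(n_probes, min = 4, max = 14)
probe_sd <- runif(n_probes, min = 0.2, max = 0.4)
y <- matrix(rnorm(n_probes * n_samples, mean = avg, sd = probe_sd),
            nrow = n_probes)
rownames(y) <- paste0("probe", seq_len(n_probes))

# Two groups of three samples each.
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

# Filter on mean log-intensity (not on variance).
# The cutoff is an assumption made for this example only.
cutoff <- 6
keep <- rowMeans(y) > cutoff
y_filt <- y[keep, , drop = FALSE]

# Linear model fit followed by empirical Bayes moderation that
# accommodates a mean-variance trend.
fit <- lmFit(y_filt, design)
fit <- eBayes(fit, trend = TRUE)

# Top-ranked probes for the group B versus group A coefficient.
top <- topTable(fit, coef = 2, number = 10)
```

The filter here looks only at the row means of the expression matrix, so it removes low-intensity probes without reference to their variances or to the group comparison.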
> Nowadays, the commonly used methods for expression microarray or RNA-Seq
> analysis that I am aware of avoid that problem.

Yes, the high variability is gone but the non-expressed probes are still atypical. With most commonly used methods, the non-expressed probes now have atypically small variances. For example, the RMA algorithm (used in your paper) yields a mean-variance relationship that increases at low intensities then decreases again at high intensities. The lowest intensity probes have variances almost zero. This effect is even stronger using the vst algorithm for Illumina BeadArrays (you are an author of the vst paper). This method typically generates a very pronounced (increasing) mean-variance trend for probes at very low levels. Anyone can see this by using the plotSA() function in limma to plot the mean-variance relationship.

Atypically low variances diminish the potential benefits of the empirical Bayes algorithm just as atypically large variances do, so the benefit that derives from filtering non-expressed probes remains.

The reason I worded my post in terms of high variances was simply because the strongest and most frequent arguments for filtering were made over 10 years ago when large variances were common.

> 3. The question when & how independent filtering (as in 1) is beneficial
> is quite unrelated to preprocessing.

I strongly disagree. The benefit that may or may not come from filtering is intimately connected to the behaviour of the data, especially to the mean-variance trend, and this depends intimately on the platform and on the preprocessing.

Sincerely
Gordon

> You are right that FDR is a property of the whole selected gene list,
> not of individual genes, and that different approaches exist for
> spending the type-I error budget wisely, by weighting different genes
> differently; of which independent filtering is one and trended eBayes
> (which is not the default option in limma) may be another.
>
> Best wishes
> Wolfgang
>
> Reference:
> Bourgon et al. PNAS 2010:
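
The mean-variance relationship mentioned in the reply can be inspected with plotSA(). The sketch below assumes limma is installed and uses simulated data in which low-intensity probes are given artificially small standard deviations, loosely mimicking the behaviour described for non-expressed probes. All numbers, including the threshold of 6, are invented for illustration.

```r
library(limma)

set.seed(2)
n_probes <- 2000
n_samples <- 6

# Hypothetical data: probes with average log2 intensity below 6 play the
# role of non-expressed probes and get an atypically small standard deviation.
avg <- runif(n_probes, min = 4, max = 14)
probe_sd <- ifelse(avg < 6, 0.05, 0.3)
y <- matrix(rnorm(n_probes * n_samples, mean = avg, sd = probe_sd),
            nrow = n_probes)

# Two groups of three samples each.
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

fit <- lmFit(y, design)
fit <- eBayes(fit, trend = TRUE)

# Plot the residual standard deviations (on a square-root scale)
# against average log-expression.
plotSA(fit)
```

In a plot like this, a cluster of low-intensity probes sitting well below the rest is the kind of atypical behaviour that would suggest filtering before the fit.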