Dear Marcin,

Variance filtering should not be used at any stage of the limma
analysis. You are right to be worried by it. The Bioc posts you mention
from 2009 and 2012 were about filtering by expression level, not by
variance.

Variance filtering has only been shown to be valid and beneficial when
using ordinary t-tests. But greater benefits can be had by using the
limma empirical Bayes t-test and filtering by expression.

If you think that very small or very large variances are an issue with
your data, then you could discount them in a statistically valid way by
using the robust option of the eBayes() function in limma. Again this
will give greater benefits than ad hoc filtering by observed variances.
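
As a rough illustration of the workflow suggested above (filter on
average expression, then use the robust empirical Bayes step), here is a
minimal sketch. The simulated matrix, the two-group design and the
cutoff of 6 are all assumptions made up for the example, not values
taken from GSE48060; only lmFit(), eBayes(robust = TRUE) and topTable()
are limma functions.

```r
library(limma)

set.seed(1)

# Hypothetical data: 1000 probes x 6 arrays of log2 expression values.
n_genes <- 1000
mu  <- runif(n_genes, 4, 12)                    # per-probe mean log2 intensity
sdv <- sqrt(0.1 * 4 / rchisq(n_genes, df = 4))  # per-probe standard deviation
# mean and sd recycle down the rows, so each probe keeps its own mu and sdv
y <- matrix(rnorm(n_genes * 6, mean = mu, sd = sdv), nrow = n_genes)
rownames(y) <- paste0("probe", seq_len(n_genes))

# Hypothetical two-group design: 3 controls followed by 3 cases.
group  <- factor(rep(c("control", "case"), each = 3),
                 levels = c("control", "case"))
design <- model.matrix(~ group)

# Filter by average expression, NOT by variance.
# The cutoff is arbitrary here and would need to be chosen per platform.
keep   <- rowMeans(y) > 6
y_filt <- y[keep, ]

# Linear model fit, then empirical Bayes moderation with the robust
# option, which downweights probes with unusually small or large variances.
fit <- lmFit(y_filt, design)
fit <- eBayes(fit, robust = TRUE)

# Top probes for the case-vs-control coefficient (second design column).
top <- topTable(fit, coef = 2, number = 10)
```

Note that the filter uses only the row means, which carry no
information about the group labels or about the residual variances.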
Apart from the fact that variance filtering invalidates the limma
algorithm (or any empirical Bayes algorithm), it also worries me that
variance filtering lacks a good biological interpretation, whereas
filtering by mean expression has the clear interpretation of removing
genes that are not at worthwhile expression levels.
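
To make the first point concrete, here is a small base-R simulation of
how selecting genes on their observed variance distorts a prior variance
estimated from the survivors. It is a deliberately crude sketch: the
plain mean of the sample variances stands in for the prior, whereas
limma estimates the prior by fitting a scaled F-distribution, and the
values of d, d0 and the 70% cut are assumptions chosen for the example.

```r
set.seed(1)

n_genes  <- 10000
d        <- 4    # residual degrees of freedom per gene (assumed)
true_var <- 1    # every gene shares the same true variance

# Sample variances: true_var * chi-square(d) / d
s2 <- true_var * rchisq(n_genes, df = d) / d

# Crude prior estimate from all genes: close to the true value of 1.
mean_all <- mean(s2)

# Discard the 70% of genes with the smallest observed variance.
keep      <- s2 > quantile(s2, 0.7)
mean_kept <- mean(s2[keep])   # prior estimate from survivors: inflated

# Posterior (moderated) variances for the surviving genes, using the
# standard weighted average of prior and sample variance with an
# assumed prior degrees of freedom d0.
d0        <- 4
post_all  <- (d0 * mean_all  + d * s2[keep]) / (d0 + d)
post_kept <- (d0 * mean_kept + d * s2[keep]) / (d0 + d)

cat("prior from all genes:     ", round(mean_all, 3), "\n")
cat("prior from filtered genes:", round(mean_kept, 3), "\n")
```

Every gene here has the same true variance, so the shift in the prior
comes purely from the selection. Removing the high-variance genes
instead would push the prior in the opposite direction.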
Best wishes
Gordon
> Date: Fri, 23 May 2014 13:22:23 +0200
> From: Marcin Jakub Kamiński <marcinjakubkaminski at gmail.com>
> To: Ryan <rct at thompsonclan.org>
> Cc: genefilter Maintainer <maintainer at bioconductor.org>,
>     bioconductor at r-project.org
> Subject: Re: [BioC] genefilter vs limma - many probes filtered
>
> Hello Ryan,
> thanks for your clear elucidation on this.
> Shame to admit, but after performing some additional reading I believe
> that my question should (at least partially) have never been asked: in
> the limma guide it's advised to filter out low intensities rather than
> low variances, and more details can be found in this discussion:
> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053071.html,
> which in fact agrees with your response.
> However, I'm still unable to find any straightforward answer to the
> question about filtering by variance after the eBayes() procedure
> (https://stat.ethz.ch/pipermail/bioconductor/2012-March/043895.html,
> https://stat.ethz.ch/pipermail/bioconductor/2009-October/030062.html).
> Also, I'm still worried about such a 'beneficial' change after
> extensive filtering, especially as I didn't find any cases where >50%
> of genes have been filtered.
>
> Best regards,
> Marcin
>
>
>
> On Fri, May 23, 2014 at 5:33 AM, Ryan <rct at thompsonclan.org> wrote:
>
>> Hi Marcin,
>>
>> I believe that performing variance filtering is not compatible with
>> the empirical Bayes methods employed in limma. The point of limma is
>> to compute a moderated estimate of each gene's variance by using the
>> average variance across all genes as a prior estimate. If you filter
>> out genes based on their variance, then you will bias that prior
>> estimate, and this bias will propagate to the posterior estimates.
>> For example, if you filter out high-variance genes, limma will
>> underestimate the prior variance, and overestimate the significance
>> of your differential expression calls, which is not a desirable
>> outcome.
>>
>> It may possibly be defensible to perform variance filtering after the
>> empirical Bayes step, but I'm not sure, and you would have to ask
>> someone more knowledgeable about such matters.
>>
>> -Ryan
>>
>>
>> On Thu May 22 18:41:24 2014, Marcin Kaminski [guest] wrote:
>>
>>> Dear list,
>>> I've followed the tips regarding gene filtering at
>>> http://www.bioconductor.org/packages/release/bioc/vignettes/genefilter/inst/doc/independent_filtering.pdf
>>> when analyzing GEO data (GSE48060). In this case most probes would
>>> pass the tests (for adj.p. < .05) if I filter out roughly 70% of
>>> them based on variance, which will triple the number of positives
>>> compared to not filtering at all.
>>> (related graphic: http://i.imgur.com/RuuvRIo.png)
>>> Should I be concerned about such extensive filtering? Does it affect
>>> further analysis with limma and introduce bias? If it's a problem,
>>> what are the available solutions or diagnostics?
>>>
>>> Thanks for your help!
>>>
>>> Best regards,
>>> Marcin
>>>
>>>
>>> -- output of sessionInfo():
>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250
>>>     LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
>>> [5] LC_TIME=Polish_Poland.1250
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods
>>>     base
>>>
>>> other attached packages:
>>> [1] RColorBrewer_1.0-5 hgu133plus2.db_2.14.0 org.Hs.eg.db_2.14.0
>>>     RSQLite_0.11.4 DBI_0.2-7 AnnotationDbi_1.26.0
>>> [7] GenomeInfoDb_1.0.2 genefilter_1.46.1 matrixStats_0.8.14
>>>     limma_3.20.3 GEOquery_2.30.0 Biobase_2.24.0
>>> [13] BiocGenerics_0.10.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.42.0 IRanges_1.22.6 R.methodsS3_1.6.1
>>>     RCurl_1.95-4.1 splines_3.1.0 stats4_3.1.0 survival_2.37-7
>>>     tools_3.1.0
>>> [9] XML_3.98-1.1 xtable_1.7-3