Hi Martin,

I'm one of the people who sent Kaushal to bioc-list, both for a second
opinion and to create a public record of these discussions. Thank you
for responding -- I feared that I'd led the original poster astray! I
hope you will take the following comments in the spirit in which they
are intended, namely, as fodder for discussion rather than criticism.

> Highly variable probes are not necessarily differentiating between
> groups (if that is what you are aiming at).

This is an odd assertion, which perhaps I am misunderstanding. However,
without some degree of variation to partition, it is difficult if not
impossible to determine what is biological or technical in origin, and
what is condition-specific. Moreover, the smaller the variance, the
more likely it is to be technical (rather than biological) in origin.
Can you provide an example of a relevant biological difference where
overall variability would mask the effect?

> [ re: Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G,
> Fuks F. A comprehensive overview of Infinium HumanMethylation450 data
> processing. Briefings in bioinformatics. 2013. and similar... ]

I'd like to suggest
http://biorxiv.org/content/biorxiv/early/2014/02/23/002956.full.pdf ,
which beautifully disposes of a great deal of the BS surrounding
normalization of 450k data. If there is a more relevant treatment of
general-purpose preprocessing and normalization on Illumina 450k (or
27k, for that matter) data, I haven't seen it. The fundamental problem
with many (most?) of the other reviews is their limited scope,
editor-appeasing benchmarks, and a tendency to replace objective
comparisons with so-called "expert opinion". JP Fortin's paper avoids
all that. Do have a look!

> I assume you used M values for filtering, not beta?

While the quasi-linearizing effects of the logit transform lend
themselves well to the assumptions we like to make as statisticians
(the multivariate normal distribution does possess many useful
mathematical properties), it's not at all clear to me that the
fold-change associations "discovered" at individual loci are always
worth noting. When they hold up across multi-locus "bumps" after
normalization, on the other hand, the findings tend to be more
interesting. Paradoxically, though, in exactly those cases beta values
often find the same bumps. There are many reasons to believe that, in
general, these "bumps" are the biologically relevant quantity of
interest.
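
For anyone following along who hasn't seen the two scales side by side:
the M-value is just the base-2 logit of beta, M = log2(beta / (1 - beta)).
A minimal sketch in plain Python follows. The function names are my own
(not from minfi or any other package), and the eps clamp is an
assumption I added so that beta = 0 or 1 doesn't give an infinite M.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a beta value in [0, 1] to an M-value (base-2 logit).

    eps is an illustrative clamp (an assumption, not a standard) that
    keeps beta = 0 or 1 from producing infinite M-values.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transform: M-value back to a beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)

if __name__ == "__main__":
    # Equal 0.03 steps in beta, near the extreme and near the middle.
    for beta in (0.02, 0.05, 0.50, 0.53):
        print(f"beta={beta:.2f}  M={beta_to_m(beta):+.3f}")
```

The step from beta 0.02 to 0.05 spans more than one M unit, while the
same-sized step from 0.50 to 0.53 spans only a fraction of one. That
stretching of the tails is part of why the choice of scale matters when
technical noise sits near 0 or 1.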

In any event, absent a basis for such bump-hunting a priori, and when
pretending that individual CpG loci are independent, concentrating on
the most variable loci (on either scale, believe it or not) after
preprocessing and normalization seems to increase power to detect real
biological differences for the same reason as it does on expression
arrays: the closer you get to the limit of detection, the more likely
you are to see spurious results. The further you are from the limit of
detection, the more likely you are to see higher overall variance (or
MAD, or whatever) relative to the population. If you have a lot of
strong confounders, of course, you'll have a different set of problems;
but then you might ask why the experiment was designed in such a
fashion if that were the case.
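
To make "most variable loci" concrete, here is a rough sketch of a
spread-based filter in plain Python. It is not genefilter's
implementation; the function names, the dict-of-lists layout and the toy
probe names are all hypothetical, and the 0.5 default simply mirrors the
50% cutoff mentioned in the original question.

```python
import statistics

def mad(values):
    """Median absolute deviation (unscaled) of a list of numbers."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def top_variable(probes, fraction=0.5, spread=mad):
    """Return the names of the most variable probes, most variable first.

    probes   -- dict mapping probe name -> list of per-sample values
    fraction -- share of probes to keep
    spread   -- variability measure, e.g. mad or statistics.variance
    """
    ranked = sorted(
        probes,
        key=lambda name: spread(probes[name]),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

if __name__ == "__main__":
    # Toy beta values; the probe names and numbers are made up.
    toy = {
        "cg_a": [0.10, 0.12, 0.80, 0.85],
        "cg_b": [0.50, 0.51, 0.49, 0.50],
        "cg_c": [0.02, 0.03, 0.02, 0.03],
        "cg_d": [0.30, 0.60, 0.35, 0.65],
    }
    print(top_variable(toy))
    print(top_variable(toy, spread=statistics.variance))
```

Note that the ranking never looks at group labels, which is the point:
the filter is chosen on overall spread, not on the comparison you intend
to test afterwards.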

> Please note that the 450K data is in some fundamental ways different
> to "regular" gene expression data and that specific tools might be
> more applicable.

I will remark at this point that Kaushal has been using minfi and
friends, so the pipeline isn't completely insane. Therefore I will
address something that seems implicit in your remark, namely, the
different sort of correlation (more spatial than dynamic, for lack of a
better phrase) that is expected in DNA methylation data.

One could make the argument, and not without justification, that
collapsing measurements onto "bumps" of significant regional changes is
a more useful first step in this process, since they tend to suppress
noise. But then you need to have some basis to define the bumps (are
they defined by transcription factor footprints? By broad lamin
attachment domains? By local correlation between CpGs?). Absent such a
basis, it's often useful in exploratory analysis to consolidate your
statistical power by testing a subset of highly variable loci, just as
with any other high-dimensional data type where you believe the true
signal to be sparse (if it is present at all). The justification
presented in http://www.pnas.org/content/107/21/9546.long does not
claim that genes on an expression array are independently expressed,
simply that variance filtering empirically improves power to detect
differences.
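
One mechanical reason filtering can help with adjusted p-values: a
Benjamini-Hochberg adjustment scales each p-value by the number of
tests, so dropping uninformative probes before testing shrinks that
multiplier. The sketch below is a plain-Python BH adjustment applied to
made-up p-values. The numbers, and the idea that the filter happens to
keep every true signal, are illustrative assumptions, and the whole
thing only makes sense if the filter ignored the group labels.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

if __name__ == "__main__":
    # Made-up raw p-values: 3 small ones plus 17 unremarkable ones.
    signal = [0.004, 0.006, 0.009]
    noise = [0.05 * k for k in range(3, 20)]  # 0.15 up to 0.95
    full = signal + noise
    # Pretend a label-blind variance filter kept the three signal
    # probes and dropped 10 of the 17 noise probes.
    filtered = signal + noise[:7]
    print("all 20 tests:", [round(p, 3) for p in bh_adjust(full)[:3]])
    print("10 kept     :", [round(p, 3) for p in bh_adjust(filtered)[:3]])
```

With these toy numbers the three smallest p-values adjust to about 0.06
across all 20 tests and to about 0.03 across the 10 that survive the
filter. The raw p-values are identical in both cases; only the
multiple-testing burden changed.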

Until such time as we have an unsupervised method which reliably
detects regional changes of objectively superior value in 450k
methylation data (for example, I finally got around to experimenting
with the A-clustering method described in
http://bioinformatics.oxfordjournals.org/content/29/22/2884 to evaluate
it), variance filtering is not such a bad idea. And, again, in
practical application you may be surprised to find that M-values and
beta values both have their strengths and weaknesses. If you can squash
*all* of the technical artifacts, M-values are theoretically more
appealing, but my experience (across about 12,000 samples from various
experiments) has been that said squashing is more difficult than the
publication-biased literature might have you believe. Your mileage may
vary!
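
To show what "collapsing onto regions" means mechanically, here is a
deliberately naive sketch: group neighbouring CpGs by genomic distance
and average within each group. This is NOT A-clustering, nor the
bumphunter approach in minfi; it is only a toy stand-in. The function
names, the max_gap and min_size parameters and all the numbers are
hypothetical.

```python
def group_by_gap(positions, max_gap=500):
    """Split sorted positions (one chromosome) into runs of neighbours.

    Consecutive CpGs at most max_gap bases apart share a cluster.
    max_gap=500 is an arbitrary illustrative value, not a
    recommendation. Returns lists of indices into `positions`.
    """
    clusters = []
    for i, pos in enumerate(positions):
        if clusters and pos - positions[clusters[-1][-1]] <= max_gap:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters

def region_means(values, clusters, min_size=2):
    """Average per-CpG values within each cluster of >= min_size CpGs."""
    return [
        sum(values[i] for i in c) / len(c)
        for c in clusters
        if len(c) >= min_size
    ]

if __name__ == "__main__":
    # Made-up positions (bp) and per-CpG group differences in beta.
    pos = [1000, 1200, 1350, 9000, 20000, 20100]
    delta = [0.20, 0.25, 0.15, -0.40, 0.02, -0.02]
    clusters = group_by_gap(pos)
    print(clusters)
    print(region_means(delta, clusters))
```

The toy data show both sides of the trade-off: the three concordant
neighbours average to a clear regional shift and the two discordant ones
cancel out, but the large isolated difference at position 9000 drops out
entirely because it has no neighbours to form a region with.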

Thank you again for responding to Kaushal's query, and I hope you'll
have some further remarks of your own.

All the best,

--t
Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Tue, May 6, 2014 at 7:53 AM, Martin Rijlaarsdam
<m.a.rijlaarsdam@gmail.com> wrote:
> Hi,
>
> Highly variable probes are not necessarily differentiating between
> groups (if that is what you are aiming at). Please give some more
> information about your experiment and what tool / procedure you use
> for testing and correction for multiple testing. Also look at
>
> Wilhelm-Benartzi CS, Koestler DC, Karagas MR, Flanagan JM,
> Christensen BC, Kelsey KT, et al. Review of processing and analysis
> methods for DNA methylation array data. Br J Cancer.
> 2013;109(6):1394-402.
>
> for some more 450K specific ways to handle this data. I assume you
> used M values for filtering, not beta? Please note that the 450K data
> is in some fundamental ways different to "regular" gene expression
> data and that specific tools might be more applicable. Also see.
>
> Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G, Fuks F.
> A comprehensive overview of Infinium HumanMethylation450 data
> processing. Briefings in bioinformatics. 2013.
>
> Kind regards,
> Martin
>
> --
> M.A. (Martin) Rijlaarsdam MSc. MD
> Erasmus MC - University Medical Center Rotterdam
> Department of Pathology
> Room Be-432b
> Shipping address: P.O. Box 2040, 3000 CA Rotterdam, The Netherlands
> Visiting address: Dr. Molewaterplein 50, 3015 GE Rotterdam,
> The Netherlands
>
> Email: m.a.rijlaarsdam@gmail.com
> Mobile: +31 6 45408508
> Telephone (work): +31 10 7033409
> Fax +31 10 7044365
> Website: http://www.martinrijlaarsdam.nl
>
>
> On Tue, May 6, 2014 at 4:43 PM, kaushal [guest]
> <guest@bioconductor.org> wrote:
>
> >
> > Hello list;
> >
> > I have 450 k human DNA methylation data. I used genefilter package
> > to get the 50% most variable CpG sites that gives me only half of
> > the CpG sites for analysis that was originally in 450 K. However,
> > CpG sites are still not significant according to adjusted p-values.
> > I am not quite sure what could be the reason for this? Thanks for
> > any insights.
> >
> > Thanks !!!
> >
> >
> > -- output of sessionInfo():
> >
> > None
> >
> > --
> > Sent via the guest posting facility at bioconductor.org.
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
>