Fwd: adjusted P-values


Martin Rijlaarsdam ▴ 190

@martin-rijlaarsdam-6043

Last seen 11.4 years ago

Hi,

Highly variable probes do not necessarily differentiate between groups (if that is what you are aiming at). Please give some more information about your experiment, and about which tool or procedure you use for testing and for multiple-testing correction. Also look at

Wilhelm-Benartzi CS, Koestler DC, Karagas MR, Flanagan JM, Christensen BC, Kelsey KT, et al. Review of processing and analysis methods for DNA methylation array data. Br J Cancer. 2013;109(6):1394-402.

for some more 450K-specific ways to handle this data. I assume you used M values for filtering, not beta? Please note that 450K data differ in some fundamental ways from "regular" gene expression data, and that specific tools might be more applicable. Also see

Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G, Fuks F. A comprehensive overview of Infinium HumanMethylation450 data processing. Briefings in Bioinformatics. 2013.

Kind regards,
Martin

--
M.A. (Martin) Rijlaarsdam MSc. MD
Erasmus MC - University Medical Center Rotterdam
Department of Pathology
Room Be-432b
Shipping address: P.O. Box 2040, 3000 CA Rotterdam, The Netherlands
Visiting address: Dr. Molewaterplein 50, 3015 GE Rotterdam, The Netherlands
Email: m.a.rijlaarsdam@gmail.com
Mobile: +31 6 45408508
Telephone (work): +31 10 7033409
Fax: +31 10 7044365
Website: http://www.martinrijlaarsdam.nl

On Tue, May 6, 2014 at 4:43 PM, kaushal [guest] <guest@bioconductor.org> wrote:

> Hello list;
>
> I have 450k human DNA methylation data. I used the genefilter package to keep the 50% most variable CpG sites, which leaves only half of the CpG sites originally on the 450K array for analysis. However, the CpG sites are still not significant according to adjusted p-values. I am not quite sure what could be the reason for this? Thanks for any insights.
>
> Thanks !!!
>
> -- output of sessionInfo():
>
> None
>
> --
> Sent via the guest posting facility at bioconductor.org.

_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
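For readers following the question about adjusted p-values: the thread does not say which correction was used, so what follows is only a minimal sketch that assumes the Benjamini-Hochberg procedure (the method behind `p.adjust(..., method = "BH")` in R). It is written in plain Python purely for illustration; the function name `bh_adjust` and the toy p-values are hypothetical and are not taken from Kaushal's data. The sketch shows that an adjusted p-value depends on the number of tests as well as on the raw p-value, so in this toy case dropping half of the probes does not by itself push a weak signal below 0.05.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from the largest p-value to the smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        rank = m - rank_from_top              # 1-based rank in ascending order
        candidate = pvalues[idx] * m / rank   # p * m / rank
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[idx] = running_min
    return adjusted


# Toy example (made-up numbers): one modest raw p-value among null-like ones.
full = [2e-4] + [0.5] * 999        # 1000 tests before filtering
filtered = [2e-4] + [0.5] * 499    # 500 tests after dropping half of them

adj_full = bh_adjust(full)[0]          # about 0.2
adj_filtered = bh_adjust(filtered)[0]  # about 0.1
print(adj_full, adj_filtered)
```

In this made-up case the adjusted value improves after filtering but stays above 0.05, which is consistent with the situation described in the question. How much filtering helps on real data depends on which probes are removed.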

Cancer genefilter • 1.8k views

updated 11.7 years ago by Tim Triche ★ 4.2k • written 11.7 years ago by Martin Rijlaarsdam ▴ 190


Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 5.4 years ago

United States

Hi Martin,

I'm one of the people who sent Kaushal to bioc-list, both for a second opinion and to create a public record of these discussions. Thank you for responding -- I feared that I'd led the original poster astray! I hope you will take the following comments in the spirit they are intended, namely, as fodder for discussion, rather than criticism.

> Highly variable probes are not necessarily differentiating between groups (if that is what you are aiming at).

This is an odd assertion, which perhaps I am misunderstanding. However, without some degree of variation to partition, it is difficult if not impossible to determine what is biological or technical in origin, and what is condition-specific. Moreover, the smaller the variance, the more likely it is to be technical (rather than biological) in origin. Can you provide an example of a relevant biological difference where overall variability would mask the effect?

> [ re: Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G, Fuks F. A comprehensive overview of Infinium HumanMethylation450 data processing. Briefings in bioinformatics. 2013. and similar... ]

I'd like to suggest http://biorxiv.org/content/biorxiv/early/2014/02/23/002956.full.pdf , which beautifully disposes of a great deal of the BS surrounding normalization of 450k data. If there is a more relevant treatment of general-purpose preprocessing and normalization on Illumina 450k (or 27k, for that matter) data, I haven't seen it. The fundamental problem with many (most?) of the other reviews is their limited scope, editor-appeasing benchmarks, and a tendency to replace objective comparisons with so-called "expert opinion". JP Fortin's paper avoids all that. Do have a look!

> I assume you used M values for filtering, not beta?

While the quasi-linearizing effects of the logit transform lend themselves well to the assumptions we like to make as statisticians (the multivariate normal distribution does possess many useful mathematical properties), it's not at all clear to me that the fold-change associations "discovered" at individual loci are always worth noting. When they hold up across multiple-locus bumps after normalization, on the other hand, the findings tend to be more interesting. However, when they hold up across multiple loci in "bumps", paradoxically, beta values often find the same bumps. There are many reasons to believe that, in general, these "bumps" are the biologically relevant quantity of interest.

In any event, absent a basis for such bump-hunting a priori, and when pretending that individual CpG loci are independent, concentrating on the most variable loci (on either scale, believe it or not) after preprocessing and normalization seems to increase power to detect real biological differences for the same reason as it does on expression arrays: the closer you get to the limit of detection, the more likely you are to see spurious results. The further you are from the limit of detection, the more likely you are to see higher overall variance (or MAD, or whatever) relative to the population. If you have a lot of strong confounders, of course, you'll have a different set of problems; but then you might ask why the experiment was designed in such a fashion if that were the case.

> Please note that the 450K data is in some fundamental ways different to "regular" gene expression data and that specific tools might be more applicable.

I will remark at this point that Kaushal has been using minfi and friends, so the pipeline isn't completely insane.
Therefore I will address something that seems implicit in your remark, namely, the different sort of correlation (more spatial than dynamic, for lack of a better phrase) that is expected in DNA methylation data.

One could make the argument, and not without justification, that collapsing measurements onto "bumps" of significant regional changes is a more useful first step in this process, since they tend to suppress noise. But then you need to have some basis to define the bumps (are they defined by transcription factor footprints? By broad lamin attachment domains? By local correlation between CpGs?). Absent such a basis, it's often useful in exploratory analysis to consolidate your statistical power by testing a subset of highly variable loci, just as with any other high-dimensional data type where you believe the true signal to be sparse (if it is present at all). The justification presented in http://www.pnas.org/content/107/21/9546.long does not claim that genes on an expression array are independently expressed, simply that variance filtering empirically improves power to detect differences.

Until such time as we have an unsupervised method which reliably detects regional changes of objectively superior value in 450k methylation data (for example, I finally got around to experimenting with the A-clustering method described in http://bioinformatics.oxfordjournals.org/content/29/22/2884 to evaluate it), variance filtering is not such a bad idea. And, again, in practical application you may be surprised to find that M-values and beta values both have their strengths and weaknesses. If you can squash *all* of the technical artifacts, M-values are theoretically more appealing, but my experience (across about 12,000 samples from various experiments) has been that said squashing is more difficult than the publication-biased literature might have you believe. Your mileage may vary!

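The "on either scale" remark about variance filtering can be made concrete. The following is a minimal sketch in plain Python, assuming the usual logit definition of the M-value, M = log2(beta / (1 - beta)), and a simple sample-variance ranking; it is not the code used by genefilter or minfi. The probe names, the beta values, and the helper names `beta_to_m` and `top_variable` are all made up for illustration.

```python
import math
import statistics


def beta_to_m(beta, eps=1e-6):
    """Base-2 logit of a beta value, clipped away from 0 and 1 to stay finite."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def top_variable(matrix, fraction=0.5, spread=statistics.variance):
    """Return the names of the most variable rows.

    matrix maps a (hypothetical) probe name to its per-sample values;
    spread is the variability measure used for ranking.
    """
    n_keep = max(1, int(len(matrix) * fraction))
    ranked = sorted(matrix, key=lambda name: spread(matrix[name]), reverse=True)
    return set(ranked[:n_keep])


# Made-up beta values for four probes across four samples.
betas = {
    "probe_a": [0.02, 0.03, 0.02, 0.04],  # near 0: tiny spread on the beta scale
    "probe_b": [0.45, 0.55, 0.50, 0.60],  # mid-range: moderate beta spread
    "probe_c": [0.10, 0.80, 0.15, 0.85],  # large spread on both scales
    "probe_d": [0.50, 0.51, 0.50, 0.49],  # almost constant
}
mvals = {name: [beta_to_m(b) for b in values] for name, values in betas.items()}

kept_on_beta = top_variable(betas)
kept_on_m = top_variable(mvals)
print(sorted(kept_on_beta), sorted(kept_on_m))
```

With these toy numbers, both scales keep `probe_c`, but the beta scale keeps the mid-range `probe_b` while the M scale keeps the near-zero `probe_a`, because the logit stretches values close to 0 and 1. A different `spread` function (for example a median absolute deviation) could be passed in the same way.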
Thank you again for responding to Kaushal's query, and I hope you'll have some further remarks of your own.

All the best,

--t

Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

written 11.7 years ago by Tim Triche ★ 4.2k


Dear Tim,

Thank you for your reply. In my reply I was merely pointing out that Kaushal's original question could be answered better if more information about the samples, the sample size and the analysis pipeline had been provided. My answer may, however, also have been too short. Looking at your response, it looks like he has all the in-house knowledge he needs :-)

With regard to variance filtering: I am not saying it is not a good first step to take (I use it as well). I was merely pointing out that selecting variable probes does not guarantee significantly differentiating probes (as I seemed to read in Kaushal's question).

I agree that beta and M values can both be used, but the titration experiment by Du et al. (2010) clearly showed a much better trade-off between true positive rate and detection rate when using M values, which is why I have opted for M values in statistical analysis until now.

Thank you for the suggested literature.

Kind regards,
Martin
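The point that a highly variable probe need not separate the groups can be illustrated with a small sketch. It assumes a two-group comparison and uses Welch's t-statistic as a stand-in for whatever test was actually run (the thread does not say); the helper name `welch_t` and the M-values below are invented for illustration only.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent samples."""
    se_a = statistics.variance(group_a) / len(group_a)
    se_b = statistics.variance(group_b) / len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    return diff / math.sqrt(se_a + se_b)


# Made-up M-values for two probes, four samples per group.
noisy_a = [-2.0, 2.0, -1.5, 1.5]     # large spread, same centre in both groups
noisy_b = [-1.8, 1.9, -1.6, 1.5]
shifted_a = [0.9, 1.0, 1.1, 1.0]     # small spread, consistent group shift
shifted_b = [0.4, 0.5, 0.6, 0.5]

var_noisy = statistics.variance(noisy_a + noisy_b)
var_shifted = statistics.variance(shifted_a + shifted_b)
t_noisy = welch_t(noisy_a, noisy_b)
t_shifted = welch_t(shifted_a, shifted_b)
print(var_noisy, var_shifted, t_noisy, t_shifted)
```

In this toy case the first probe has by far the larger overall variance and would survive a variance filter ahead of the second, yet its t-statistic is close to zero, while the less variable second probe has a large one. This is only an illustration of the logical point and says nothing about how often such probes occur in real 450k data.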

written 11.7 years ago by Martin Rijlaarsdam ▴ 190
