It has been reported that one problem with testing for enrichment of functional themes in 450k data is the bias introduced by the differing number of probes per gene (see Geeleher et al., http://www.ncbi.nlm.nih.gov/pubmed/23732277).

The paper linked above describes correcting for this bias by adopting the probability weighting function (pwf) from the goseq package, which is more commonly used to correct for gene-length bias in RNA-seq data.

It is straightforward to use this probability weighting function with downstream gene ontology / KEGG enrichment analysis, but it is not obvious to me how you can use it to adjust the p-values from a 450k analysis of different groups and then pass the result to an external tool like Ingenuity Pathway Analysis.

Is there a code snippet that I could use, or that could help me understand, how to use the pwf to adjust 450k p-values in this way?

Do you mean that you want to use the output from the pwf function to adjust the p-values obtained from a differential methylation analysis? This is circular: you first need to perform the differential methylation analysis and impose some sort of cut-off (usually based on the p-values) to specify which probes are significant. That list is then fed into the pwf function, which estimates the probability of significant differential methylation given the number of CpGs per gene. To me, it wouldn't make sense to apply those estimates back to the original p-values, which were used to estimate the prior probabilities in the first place.
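To make the dependence on probe number concrete, here is a rough Python sketch of the quantity the pwf is estimating. It is not goseq: it simply bins genes by probe count and reports the fraction called significant in each bin, whereas goseq (as I understand it) fits a smooth curve in R. The function name, gene names, counts and bin edges are all made up for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def binned_significance_rate(probe_counts, significant_genes, bin_edges):
    """Fraction of genes called significant within each probe-count bin.

    probe_counts: dict mapping gene -> number of probes on the array
    significant_genes: set of genes that passed the chosen cut-off
    bin_edges: sorted upper edges; counts above the last edge share a final bin
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gene, n_probes in probe_counts.items():
        # Index of the first bin whose upper edge is >= n_probes.
        bin_index = bisect_left(bin_edges, n_probes)
        totals[bin_index] += 1
        hits[bin_index] += int(gene in significant_genes)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Hypothetical toy data: genes with more probes are called significant more often.
probe_counts = {"geneA": 2, "geneB": 3, "geneC": 12,
                "geneD": 15, "geneE": 40, "geneF": 55}
significant_genes = {"geneC", "geneE", "geneF"}
rates = binned_significance_rate(probe_counts, significant_genes, [5, 20])
print(rates)
```

Note that the significant set has to be chosen before these rates can be computed, which is the circularity described above.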

I don't use Ingenuity Pathway Analysis, so I am not entirely sure what it takes as input. I have written a function called gometh (in the development version of the missMethyl package) that simply takes a list of significant CpGs, estimates the prior probabilities using the pwf function, and outputs a data frame with the GO categories and associated results (p-values, FDRs etc.), taking the bias into account. However, based on your question, I don't think this is what you want to do.

My only other thought is that you could be more selective in how genes are chosen based on differential methylation of CpG sites. You might want to consider combining CpG site-level p-values at the gene level using Simes' method (for example), which is thought to be robust to positively correlated statistics. Imposing a multiple testing adjustment within genes would force the p-values for a gene with lots of probes to be more heavily adjusted than those for a gene with only a handful of probes. However, I have not tested this out and I am not aware of any R functions to do this specifically for the 450K array.
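For what it's worth, Simes' method itself is only a few lines. Below is a minimal Python sketch, untested on real 450K data: for the m sorted p-values p_(1) <= ... <= p_(m) of the probes mapping to one gene, the combined p-value is the minimum over i of m * p_(i) / i. The function name and the example p-values are hypothetical.

```python
def simes_p_value(p_values):
    """Combine the probe-level p-values for one gene using Simes' method."""
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value per gene")
    ordered = sorted(p_values)
    # The i-th smallest p-value is scaled by m / i; the smallest scaled value wins.
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


# Hypothetical example: the same smallest p-value in a gene with one probe
# and in a gene with ten probes, where the other nine probes show nothing.
few_probes = simes_p_value([0.01])
many_probes = simes_p_value([0.01] + [0.9] * 9)
print(few_probes, many_probes)
```

In this toy case the gene with ten probes ends up with a combined p-value about ten times larger than the single-probe gene, which is the heavier within-gene adjustment I had in mind. The resulting gene-level p-values could then be thresholded to produce a gene list.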

Thanks Belinda for your comment, and I take your point that there is a circularity in my arguments. So is it correct to say that the output of the pwf function is the probability of significant methylation for that gene given the number of probes that map to that gene and the p values from the differential analysis for those probes?

If that is the case, then could I use the pwf outputs to determine whether or not to enter a gene into Ingenuity, which essentially takes a list of genes as input?

Thanks again,

Ed