Question

Multiple testing correction for p.scores in FELLA

sergi.picart ▴ 20

@sergipicart-11865

Germany

Dear developers, we have received several questions about whether p.scores should be treated as p-values and corrected for multiple testing when using the FELLA package for metabolomics data analysis. For example:

I guess the p score is analogous to p value? There were 740 p.scores smaller than 0.05. Then if I apply multiple-testing correction using the p.adjust() function in R, none of the corrected 10558 p.scores was smaller than 0.05.

I am curious whether this method [diffusion] involves post hoc FDR correction like the hypergeometric test you included in the package does?

I am wondering if the P values that result from the diffusion analysis with simulations (i.e. 10000 iterations) in FELLA are corrected for multiple testing and, if so, how this is done?

Below we provide some insights that should clarify this topic and help you decide how to proceed.

FELLA • 655 views

score 0 · Answer 1 · 2021-11-01

Short answer: FELLA's p.scores from diffusion/pagerank are not corrected for multiple testing and are not used as p-values. They serve as a statistically adjusted prioritiser.

Long answer:

In general, hypergeom uses classical hypothesis testing (its results are therefore adjusted for multiple testing and called significant), whereas diffusion and pagerank are used as prioritisers. In the latter, scores are computed (the lower, the better) and sorted in ascending order. A threshold is then applied to extract the top entities and return them as a sub-network, which typically contains large connected components.
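To make the prioritiser idea concrete, here is a minimal sketch in Python (chosen to stay independent of FELLA's actual R API). The function name `select_top_nodes`, its parameters and the node ids are hypothetical; the threshold and node limit only mirror the values mentioned in this answer and are not read from the package.

```python
def select_top_nodes(p_scores, threshold=0.05, max_nodes=250):
    """Rank nodes by ascending p-score, keep those below the threshold,
    and cap the result at max_nodes (illustrative defaults, an assumption)."""
    ranked = sorted(p_scores.items(), key=lambda item: item[1])
    passing = [node for node, score in ranked if score < threshold]
    return passing[:max_nodes]


# Toy p-scores for five hypothetical nodes
p_scores = {"cpd_A": 0.001, "rx_B": 0.2, "path_C": 0.03, "mod_D": 0.049, "cpd_E": 0.6}

top = select_top_nodes(p_scores, threshold=0.05, max_nodes=2)
print(top)  # ['cpd_A', 'path_C']
```

Note that no multiple testing correction appears anywhere: the p-score is only used to rank nodes and to cut the list.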

For simplicity, both approximations, normality and simulation, yield p-scores between 0 and 1 that go in the same direction (lower is better). Specifically, normality applies the cumulative distribution function of a Gaussian distribution to the exact z-scores, capping at 1e-6, while simulation gives an empirical p-value by definition. Because both approaches yield values in [0, 1], the same default p-score cutoff will, in practice, most likely work equally well for either. In addition, by default, up to 250 nodes are reported (even if more nodes have a p-score lower than 0.05), since manual examination of larger networks becomes cumbersome.
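The two approximations can be sketched as follows, again in Python and not as FELLA's actual code. Several details are assumptions made for illustration: that a higher raw score means a more relevant node (so the upper tail is used), that the 1e-6 cap acts as a lower bound on the p-score, and that the empirical p-value uses the common (hits + 1) / (n + 1) estimator. All function and parameter names are hypothetical.

```python
import random
from statistics import NormalDist


def p_score_normality(observed, null_mean, null_sd, floor=1e-6):
    """Gaussian approximation: z-score, then upper-tail probability.
    The floor mirrors the 1e-6 cap (read here as a lower bound, an assumption)."""
    z = (observed - null_mean) / null_sd
    p = 1.0 - NormalDist().cdf(z)
    return max(p, floor)


def p_score_simulation(observed, null_scores):
    """Empirical p-value: share of null scores at least as large as the
    observed one, with a +1 correction so the result is never exactly 0."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


# Toy null distribution: 10000 simulated scores from a standard Gaussian
rng = random.Random(1)
null_scores = [rng.gauss(0.0, 1.0) for _ in range(10000)]

observed = 2.0
print(p_score_normality(observed, null_mean=0.0, null_sd=1.0))
print(p_score_simulation(observed, null_scores))
```

When the null really is Gaussian, as in this toy case, both values should be close to each other, which is why one cutoff can serve both approximations.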

Can those p-scores be used as p-values? Yes, especially those obtained by simulation (since the null distribution does not need to be Gaussian), but doing so moves away from the current prioritiser paradigm into a hypothesis testing one. We experimented with that and often reached trivial conclusions, such as only the input nodes being significant. If you want to pursue the idea further, you may try an FDR correction stratified by node category. This would prevent smaller categories (pathways, modules) from being overly penalised by larger ones (compounds, reactions). We did not fully explore this path.
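A stratified correction along those lines could look like the sketch below. It is not part of FELLA: it uses the Benjamini-Hochberg procedure as one possible FDR correction, applied separately within each node category. The helper names, node ids and toy p-scores are hypothetical.

```python
from collections import defaultdict


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stratified_fdr(p_scores, categories):
    """Apply the correction separately within each node category."""
    groups = defaultdict(list)
    for node, category in categories.items():
        groups[category].append(node)
    adjusted = {}
    for nodes in groups.values():
        adjusted.update(zip(nodes, benjamini_hochberg([p_scores[n] for n in nodes])))
    return adjusted


# Toy example: two pathways and four compounds
p_scores = {"path1": 0.01, "path2": 0.2,
            "cpd1": 0.01, "cpd2": 0.02, "cpd3": 0.5, "cpd4": 0.9}
categories = {"path1": "pathway", "path2": "pathway",
              "cpd1": "compound", "cpd2": "compound",
              "cpd3": "compound", "cpd4": "compound"}

stratified = stratified_fdr(p_scores, categories)
pooled = dict(zip(p_scores, benjamini_hochberg(list(p_scores.values()))))
print(stratified["path1"], pooled["path1"])
```

In this toy example the pathway `path1` gets a smaller adjusted value when corrected within its own category than when pooled with the compounds. With real data, where compounds and reactions vastly outnumber pathways and modules, the difference would be expected to be larger.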

As a side note, FELLA also filters out connected components small enough that they could have arisen from a random selection of nodes. If an input list is noisy (meaning the compounds are not really proximal in the network), one would expect smaller connected components, which are often filtered out.
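The intuition behind that filter can be illustrated with a small Monte Carlo sketch. This is not FELLA's procedure, only one way to ask how often a random selection of the same number of nodes contains a connected component at least as large as the observed one. The toy graph, the function names and the number of simulations are all assumptions.

```python
import random


def component_sizes(adjacency, selected):
    """Sizes of the connected components induced by the selected nodes,
    largest first."""
    selected = set(selected)
    seen = set()
    sizes = []
    for start in selected:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency.get(node, ()):
                if neighbour in selected and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def component_size_p_value(adjacency, n_selected, size, n_sim=1000, seed=0):
    """Estimate how often a random selection of n_selected nodes (at least 1)
    has a largest component of at least the given size."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    hits = 0
    for _ in range(n_sim):
        sample = rng.sample(nodes, n_selected)
        if component_sizes(adjacency, sample)[0] >= size:
            hits += 1
    return (hits + 1) / (n_sim + 1)


# Toy network: a path graph with 30 nodes (0 - 1 - 2 - ... - 29)
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j < 30] for i in range(30)}

selected = [0, 1, 2, 3, 10]
sizes = component_sizes(adjacency, selected)
print(sizes)  # [4, 1]
print(component_size_p_value(adjacency, len(selected), sizes[0]))
```

A component of size 1 is trivially reachable by any random selection, so it would be a natural candidate for removal, whereas a component of four adjacent nodes among five selected ones is much less likely to arise by chance in this toy graph.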