Question

IHWpaper: choice of `nbins=4` in `proteomics_example_analysis.R`

paul.johnston ▴ 10

@pauljohnston-9518

United Kingdom

I noticed that nbins is set manually in the ihw call in proteomics_example_analysis.R from IHWpaper:

ihw_res <- ihw(proteomics_df$pvalue, proteomics_df$X..peptides, .1, nbins=4,
               nsplits_internal=5, lambdas=seq(0,3,length=20))

If I change it to nbins = "auto", then I get the reduction to BH message:

Only 1 bin; IHW reduces to Benjamini Hochberg (uniform weights)

My question is how did you arrive at nbins=4? And how should I go about setting nbins for my own similar proteomics data where nbins = "auto" also reduces to BH?

ihw • 1.4k views

updated 7.5 years ago by Nikos Ignatiadis ▴ 180 • written 7.5 years ago by paul.johnston ▴ 10

score 4 · Accepted Answer · 2017-06-10

Hi Paul,

It's a good question. As is so often the case, this is a tradeoff between bias, variance and computation time. Here, bias and variance refer to estimating the weight function from the other folds and then applying it to the held-out fold. For the default choices in the IHW package we have opted for a more conservative route that will often shrink the weights towards uniform, thus recovering results equal or very close to those of Benjamini-Hochberg. It is important to note, however, that this affects power, not FDR control.

By default (`nbins = "auto"`), at least 1500 p-values are required in every bin. In our experience this leads to a good estimate of the underlying distribution. However, you might argue that even a somewhat noisier estimate of the distribution could still give a good estimate of the weight function, especially since we add a total variation ("fused lasso") type penalty. This is indeed true, but it requires a good choice of the regularization parameter. To get a better handle on that, we need to do cross-validation nested within the fold splitting (specified by `nsplits_internal`). However, this increases computation time by quite a bit, which is why it is only done once by default (rather than e.g. 5 times as in your example).
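To make the 1500-per-bin requirement concrete, here is a minimal sketch of the arithmetic for the proteomics example. It assumes the automatic rule amounts to integer division of the number of hypotheses by 1500; the variable names are illustrative and this is not the package's actual code.

```r
# Number of hypotheses in the proteomics example
m <- 2666

# Minimum number of p-values per bin under nbins = "auto" (as stated above)
min_per_bin_auto <- 1500

# Assumed rule: as many bins as fit, but never fewer than one
nbins_auto <- max(1, m %/% min_per_bin_auto)
nbins_auto  # 1, which is why IHW falls back to Benjamini-Hochberg here
```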

In any case, to answer your question: if you have a relatively small multiple testing problem (such as the proteomics example with only 2666 hypotheses) but still want to apply IHW, then you can use a larger number of bins than what is set by default. If you do this, I strongly encourage you to also increase `nsplits_internal`. Finally, I would not recommend using bins with fewer than 600 hypotheses or so.
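As a rough sketch of how this advice could be applied to your own data: `choose_nbins` below is a hypothetical helper (it is not part of IHW) that picks the largest number of bins leaving about 600 hypotheses per bin, assuming roughly equal-sized bins. The `ihw` call reuses the column names from the question and leaves `lambdas` at its default.

```r
# Hypothetical helper (not part of IHW): largest number of bins such that
# each bin still holds at least `min_per_bin` hypotheses.
choose_nbins <- function(m, min_per_bin = 600) {
  max(1L, as.integer(m %/% min_per_bin))
}

nbins <- choose_nbins(2666)  # 2666 %/% 600 = 4

# Runs only if IHW is installed and `proteomics_df` from the question is loaded.
if (requireNamespace("IHW", quietly = TRUE) && exists("proteomics_df")) {
  ihw_res <- IHW::ihw(proteomics_df$pvalue, proteomics_df$X..peptides,
                      alpha = 0.1, nbins = nbins,
                      nsplits_internal = 5)  # more than the default of one split
}
```

With 2666 hypotheses this gives 4 bins, in line with the value used in the example script. Treat the 600 threshold as a rule of thumb rather than a hard limit.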

Hope this helps,
Nikos