Question

Using Bumphunter and SVA together in Minfi: bumphunterEngine nullMethod warning and FWER calculations

0

Entering edit mode

brhead ▴ 40

@brhead-11927

Last seen 7.8 years ago

Hi all,

I have 3 related questions about the Bumphunter implementation in Minfi. Help with one or all of them is much appreciated!

1. Clarification of what Bumphunter is doing:

After reading the bumphunter section of the Minfi tutorial:

https://www.bioconductor.org/help/course-materials/2014/BioC2014/minfi_BioC2014.pdf

which refers to this bump hunting paper:

http://ije.oxfordjournals.org/content/41/1/200.full

... I got the impression that the Minfi implementation of bumphunter would automatically run SVA and include the SV covariates in models. However, after reading the Bumphunter User's Guide:

https://bioconductor.org/packages/release/bioc/vignettes/bumphunter/inst/doc/bumphunter.pdf

... it sounds like Minfi does NOT run SVA ("Notably, batch effect removal and the application of the bootstrap to linear models of Efron and Tibshirani [1] need additional code.").

Can someone clarify whether the Minfi bumphunter runs SVA?

2. Choice of nullMethod:

If I run SVA separately and then include the resulting SV covariates in my design matrix when running Bumphunter, using the default "permutation" nullMethod, I get a warning:

> dmrs.cd8.b1000.pickCutoff.sva <- bumphunter(GRSet.norm.na.good.noSnp.noXreact.CD4, design = designMatrix.cd4, pickCutoff=T, B=1000, type="M")

[bumphunterEngine] The use of the permutation test is not recommended with multiple covariates, (ncol(design)>2). Consider changing 'nullMethod' changed to 'bootstrap' instead. See vignette for more information.

Why is the permutation method not recommended for models that include covariates? Here is what the bumphunter vignette says:

"However, when X has columns other than those representing an intercept term and the
covariate of interest, the permutation test approach is not recommended. The function will
run but give a warning. A method based on the bootstrap for linear models of Efron and
Tibshirani [1] may be more appropriate but this is not currently implemented."

However, when I run with the "bootstrap" nullMethod instead, I get a lot of "Inf" values and pvalues of 0 in the results table:

> dmrs.cd4.b1000.pickCutoff.sva.boot <- bumphunter(GRSet.norm.na.good.noSnp.noXreact.CD4, design = designMatrix.cd4, pickCutoff=T, B=1000, type="M", nullMethod='bootstrap')

> head(dmrs.cd4.b1000.pickCutoff.sva.boot$table)
       chr     start       end value area cluster indexStart indexEnd L
222   chr1 101491640 101491640   Inf  Inf   19234      39105    39105 1
518  chr10 106014410 106014410   Inf  Inf   53151     434838   434838 1
713  chr11  70601971  70601971   Inf  Inf   69955     472762   472762 1
716  chr11  71791563  71791563   Inf  Inf   70239     473310   473310 1
742  chr11  93862060  93862060   Inf  Inf   73067     478541   478541 1
1179 chr14  69444462  69444462   Inf  Inf  116139     563513   563513 1
     clusterL p.value fwer p.valueArea fwerArea
222        15       0    0           0        0
518        18       0    0           0        0
713         4       0    0           0        0
716        20       0    0           0        0
742        15       0    0           0        0
1179       22       0    0           0        0

It appears that something is going wrong with the bootstrap method when including covariates. How should I interpret the values and areas of "Inf"? The permutation method, in contrast, seems to work great; I just don't know whether there is a reason I shouldn't trust those results.

3. Different p values and error rates.

Finally, I am confused about the two different p values and family-wise error rates that are output by the program (p.value, fwer, p.valueArea, fwerArea). Can someone please explain what the differences are? My impression is that "p.value" is the empirical p value from the permutation test (the fraction of null areas greater than the area that was actually observed), and that the "fwer" is that p.value adjusted for multiple hypotheses (by Bonferroni or another method?), but please let me know if this is incorrect.

Thanks,

Brooke

minfi sva bumphunter • 2.7k views

ADD COMMENT • link 8.9 years ago • updated 8.8 years ago brhead ▴ 40

score 0 · Answer 1 · 2016-11-28

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 16 hours ago

United States

Neither bumphunter nor minfi use SVA natively, although (as the authors argue in the paper you reference), it might be a good idea to do so.
Permutations aren't a good idea when you have other covariates, because the covariates are intended to control for specific attributes of individual samples, and you wouldn't want to randomly assign those attributes to different subjects. Put another way, under the null you are assuming that the subjects are exchangeable, but if you are saying there are specific phenotypic differences that you need to control for, then they really aren't exchangeable are they? As for why you get those results under the bootstrap, note that all of your regions are a single CpG site, which doesn't look right to me. What does the design matrix look like?
The p-values are the proportion of the bootstrapped results for the given region that were as large or larger than what you observed for that region (this is a conventional permutation p-value), based on either the smoothed betas, or the area of the peak. The FWER are the family wise error rates, which is the proportion of bootstrapped results for which any region was as or more extreme as the observed results for the region under consideration. In other words, in the first case we just compare the bootstraps for a particular bump to the observed value of that bump, but for the fwer we are comparing the bootstraps for all bumps to the observed value of that bump.

ADD COMMENT • link 8.9 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James,

Thank you for your clear response! For question 2, I understand now why permutations are not recommended when including covariates. Here's what my design matrix looks like:

designMatrix.cd4 <- model.matrix(~ co$Case_Control + co$age + co$Slide + co$V1+co$V2+co$V3+co$V4+co$V5)

> designMatrix.cd4
   (Intercept) co$Case_Controlcontrol co$age co$Slide200511420047 co$Slide200511490138
1            1                      1     40                    0                    0
2            1                      0     21                    0                    0
3            1                      1     26                    0                    0
4            1                      0     45                    0                    0
5            1                      1     36                    0                    0
6            1                      0     40                    0                    0
7            1                      1     39                    0                    0
8            1                      0     42                    0                    0
9            1                      1     44                    1                    0
10           1                      1     32                    1                    0
11           1                      1     31                    1                    0
12           1                      0     28                    1                    0
13           1                      0     41                    1                    0
14           1                      0     27                    1                    0
15           1                      1     50                    1                    0

(truncated for brevity)

   co$Slide200514030024        co$V1        co$V2        co$V3        co$V4        co$V5
1                     0  0.047923523 -0.181910479  0.060895002  0.866685378 -0.214546150
2                     0 -0.188453422  0.179402259 -0.252131374 -0.025424623  0.023028849
3                     0  0.268085456 -0.032507710 -0.019089640  0.008836012  0.079966305
4                     0  0.042117679 -0.036962380  0.184850866 -0.016037051  0.127193459
5                     0  0.167618719  0.194599225 -0.305029891  0.067590586 -0.258778242
6                     0 -0.276843340 -0.094691471 -0.010369113 -0.069367958  0.043236105
7                     0 -0.049253300  0.108484411  0.116331113  0.033721849  0.043213022
8                     0  0.266299403 -0.054938056 -0.076866000 -0.110844577  0.009955268
9                     0  0.108190754 -0.097773476  0.223017372 -0.148301926  0.080420394
10                    0  0.025893949 -0.268081823  0.354818493 -0.285196009 -0.667969895
11                    0  0.207198113 -0.205765185  0.116219149 -0.005186683  0.128944151
12                    0  0.174724057  0.101925560 -0.120504564 -0.132569485  0.043161686
13                    0 -0.232423159  0.228806618  0.227573437  0.154334952  0.065421929
14                    0  0.241905966  0.284997292  0.091716412 -0.061444589  0.194972524
15                    0 -0.064998471  0.212879335 -0.184773305  0.012051083 -0.195159122
(truncated for brevity)

attr(,"assign")
 [1] 0 1 2 3 3 3 4 5 6 7 8
attr(,"contrasts")
attr(,"contrasts")$`co$Case_Control`
[1] "contr.treatment"

attr(,"contrasts")$`co$Slide`
[1] "contr.treatment"

A little more information: out of ~39,000 bumps, ~50 have "Inf" or "-Inf", and then I start seeing results I would expect. (And I see some of the same significant regions that were identified with permutation tests.)

For question 3, I think the part I don't understand is how p-values can be based on smoothed betas rather than the areas of the peaks. At any rate, would it be appropriate to adjust either p.value or p.valueArea to control the FDR, if I wanted to use something less strict that the FWER?

-Brooke

ADD REPLY • link 8.9 years ago brhead ▴ 40

0

Entering edit mode

If you have 39K peaks, then you need to restrict further. That's just too many to be useful. I would argue that a number of peaks < 100 is a tractable set of peaks, after which you start running into problems with a) crappy little one-CpG peaks that are probably not reliable, and b) just too much data to get a reasonable inference from. I usually don't let bumphunter decide on the cutoff, and instead (IIRC) use pickCutoffQ to set the quantile to like 95% or 99%.

You will get output saying how many bumps you have, and I usually play around with the cutoffs until I get a reasonable starting number of bumps. You can speed that process up by doing no permutations (e.g., set B = 0), and starting with a really restrictive cutoff, which will go really fast. You can then relax the cutoff until you have the number of bumps you want, and once you get a set of bumps that you think is a reasonable N, you can then go back to B = 1000.

For question 3, maybe I wasn't as explicit as I could have been. What bumphunter does is to first go through and compute a beta for each CpG, based on the design matrix you specified. It then defines clusters of CpGs based on the genomic distance you specified (in your case you specified the default of a maximum 500 nt gap between CpGs, so any CpGs in a cluster are all within 500 nt of their adjacent neighbor). If you say to smooth (which you did), then any cluster with 7 or more CpGs in it (this is the default for locfitByCluster) will then have the individual betas smoothed, based on a locally weighted smoother (locfit from the locfit package).

Then bumphunter defines the cluster 'beta' as the mean of the betas within the cluster. The p-value for the cluster beta is then the proportion of the null (bootstraps) betas for that cluster that were as large or larger than the observed cluster beta.

As for whether or not you should use FWER or FDR or whatever, that's a difficult question. In a pedantic sense, you should be very wary of interpreting the p-values and FWER to begin with. Both the p-value and FWER have almost nothing to do with the observed value, and instead are based entirely on the null distribution. Remember that a p-value is the expected proportion of null statistics that equal or exceed the observed statistic. The observed statistic is simply used as a cutoff to estimate the proportion of null statistics we expect to equal or exceed it. But that presupposes we can accurately generate null statistics! If we aren't doing that right, then all bets are off.

ADD REPLY • link 8.9 years ago James W. MacDonald 68k

0

Entering edit mode

Dear James,

Thank you for all of your help with my bumphunter questions. I was able to solve the "Inf" problem . . . it turns out that there were ~100 infinite M values in my matrix. I'm not sure why they were there, but once I removed them, bumphunter stopped producing "Inf" and "-Inf" values in the output table when using bootstrap.

I have also changed the cutoff using a pickCutoffQ of .99 or even 0.999 (results in ~700 bumps), but I still get some "crappy little one-CpG peaks." I have one more question about those: what is the difference between L and clusterL in the bumphunter output table? I see some instances where L=1, but clusterL is something much higher, like 9.

-Brooke

ADD REPLY • link 8.8 years ago brhead ▴ 40