Question

Question on Hypergeometric P value generated from makeVennDiagram in ChIPpeakAnno

0

Entering edit mode

Noah Dowell ▴ 410

@noah-dowell-3791

Last seen 9.6 years ago

Hello to All, I could not find this in the archives so I figured I could post this to the board. It might be to much of an R question, but it arose during the use of the excellent ChIPpeakAnno package. Problem and Example: I have ChIP-chip binding data for two transcription factors (TF) and I have mapped their binding to genomic positions and I would like to see if there is significant overlap between the binding peaks. The makeVennDiagram function in ChIPpeakAnno does this for me, but it still seems unclear how the parameters for p.value generation should be set. Here are the function inputs: makeVennDiagram(RangedDataList(smallAnnoTFA, smallAnnoTFB), NameOfPeaks=c("TFA", "TFB"), maxgap = 0, totalTest= 10000, cex = 1, counts.col = "purple") totalTest specifies the total number of tests performed to obtain the list of peaks. This number should be larger than the largest number of peaks in A and B. So here are example data: TFA peaks = 5118 TFB peaks = 1067 overlapTFA_TFB = 708 totalTest = 7500 gives a p.value = 0.9279632 totalTest = 10000 gives a p.value = 2.39585e-26 So I looked at the makeVennDiagram function to see how this p value is being generated and find the phyper function is being used. For example: p.value.overall1 = phyper(p1.and.p2.and.p3 - 1, p1, totalTest - p1, p2.and.p3, lower.tail = FALSE, log.p = FALSE) where: p1.and.p2.and.p3 = length(unique(overlappingPeaks123[[NameOfPeaks[1]]])) p1 = length(rownames(Peaks[[1]])) p2.and.p3 = counts3[4, 3] in my example data from above: p1.and.p2.and.p3 = 708 p1 =5118 totalTest - p1 = 7500 -5118 p2.and.p3 = 1067 (I think???) So I try to relate this to the base function in R and I get confused: phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE) the assignment of q, m, n, and k in the makeVennDiagram function does not seem to follow the definition of q, m, n, k on the R help page. The variable m follows as expected, but I figured the k variable should be 708 in my example data and not 1067. Additionally, the variable n in the makeVennDiagram function depends on how totalTest is defined, but shouldn't it be defined by the peak number of TFB? The larger I make n (by increasing my totalTest number) in the function then the smaller my p.value gets. I am a little worried that I could be creating an artifact due to my incomplete understanding of how the hypergeometric test is being implemented. Thank you for any help you can provide. Noah > sessionInfo() R version 2.11.0 (2010-04-22) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] org.Sc.sgd.db_2.4.1 ChIPpeakAnno_1.4.0 limma_3.4.0 [4] org.Hs.eg.db_2.4.1 GO.db_2.4.1 RSQLite_0.8-4 [7] DBI_0.2-5 AnnotationDbi_1.10.0 BSgenome.Ecoli.NCBI.20080805_1.3.16 [10] BSgenome_1.16.0 GenomicRanges_1.0.1 Biostrings_2.16.0 [13] multtest_2.4.0 Biobase_2.8.0 biomaRt_2.4.0 [16] rtracklayer_1.8.1 RCurl_1.3-1 bitops_1.0-4.1 [19] IRanges_1.6.0 loaded via a namespace (and not attached): [1] MASS_7.3-5 splines_2.11.0 survival_2.35-8 tools_2.11.0 XML_2.8-1 [[alternative HTML version deleted]]

Transcription GO ChIPpeakAnno Transcription GO ChIPpeakAnno • 1.8k views

ADD COMMENT • link updated 13.8 years ago by Julie Zhu ★ 4.3k • written 13.8 years ago by Noah Dowell ▴ 410

score 0 · Answer 1 · 2010-07-26

Noah, Thanks for the positive feedback! For ChIP-chip experiment, the parameter totalTest could be estimated by counting the total number of peaks with p-value =1 or FDR=1 depending on how you obtained your peaks. FYI, I will be giving a practical tutoring in the Bioconductor meeting this week. We could discuss it further face to face if you happen to attend the meeting as well. Otherwise, we can schedule a time to talk. Kind regards, Julie On 7/24/10 6:04 PM, "Noah Dowell" <noahd@ucla.edu> wrote: Hello to All, I could not find this in the archives so I figured I could post this to the board. It might be to much of an R question, but it arose during the use of the excellent ChIPpeakAnno package. Problem and Example: I have ChIP-chip binding data for two transcription factors (TF) and I have mapped their binding to genomic positions and I would like to see if there is significant overlap between the binding peaks. The makeVennDiagram function in ChIPpeakAnno does this for me, but it still seems unclear how the parameters for p.value generation should be set. Here are the function inputs: makeVennDiagram(RangedDataList(smallAnnoTFA, smallAnnoTFB), NameOfPeaks=c("TFA", "TFB"), maxgap = 0, totalTest= 10000, cex = 1, counts.col = "purple") totalTest specifies the total number of tests performed to obtain the list of peaks. This number should be larger than the largest number of peaks in A and B. So here are example data: TFA peaks = 5118 TFB peaks = 1067 overlapTFA_TFB = 708 totalTest = 7500 gives a p.value = 0.9279632 totalTest = 10000 gives a p.value = 2.39585e-26 So I looked at the makeVennDiagram function to see how this p value is being generated and find the phyper function is being used. For example: p.value.overall1 = phyper(p1.and.p2.and.p3 - 1, p1, totalTest - p1, p2.and.p3, lower.tail = FALSE, log.p = FALSE) where: p1.and.p2.and.p3 = length(unique(overlappingPeaks123[[NameOfPeaks[1]]])) p1 = length(rownames(Peaks[[1]])) p2.and.p3 = counts3[4, 3] in my example data from above: p1.and.p2.and.p3 = 708 p1 =5118 totalTest - p1 = 7500 -5118 p2.and.p3 = 1067 (I think???) So I try to relate this to the base function in R and I get confused: phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE) the assignment of q, m, n, and k in the makeVennDiagram function does not seem to follow the definition of q, m, n, k on the R help page. The variable m follows as expected, but I figured the k variable should be 708 in my example data and not 1067. Additionally, the variable n in the makeVennDiagram function depends on how totalTest is defined, but shouldn't it be defined by the peak number of TFB? The larger I make n (by increasing my totalTest number) in the function then the smaller my p.value gets. I am a little worried that I could be creating an artifact due to my incomplete understanding of how the hypergeometric test is being implemented. Thank you for any help you can provide. Noah > sessionInfo() R version 2.11.0 (2010-04-22) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] org.Sc.sgd.db_2.4.1 ChIPpeakAnno_1.4.0 limma_3.4.0 [4] org.Hs.eg.db_2.4.1 GO.db_2.4.1 RSQLite_0.8-4 [7] DBI_0.2-5 AnnotationDbi_1.10.0 BSgenome.Ecoli.NCBI.20080805_1.3.16 [10] BSgenome_1.16.0 GenomicRanges_1.0.1 Biostrings_2.16.0 [13] multtest_2.4.0 Biobase_2.8.0 biomaRt_2.4.0 [16] rtracklayer_1.8.1 RCurl_1.3-1 bitops_1.0-4.1 [19] IRanges_1.6.0 loaded via a namespace (and not attached): [1] MASS_7.3-5 splines_2.11.0 survival_2.35-8 tools_2.11.0 XML_2.8-1 [[alternative HTML version deleted]]