Hello Binbin,
It would be helpful to describe your problem and post it to the whole
message board. (There are many experts there who can probably be more
helpful than I can :-)) That said, I think you are referring to the
"NaN" error, and below are my thoughts (Julie Zhu has also answered
this a couple of times and her reply is probably in the archives).
When calling the makeVennDiagram function you want to set totalTest
to a number larger than the experimentally determined peak number. As
far as I know, totalTest is the population size for the
hypergeometric test that determines whether the overlap between two
datasets is larger than would be expected by chance. So one way to
sort this out using biological information is to think about the
maximum number of possible binding events and use that as totalTest.
For example, if you are studying a sequence-specific DNA binding
protein with a known motif, you could count the number of times that
motif occurs in the genome and compare that to the number of peaks
you have experimentally determined.
Motifs = 500
Peaks = 200
Peaks w/ motif = 180 (90%)
"upper limit" = 500
new "upper limit" for totalTest = .9 x 500 = 450
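The arithmetic above can be sketched in a few lines of Python. This is
only an illustration of the reasoning, not part of ChIPpeakAnno: the
function name motif_based_total_test is made up for this example, and
the inputs are the toy numbers from the list.

```python
def motif_based_total_test(motif_sites, peaks, peaks_with_motif):
    """Scale the genome-wide motif count by the fraction of peaks carrying the motif."""
    if peaks <= 0:
        raise ValueError("peaks must be positive")
    # 500 motif sites * 180 peaks with motif / 200 peaks
    return round(motif_sites * peaks_with_motif / peaks)


total_test = motif_based_total_test(motif_sites=500, peaks=200, peaks_with_motif=180)
print(total_test)  # 450
```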
Now if you're working with a sequence-independent binding factor it
can get tricky. One approach would be to determine the mean peak
width, then divide the genome size by this number to get an upper
limit. This is probably way too high, so using additional information,
such as whether the protein binds intergenic regions or ORFs, could
bring the number down and make it more relevant to the biological
experiment. For example:
peaks = 75
intergenic peaks = 70
ORF peaks = 5
mean peak width = 50 base pairs
genome size = 10000 base pairs
"upper limit" = 10000/ 50 = 200 (possible peaks)
intergenic seq = 4000 base pairs
new "upper limit" = 4000/50 = 80 (possible intergenic peaks)
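The same calculation as a small Python sketch, assuming the upper limit
is simply the number of non-overlapping windows of mean peak width that
fit into the region. The helper name width_based_total_test is
hypothetical and the sizes are the toy values from the list.

```python
def width_based_total_test(region_size_bp, mean_peak_width_bp):
    """Count the non-overlapping windows of mean peak width that fit in a region."""
    if mean_peak_width_bp <= 0:
        raise ValueError("mean_peak_width_bp must be positive")
    return region_size_bp // mean_peak_width_bp


whole_genome_limit = width_based_total_test(10_000, 50)  # 200 possible peaks
intergenic_limit = width_based_total_test(4_000, 50)     # 80 possible intergenic peaks
print(whole_genome_limit, intergenic_limit)
```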
I was working with something more like the second case. I felt that
a totalTest based on the total genome was quite relaxed and one based
on the intergenic sequence only was quite stringent, so somewhere in
the middle might be better. Most importantly, I feel I am standing on
some solid biological reasoning for determining the amount of
sampling.
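To make "relaxed" versus "stringent" concrete, here is a rough Python
sketch of an upper-tail hypergeometric p-value evaluated with the two
upper limits from the example. It is not the ChIPpeakAnno code, only a
plain version of the test as I understand it. The function name
overlap_p_value, the second peak set of 60 peaks and the overlap of 58
are all invented for illustration.

```python
from math import comb


def overlap_p_value(total_test, size_a, size_b, overlap):
    """P(X >= overlap) when size_b sites are drawn from total_test sites,
    size_a of which belong to peak set A."""
    if max(size_a, size_b) > total_test:
        raise ValueError("total_test must be at least as large as each peak set")
    upper = min(size_a, size_b)
    denom = comb(total_test, size_b)
    tail = sum(
        comb(size_a, k) * comb(total_test - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denom


# Hypothetical numbers: 75 peaks in set A (as in the example above),
# 60 peaks in set B and 58 shared peaks (both invented for illustration).
p_genome = overlap_p_value(200, 75, 60, 58)     # whole-genome upper limit
p_intergenic = overlap_p_value(80, 75, 60, 58)  # intergenic-only upper limit
print(p_genome, p_intergenic)
```

With the larger total the same overlap looks much less likely to occur
by chance, so the p-value is smaller, which is what I mean by the
whole-genome number being the relaxed choice.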
Hope this helps, and I would be interested to hear if anybody has
critiques of this approach or additional suggestions.
Best,
Noah
On Nov 16, 2010, at 7:31 AM, Binbin Liu wrote:
> Dear Noah,
>
> I saw your post on the Bioconductor mailing list regarding the
> totalTest number for the P-value calculation in
> ChIPpeakAnno::makeVennDiagram(). I am having the same problem. Can
> I ask how you got it sorted?
>
>
> Many thanks.
>
> Binbin
Dear Noah,
Many thanks for your detailed explanation of how totalTest is defined.
What I am doing is similar to the second case. However, the TF we are
interested in could bind anywhere in the genome. So with an mm9 genome
size of 2.7E+9 bp and a peak width <= 200 bp, totalTest is 1.35E+7. It
seems very computationally costly to run ChIPpeakAnno with this value.
Nevertheless, do you think it is reasonable?
Thanks,
Binbin
On 16 Nov 2010, at 18:41, Noah Dowell wrote:
> [...]
Binbin,
In the current implementation of makeVennDiagram, the time used to
calculate the p-value does not depend on totalTest.
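As a rough illustration of why a large total need not be expensive,
here is a Python sketch of a hypergeometric tail computed with
log-gamma terms. It is not the ChIPpeakAnno implementation. The
function names and the peak set sizes (5000 and 4000 peaks, 100
shared) are hypothetical; only the total of 1.35E+7 comes from your
message. The loop runs over possible overlap values, so its length is
bounded by the smaller peak set and not by the total.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def overlap_p_value_large(total_test, size_a, size_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap), summed term by term."""
    upper = min(size_a, size_b)
    # Overlaps below size_a + size_b - total_test are impossible.
    lower = max(overlap, size_a + size_b - total_test, 0)
    log_denom = log_comb(total_test, size_b)
    tail = 0.0
    for k in range(lower, upper + 1):
        log_term = (
            log_comb(size_a, k)
            + log_comb(total_test - size_a, size_b - k)
            - log_denom
        )
        tail += exp(log_term)
    return min(tail, 1.0)


# Hypothetical peak sets tested against a total of 1.35E+7 possible sites.
p_large = overlap_p_value_large(13_500_000, 5_000, 4_000, 100)
print(p_large)
```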
Noah, thanks so much for sharing your insights!
Best regards,
Julie
On 11/18/10 11:26 AM, "Binbin Liu" <b.b.liu at leeds.ac.uk> wrote:
> [...]