Package or code to quantify the significance of the Venn overlap between 2 or 3 lists of genes


k. brand ▴ 420

@k-brand-1874


Dear BioCers,

I've got six lists of genes, and I'm focused on the overlaps between them.

What I'm searching for is a package or code to quantify the significance of the overlap between a pair of gene lists, and also between three gene lists. Six might be interesting, but is not necessary.

Specifically, what overlap would be expected by chance, and how many standard deviations is my actual overlap from that estimated chance overlap?

Whilst some of my lists are independent, others are not, being derived from tissues of the same origin. I understand this would exclude tests such as Fisher's exact test, which assume independence.

By randomly selecting the same numbers of chip-background probes and short-listed probes of interest and checking the overlap, repeated say 10,000 times, I think I could obtain the estimates I'm looking for in a 'statistically acceptable' manner.

Does anyone know of a package or code written for this purpose? I failed to find anything in Bioconductor or in the BioC lists. As simple as coding it no doubt is, my lack of R knowledge would make doing it with a calculator the faster option :)

I look forward to any recommendations or suggestions, with appreciation,

Karl

--
Karl Brand
Department of Genetics, Erasmus MC
Dr Molewaterplein 50, 3015 GE Rotterdam


ADD COMMENT • link updated 14.1 years ago by Wolfgang Huber ★ 13k • written 14.1 years ago by k. brand ▴ 420


Wolfgang Huber ★ 13k

@wolfgang-huber-3550


EMBL European Molecular Biology Laborat…

Dear Karl,

I don't think what you need here is necessarily a package. The required computations, if possible, are one or a few lines of R using standard functions in the "stats" package, such as phyper.

Perhaps the more important thing is to define precisely the questions you want to ask. For this, discussion with a local statistician might be helpful. Once you have that, the answer will probably be fairly obvious from a basic textbook on combinatorics (probability theory on discrete variables).

Best wishes
Wolfgang

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact
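[Editorial sketch] To make the "one or a few lines of R" concrete, here is a minimal example of how the expected overlap, its standard deviation and a tail probability could be obtained for two lists. It assumes list B behaves like a random draw of probes from the chip; the counts `N`, `a`, `b` and `k` are made-up placeholders, not values from this thread.

```r
# Hypothetical counts, for illustration only.
N <- 45000   # probes on the chip (background)
a <- 500     # size of gene list A
b <- 400     # size of gene list B
k <- 25      # observed overlap between A and B

# Mean and standard deviation of the overlap if list B were a random
# draw of b probes from the N on the chip (hypergeometric distribution).
expected   <- b * a / N
sd_overlap <- sqrt(b * (a / N) * (1 - a / N) * (N - b) / (N - 1))

# Probability of an overlap of k or more under that random-draw model.
p_value <- phyper(k - 1, a, N - a, b, lower.tail = FALSE)

cat("expected overlap:", expected, "\n")
cat("standard deviation:", sd_overlap, "\n")
cat("P(overlap >= k):", p_value, "\n")
```

Whether the random-draw model is the right null hypothesis for a given pair of lists is exactly the question that has to be settled first.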

ADD COMMENT • link 14.1 years ago Wolfgang Huber ★ 13k


Wolfgang Huber ★ 13k

@wolfgang-huber-3550


EMBL European Molecular Biology Laborat…

Dear Karl

[reposting to list]

The bioinformatician was quicker and provided a hack that "works", but a statistician might have pointed out that the simulation scheme you propose below is a needlessly poor and slow approximation of what the hypergeometric distribution or Fisher's test would do faster and more exactly.

"Poor" because the distribution of count variables is (typically, and in particular in your case) not symmetric, and using a standard deviation to define a confidence interval and significance thresholds would ignore that, i.e. give suboptimal results.

Don't get me wrong: I think it's great when people are able to reinvent the wheel, but to get stuff done, using existing wheel designs tends to be more productive.

PS I am not sure what you mean by "non-independent gene lists". If you already know that the lists are dependent, what exactly do you gain by showing that their overlap is higher than if they were independent? Isn't that tautological?

Best wishes
Wolfgang

Karl Brand scripsit 17/03/10 15:45:
> Cheers Wolfgang,
>
> Unfortunately, waiting on my local statistician also takes longer than using the calculator :(
>
> Discussion with a much more responsive bioinformatician yielded the plan to employ a bootstrap/randomisation (terminology?!) approach, i.e.:
>
> By using the same numbers of the chip-background probes (c. 45,000) and my short-list of probes of interest (c. 500), randomly selected and checking the overlap, performed say 10,000 times, an estimate of chance overlap could be obtained, along with a standard deviation to which I could compare my actual results for an estimate of significance, or p-value.
>
> Correct me if we're wrong, but this seemed acceptable for Venns of non-independent gene lists.
>
> Coding this was what I was appealing for help on, since my experience here is limited. But I'm definitely up for a crack at it. I'll start by having a look at phyper in the "stats" package.
>
> Again with appreciation for your prompt, thoughtful response,
>
> Karl

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact
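[Editorial sketch] The comparison between the proposed randomisation scheme and the hypergeometric calculation can be tried directly. The background size (45,000), list size (500) and number of draws (10,000) follow the figures quoted above; the size of the second list `b`, the observed overlap `k` and the seed are assumptions made only for illustration.

```r
set.seed(1)          # arbitrary seed so the sketch is reproducible
N     <- 45000       # chip background, "c. 45,000" in the thread
a     <- 500         # short-list size, "c. 500" in the thread
b     <- 500         # assumed size of the second list
k     <- 15          # hypothetical observed overlap
n_sim <- 10000       # number of random draws, as proposed in the thread

# Fix list A as probes 1..a and draw list B at random each time.
list_a <- seq_len(a)
sim <- replicate(n_sim, sum(sample.int(N, b) %in% list_a))

# Three ways of turning the observed overlap into a p-value:
z       <- (k - mean(sim)) / sd(sim)                      # z-score from simulation
p_norm  <- pnorm(z, lower.tail = FALSE)                   # treats counts as symmetric
p_sim   <- (sum(sim >= k) + 1) / (n_sim + 1)              # empirical tail fraction
p_exact <- phyper(k - 1, a, N - a, b, lower.tail = FALSE) # exact hypergeometric tail

print(c(p_norm = p_norm, p_sim = p_sim, p_exact = p_exact))
```

With small expected counts like these, the overlap distribution is right-skewed, so the z-score route tends to understate the tail probability, while the empirical fraction only approximates what `phyper` returns exactly and at once.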

ADD COMMENT • link 14.1 years ago Wolfgang Huber ★ 13k


On Wed, Mar 17, 2010 at 11:16 AM, Wolfgang Huber wrote:
> Karl Brand scripsit 17/03/10 15:45:
>> By using the same numbers of the chip-background probes (c. 45,000) and my short-list of probes of interest (c. 500), randomly selected and checking the overlap, performed say 10,000 times, an estimate of chance overlap could be obtained, along with a standard deviation to which I could compare my actual results for an estimate of significance, or p-value.

Just to add to Wolfgang's sentiments here: using random permutation testing essentially assumes that the findings (both within sample and between samples) are "independent" of each other. Such permutation testing is useful for accounting for some other biases in the data (more than one probe per gene, for example). This isn't a bad way to go, given that the dependencies and correlations are generally unknown, but it is important to realize that such an analysis has these underlying assumptions.

Sean

ADD REPLY • link 14.1 years ago Sean Davis 21k


Thank you Wolfgang, Madelaine,

I'd rather not reinvent the wheel if I can help it.

And if you'll humor me a little longer, perhaps you can ensure I do what you suggest correctly for my exact application.

The overlaps I have are between 6 datasets. The experiment consisted of a treatment (Pperiod) with 3 levels (S, E & L) applied to 2 tissues (R & C). FYI, the targets file is below if it helps. Each of the 6 datasets contains 16 time points, which I interrogated for transcripts that fit a sine curve and several other criteria, thus defining a list of 'rhythmic genes' for each of the 6 datasets.

So an obvious question is which rhythmic transcripts are common between various combinations of the 6 datasets. The combinations are:

Venn 1: Overlapping the 3 datasets of the 3 levels of treatment for tissue 'R'
Venn 2: As above for tissue 'C'
Venn 3: Overlapping 'R' and 'C' for treatment level 1 only.
Venn 4: As for 3, for treatment level 2 only.
Venn 5: As for 3, for treatment level 3 only.

So what I meant by "non-independent gene lists" might, I think, apply to Venn 3, 4 and 5, given that tissues 'R' and 'C' are obtained from the same animals, albeit 16 of them, and as time courses. But still, they cannot strictly speaking be considered independent, right? And independence is something I thought some tests, including Fisher's, depend on.

Knowing this, would you think the phyper function is the right one for my purpose? If so, I'll plough on, with at least the confidence that someone with a lot more experience on this than me recommends it!

Again my thanks for engaging my query,

Karl

"RNA_Targets.txt":

FileName    Tissue  Pperiod  Time  Animal
01file.CEL  R       S        1     1
02file.CEL  C       S        1     1
03file.CEL  R       S        2     2
04file.CEL  C       S        2     2
05file.CEL  R       S        3     3
06file.CEL  C       S        3     3
07file.CEL  R       S        4     4
08file.CEL  C       S        4     4
09file.CEL  R       S        5     5
10file.CEL  C       S        5     5
11file.CEL  R       S        6     6
12file.CEL  C       S        6     6
13file.CEL  R       S        7     7
14file.CEL  C       S        7     7
15file.CEL  R       S        8     8
16file.CEL  C       S        8     8
17file.CEL  R       S        9     9
18file.CEL  C       S        9     9
19file.CEL  R       S        10    10
20file.CEL  C       S        10    10
21file.CEL  R       S        11    11
22file.CEL  C       S        11    11
23file.CEL  R       S        12    12
24file.CEL  C       S        12    12
25file.CEL  R       S        13    13
26file.CEL  C       S        13    13
27file.CEL  R       S        14    14
28file.CEL  C       S        14    14
29file.CEL  R       S        15    15
30file.CEL  C       S        15    15
31file.CEL  R       S        16    16
32file.CEL  C       S        16    16
33file.CEL  R       E        1     17
34file.CEL  C       E        1     17
35file.CEL  R       E        2     18
36file.CEL  C       E        2     18
37file.CEL  R       E        3     19
38file.CEL  C       E        3     19
39file.CEL  R       E        4     20
40file.CEL  C       E        4     20
41file.CEL  R       E        5     21
42file.CEL  C       E        5     21
43file.CEL  R       E        6     22
44file.CEL  C       E        6     22
45file.CEL  R       E        7     23
46file.CEL  C       E        7     23
47file.CEL  R       E        8     24
48file.CEL  C       E        8     24
49file.CEL  R       E        9     25
50file.CEL  C       E        9     25
51file.CEL  R       E        10    26
52file.CEL  C       E        10    26
53file.CEL  R       E        11    27
54file.CEL  C       E        11    27
55file.CEL  R       E        12    28
56file.CEL  C       E        12    28
57file.CEL  R       E        13    29
58file.CEL  C       E        13    29
59file.CEL  R       E        14    30
60file.CEL  C       E        14    30
61file.CEL  R       E        15    31
62file.CEL  C       E        15    31
63file.CEL  R       E        16    32
64file.CEL  C       E        16    32
65file.CEL  R       L        1     33
66file.CEL  C       L        1     33
67file.CEL  R       L        2     34
68file.CEL  C       L        2     34
69file.CEL  R       L        3     35
70file.CEL  C       L        3     35
71file.CEL  R       L        4     36
72file.CEL  C       L        4     36
73file.CEL  R       L        5     37
74file.CEL  C       L        5     37
75file.CEL  R       L        6     38
76file.CEL  C       L        6     38
77file.CEL  R       L        7     39
78file.CEL  C       L        7     39
79file.CEL  R       L        8     40
80file.CEL  C       L        8     40
81file.CEL  R       L        9     41
82file.CEL  C       L        9     41
83file.CEL  R       L        10    42
84file.CEL  C       L        10    42
85file.CEL  R       L        11    43
86file.CEL  C       L        11    43
87file.CEL  R       L        12    44
88file.CEL  C       L        12    44
89file.CEL  R       L        13    45
90file.CEL  C       L        13    45
91file.CEL  R       L        14    46
92file.CEL  C       L        14    46
93file.CEL  R       L        15    47
94file.CEL  C       L        15    47
95file.CEL  R       L        16    48
96file.CEL  C       L        16    48

--
Karl Brand
Department of Genetics, Erasmus MC
Dr Molewaterplein 50, 3015 GE Rotterdam
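[Editorial sketch] For three-list Venns such as Venn 1 and Venn 2, nobody in this thread proposes a specific method. One possible approach, shown here purely as an assumption-laden illustration, is to simulate the three-way overlap under the null model that all three lists are independent random draws from the chip. All counts below are hypothetical placeholders.

```r
set.seed(1)       # arbitrary seed so the sketch is reproducible
N     <- 45000    # chip background (hypothetical)
n_a   <- 500      # sizes of the three lists (hypothetical)
n_b   <- 500
n_c   <- 500
k3    <- 3        # hypothetical observed three-way overlap
n_sim <- 5000     # number of random draws

# Fix list A as probes 1..n_a; draw lists B and C at random each time
# and count the probes present in all three lists.
list_a <- seq_len(n_a)
triple <- replicate(n_sim, {
  list_b <- sample.int(N, n_b)
  list_c <- sample.int(N, n_c)
  length(intersect(intersect(list_a, list_b), list_c))
})

# Expected three-way overlap under mutual independence, and the
# empirical fraction of draws with an overlap at least as large as k3.
expected3 <- n_a * n_b * n_c / N^2
p_sim3    <- (sum(triple >= k3) + 1) / (n_sim + 1)

print(c(expected = expected3, simulated_mean = mean(triple), p = p_sim3))
```

Note the limitation of this null model: a small p-value only says the three lists are not mutually independent, which can happen because just two of them overlap strongly. It does not by itself show a genuine three-way effect.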

ADD REPLY • link 14.1 years ago k. brand ▴ 420


Dear List,

I tried the phyper function as follows:

#phyper(overlaplistA&B-1, genelistA, totalprobesonchip-genelistA, genelistB, lower.tail = FALSE, log.p = FALSE)

The output seemed logical to me, but I'd really appreciate someone's patience and experience to address some concerns:

- Is it 'safe' to employ this test where genelistA and genelistB were obtained from AnimalX-tissue1 and AnimalX-tissue2 respectively? That is, do I violate any data independence assumptions this test makes?

- The output value is a 'distribution function'. Can I interpret this as something like the 'likelihood that my observed result is due to chance alone'?

- Do I need to subtract 1 from my 'overlap'? In the example I followed at tinyurl.com/ygtmefa this appears to be the case, but the vignette has nothing on this.

- *Most of all*, how can I perform this test on three lists of overlapping genes, not merely the two in this case? Maybe someone knows a hack or method to combine the 3 outputs (of three pairwise comparisons) into an estimate for the 3-way overlap? Even a conservative estimate would be better than nothing!

With thanks in advance for thoughts and suggestions,

cheers,

Karl

--
Karl Brand
Department of Genetics, Erasmus MC
Dr Molewaterplein 50, 3015 GE Rotterdam
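[Editorial sketch] The "subtract 1" question can be checked numerically. With `lower.tail = FALSE`, `phyper(q, ...)` returns P(X > q), so passing the overlap minus 1 gives P(X >= overlap). The sketch below, using made-up counts, confirms this against a direct sum of `dhyper` and against a one-sided `fisher.test` on the corresponding 2x2 table.

```r
# Hypothetical counts, for illustration only.
N <- 45000   # total probes on the chip
a <- 500     # size of gene list A
b <- 400     # size of gene list B
k <- 25      # observed overlap

# P(X >= k): the "- 1" is needed because the upper tail excludes q itself.
p_upper <- phyper(k - 1, a, N - a, b, lower.tail = FALSE)

# The same probability, summed term by term.
p_check <- sum(dhyper(k:min(a, b), a, N - a, b))

# One-sided Fisher's exact test on the 2x2 table
# (in A / not in A) x (in B / not in B).
tab <- matrix(c(k,     a - k,
                b - k, N - a - b + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(all.equal(p_upper, p_check))
print(all.equal(p_upper, p_fisher))
```

On interpretation: the value is the probability of seeing an overlap at least this large if list B were a random draw from the chip. It is not the probability that the observed overlap "is due to chance", and it says nothing about whether the random-draw model suits lists derived from the same animals.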

ADD REPLY • link 14.1 years ago k. brand ▴ 420
