function similar to phyper function that can handle 3 or more gene lists
k. brand ▴ 420
@k-brand-1874
Last seen 9.9 years ago
Dear List,

This is a repost of "Re: [BioC] package or code to quantify the significance of the venn overlap between 2 or 3 lists of genes", with a related but new question born of my success using phyper.

I employed phyper to estimate the likelihood that the number of genes overlapping between 2 different lists of genes is due to chance. I need to do the same with 3 lists of genes, which phyper doesn't appear capable of. Can anyone recommend a function or share a script which might achieve this?

Previous post/discussion below if it helps.

With thanks in advance, cheers,

Karl

> -------- Original Message --------
> Subject: Re: [BioC] package or code to quantify the significance of the venn overlap between 2 or 3 lists of genes
> Date: Thu, 18 Mar 2010 17:25:14 +0100
> From: Karl Brand (k.brand at erasmusmc.nl)
> To: bioconductor at stat.math.ethz.ch
> CC: Wolfgang Huber (whuber at embl.de), MCM at stowers.org, seandavi at gmail.com
>
> Dear List,
>
> I tried the phyper function as follows:
>
> #phyper(overlaplistA&B-1, genelistA, totalprobesonchip-genelistA, genelistB, lower.tail = FALSE, log.p = FALSE)
>
> The output seemed logical to me. But I'd really appreciate someone's patience and experience to confirm some concerns:
>
> - Is it 'safe' to employ this test where genelistA and genelistB were obtained from AnimalX-tissue1 and AnimalX-tissue2 respectively? I.e., do I violate any data independence assumptions of this test?
>
> - The output Value is a 'distribution function'. Can I interpret this as something like the 'likelihood that my observed result is due to chance alone'?
>
> - Do I need to subtract 1 from my 'overlap'? In the example I followed at tinyurl.com/ygtmefa this appears to be the case, but the vignette has nothing on this.
>
> *Most of all*, how can I perform this test on three lists of overlapping genes, not merely the two in this case?
> Maybe someone knows a hack/method to combine the 3 outputs (of three pairwise comparisons) for an estimate of the 3-way overlap? Even a conservative estimate would be better than nothing!
>
> With thanks in advance for thoughts and suggestions, cheers,
>
> Karl
>
> On 3/17/2010 5:16 PM, Karl Brand wrote:
>> Thank you Wolfgang, Madelaine,
>>
>> I'd rather not reinvent the wheel if I can help it.
>>
>> And if you'll humor me a little longer, perhaps you can ensure I do what you suggest correctly for my exact application.
>>
>> The overlaps I have are between 6 datasets. The experiment consisted of a treatment (Pperiod) with 3 levels (S, E & L) applied to 2 tissues (R & C). FYI targets file below if it helps. Each of the 6 datasets contains 16 time points, which I interrogated for transcripts that fit a sine curve and several other criteria, thus defining a list of 'rhythmic genes' for each of the 6 datasets.
>>
>> So an obvious question is which rhythmic transcripts are common between various combinations of the 6 datasets. Combinations being:
>>
>> Venn 1: Overlapping the 3 datasets of the 3 levels of treatment for tissue 'R'
>> Venn 2: As above for tissue 'C'
>> Venn 3: Overlapping 'R' and 'C' for treatment level 1 only.
>> Venn 4: As for 3, for treatment level 2 only.
>> Venn 5: As for 3, for treatment level 3 only.
>>
>> So what I meant by "non-independent gene lists" I think might apply to Venn 3, 4 and 5, given the fact that tissues 'R' & 'C' are obtained from the same animals, albeit 16 of them, and as time courses. But still, they cannot strictly speaking be considered independent, right? Which I thought some tests, including Fisher's, depend on.
>>
>> Knowing this, would you think the phyper function is the right one for my purpose? If so I'll plough on with the vindication of at least the confidence that...someone with a lot more experience on this than me recommends it!
>>
>> Again my thanks for engaging my query,
>>
>> Karl
>>
>> "RNA_Targets.txt":
>>
>> FileName    Tissue  Pperiod  Time  Animal
>> 01file.CEL  R       S        1     1
>> 02file.CEL  C       S        1     1
>> 03file.CEL  R       S        2     2
>> 04file.CEL  C       S        2     2
>> 05file.CEL  R       S        3     3
>> 06file.CEL  C       S        3     3
>> 07file.CEL  R       S        4     4
>> 08file.CEL  C       S        4     4
>> 09file.CEL  R       S        5     5
>> 10file.CEL  C       S        5     5
>> 11file.CEL  R       S        6     6
>> 12file.CEL  C       S        6     6
>> 13file.CEL  R       S        7     7
>> 14file.CEL  C       S        7     7
>> 15file.CEL  R       S        8     8
>> 16file.CEL  C       S        8     8
>> 17file.CEL  R       S        9     9
>> 18file.CEL  C       S        9     9
>> 19file.CEL  R       S        10    10
>> 20file.CEL  C       S        10    10
>> 21file.CEL  R       S        11    11
>> 22file.CEL  C       S        11    11
>> 23file.CEL  R       S        12    12
>> 24file.CEL  C       S        12    12
>> 25file.CEL  R       S        13    13
>> 26file.CEL  C       S        13    13
>> 27file.CEL  R       S        14    14
>> 28file.CEL  C       S        14    14
>> 29file.CEL  R       S        15    15
>> 30file.CEL  C       S        15    15
>> 31file.CEL  R       S        16    16
>> 32file.CEL  C       S        16    16
>> 33file.CEL  R       E        1     17
>> 34file.CEL  C       E        1     17
>> 35file.CEL  R       E        2     18
>> 36file.CEL  C       E        2     18
>> 37file.CEL  R       E        3     19
>> 38file.CEL  C       E        3     19
>> 39file.CEL  R       E        4     20
>> 40file.CEL  C       E        4     20
>> 41file.CEL  R       E        5     21
>> 42file.CEL  C       E        5     21
>> 43file.CEL  R       E        6     22
>> 44file.CEL  C       E        6     22
>> 45file.CEL  R       E        7     23
>> 46file.CEL  C       E        7     23
>> 47file.CEL  R       E        8     24
>> 48file.CEL  C       E        8     24
>> 49file.CEL  R       E        9     25
>> 50file.CEL  C       E        9     25
>> 51file.CEL  R       E        10    26
>> 52file.CEL  C       E        10    26
>> 53file.CEL  R       E        11    27
>> 54file.CEL  C       E        11    27
>> 55file.CEL  R       E        12    28
>> 56file.CEL  C       E        12    28
>> 57file.CEL  R       E        13    29
>> 58file.CEL  C       E        13    29
>> 59file.CEL  R       E        14    30
>> 60file.CEL  C       E        14    30
>> 61file.CEL  R       E        15    31
>> 62file.CEL  C       E        15    31
>> 63file.CEL  R       E        16    32
>> 64file.CEL  C       E        16    32
>> 65file.CEL  R       L        1     33
>> 66file.CEL  C       L        1     33
>> 67file.CEL  R       L        2     34
>> 68file.CEL  C       L        2     34
>> 69file.CEL  R       L        3     35
>> 70file.CEL  C       L        3     35
>> 71file.CEL  R       L        4     36
>> 72file.CEL  C       L        4     36
>> 73file.CEL  R       L        5     37
>> 74file.CEL  C       L        5     37
>> 75file.CEL  R       L        6     38
>> 76file.CEL  C       L        6     38
>> 77file.CEL  R       L        7     39
>> 78file.CEL  C       L        7     39
>> 79file.CEL  R       L        8     40
>> 80file.CEL  C       L        8     40
>> 81file.CEL  R       L        9     41
>> 82file.CEL  C       L        9     41
>> 83file.CEL  R       L        10    42
>> 84file.CEL  C       L        10    42
>> 85file.CEL  R       L        11    43
>> 86file.CEL  C       L        11    43
>> 87file.CEL  R       L        12    44
>> 88file.CEL  C       L        12    44
>> 89file.CEL  R       L        13    45
>> 90file.CEL  C       L        13    45
>> 91file.CEL  R       L        14    46
>> 92file.CEL  C       L        14    46
>> 93file.CEL  R       L        15    47
>> 94file.CEL  C       L        15    47
>> 95file.CEL  R       L        16    48
>> 96file.CEL  C       L        16    48
>>
>> On 3/17/2010 4:16 PM, Wolfgang Huber wrote:
>>> Dear Karl
>>>
>>> [reposting to list]
>>>
>>> The bioinformatician was quicker, and provided a hack that "works", but a statistician might have pointed out that the simulation scheme you propose below is a needlessly poor and slow approximation of what the hypergeometric distribution or the Fisher test would do faster and more exactly.
>>>
>>> "Poor" because the distribution of count variables is (typically, and in particular in your case) not symmetric, and using a standard deviation to define a confidence interval and significance thresholds would ignore that, i.e. give suboptimal results.
>>>
>>> Don't get me wrong - I think it's great when people are capable of reinventing the wheel, but to get stuff done, using existing wheel designs tends to be more productive.
>>>
>>> PS I am not sure what you mean by "non-independent gene lists". If you already know that the lists are dependent, what exactly do you gain by showing that their overlap is higher than if they were independent? Isn't that tautological?
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>> Karl Brand scripsit 17/03/10 15:45:
>>>> Cheers Wolfgang,
>>>>
>>>> Unfortunately waiting on my local statistician also takes longer than using the calculator :(
>>>>
>>>> Discussion with a much more responsive bioinformatician yielded the plan to employ a bootstrap/randomisation (terminology?!) approach, i.e.:
>>>>
>>>> By using the same numbers of the chip-background probes (c. 45,000) and my short-list of probes of interest (c. 500), randomly selected and checking the overlap, performed say 10,000 times, an estimate of chance overlap could be obtained, along with a standard deviation to which I could compare my actual results for an estimate of significance, or p-value.
>>>>
>>>> Correct me if we're wrong, but this seemed acceptable for Venns of non-independent gene lists.
>>>>
>>>> Coding this was what I was appealing for help on, since my experience here is limiting. But I'm definitely up for a crack at it. I'll start by having a look at the "stats" package phyper.
>>>>
>>>> Again with appreciation for your prompt, thoughtful response,
>>>>
>>>> Karl
>>>>
>>>> On 3/17/2010 2:48 PM, Wolfgang Huber wrote:
>>>>> Dear Karl,
>>>>>
>>>>> I don't think what you need here is necessarily a package - the required computations, if possible, are one or a few lines of R using standard functions, e.g. in the "stats" package, such as phyper.
>>>>>
>>>>> Perhaps the more important thing to do is to precisely define the questions you want to be asking. For this, discussion with a local statistician might be helpful. Once you have that, the answer will probably be fairly obvious from a basic textbook on combinatorics (probability theory on discrete variables).
>>>>>
>>>>> Best wishes
>>>>> Wolfgang
>>>>>
>>>>> Karl Brand scripsit 17/03/10 12:26:
>>>>>> Dear BioCers,
>>>>>>
>>>>>> I've got six lists of genes, and I'm focused on the overlaps between them.
>>>>>>
>>>>>> What I'm searching for is a package or code to quantify the significance of the overlap between both a pair of gene lists, and also between three gene lists. Six might be interesting, but not necessary.
>>>>>>
>>>>>> Specifically, what would the overlap be expected by chance, and how many standard deviations is my actual overlap from the estimated chance overlap?
>>>>>>
>>>>>> Whilst some of my lists are independent, others are not, being derived from tissues of the same origin. I understand this would exclude tests like Fisher's exact test, which assume independence.
>>>>>>
>>>>>> By using the same numbers of chip-background probes and short-listed probes of interest, randomly selected and checking the overlap, performed say 10,000 times, I think I could obtain the estimates I'm looking for in a 'statistically acceptable' manner.
>>>>>>
>>>>>> Does anyone know of a package or code written for this purpose? I failed to find anything in BioConductor or in the BioC lists. As simple as coding it no doubt is, my lack of R knowledge would make doing it with a calculator the faster option :)
>>>>>>
>>>>>> Look forward to any recommendations or suggestions with appreciation,
>>>>>>
>>>>>> Karl

--
Karl Brand
k.brand-asperand-erasmusmc.nl
Department of Genetics, Erasmus MC
Dr Molewaterplein 50, 3015 GE Rotterdam
lab +31 (0)10 704 3409
fax +31 (0)10 704 4743
mob +31 (0)642 777 268
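One way the two-list phyper calculation from the thread could be extended to three lists is sketched below in base R. It is only a sketch under a stated assumption: the null hypothesis is that each list is an independent random draw from the same N probes, so it does not address the same-animal dependence discussed above. The function names overlap_p2, overlap_p3 and sim_p3 and their argument names are made up for illustration; phyper, dhyper, sample.int, replicate and intersect are existing base/stats functions. Because phyper with lower.tail = FALSE returns P(X > q), passing the observed overlap minus 1 gives P(X >= observed overlap), which is why the 1 is subtracted.

```r
# Two lists: P(overlap >= k) when a list of n_b probes is drawn at random
# from N probes, n_a of which belong to list A.
overlap_p2 <- function(k, n_a, n_b, N) {
  phyper(k - 1, n_a, N - n_a, n_b, lower.tail = FALSE)
}

# Three lists (assumption: all three are independent random draws).
# Condition on the size j of the A-B overlap, then treat those j probes
# as the "successes" when list C is drawn.
overlap_p3 <- function(k, n_a, n_b, n_c, N) {
  j <- max(0, n_a + n_b - N):min(n_a, n_b)       # possible A-B overlap sizes
  p_j <- dhyper(j, n_a, N - n_a, n_b)            # P(A-B overlap = j)
  p_tail <- phyper(k - 1, j, N - j, n_c,         # P(3-way overlap >= k | j)
                   lower.tail = FALSE)
  sum(p_j * p_tail)
}

# Randomisation check of the same null, along the lines proposed in the
# thread: draw three random lists n_sim times and count how often the
# three-way overlap reaches k. The +1 terms keep the estimate above zero.
sim_p3 <- function(k, n_a, n_b, n_c, N, n_sim = 10000) {
  hits <- replicate(n_sim, {
    A <- sample.int(N, n_a)
    B <- sample.int(N, n_b)
    C <- sample.int(N, n_c)
    length(intersect(intersect(A, B), C))
  })
  (sum(hits >= k) + 1) / (n_sim + 1)
}
```

Here k is the observed overlap, n_a, n_b and n_c are the list sizes, and N is the number of probes on the chip (c. 45,000 in the thread). Under the stated null, sim_p3 should agree with overlap_p3 to within simulation error, which is in line with Wolfgang's point that the simulation approximates what the hypergeometric calculation gives exactly.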