Question

function similar to phyper function that can handle 3 or more gene lists

k. brand ▴ 420

@k-brand-1874

Last seen 10.1 years ago

Dear List,

This is a repost of "Re: [BioC] package or code to quantify the significance of the venn overlap between 2 or 3 lists of genes", with a related but new question born of my success using phyper.

I employed phyper to estimate the likelihood that the number of genes overlapping between 2 different lists of genes is due to chance. I need to do the same with 3 lists of genes, which phyper doesn't appear capable of. Can anyone recommend a function or share a script which might achieve this?

Previous post/discussion below if it helps.

With thanks in advance, cheers,

Karl

> -------- Original Message --------
> Subject: Re: [BioC] package or code to quantify the significance of the
> venn overlap between 2 or 3 lists of genes
> Date: Thu, 18 Mar 2010 17:25:14 +0100
> From: Karl Brand <k.brand at erasmusmc.nl>
> To: bioconductor at stat.math.ethz.ch
> CC: Wolfgang Huber <whuber at embl.de>, MCM at stowers.org, seandavi at gmail.com
>
> Dear List,
>
> I tried the phyper function as follows:
>
> #phyper(overlaplistA&B-1, genelistA, totalprobesonchip-genelistA,
> genelistB, lower.tail = FALSE, log.p = FALSE)
>
> Of which the output seemed logical to me. But I'd really appreciate
> someone's patience and experience to confirm some concerns:
>
> -is it 'safe' to employ this test where genelistA and genelistB were
> obtained from AnimalX-tissue1 and AnimalX-tissue2 respectively? i.e., do I
> violate any data independence assumptions this test makes?
>
> -the output Value is a 'distribution function'. Can I interpret this to
> be something like the 'likelihood that my observed result is due to
> chance alone'?
>
> -do I need to subtract 1 from my 'overlap'? In the example I followed
> at tinyurl.com/ygtmefa this appears to be the case, but the vignette
> has nothing on this.
>
> *most of all* how can I perform this test on three lists of overlapping
> genes, not merely the two in this case?
> Maybe someone knows a
> hack/method to combine the 3 outputs (of three pairwise comparisons) for
> an estimate of the 3-way overlap? Even a conservative estimate would be
> better than nothing!
>
> With thanks in advance for thoughts and suggestions, cheers,
>
> Karl
>
> On 3/17/2010 5:16 PM, Karl Brand wrote:
>> Thank you Wolfgang, Madelaine,
>>
>> I'd rather not reinvent the wheel if I can help it.
>>
>> And if you'll humor me a little longer, perhaps you can ensure I do
>> what you suggest correctly for my exact application.
>>
>> The overlaps I have are between 6 datasets. The experiment consisted of
>> a treatment (Pperiod) with 3 levels (S, E & L) applied to 2 tissues (R &
>> C). FYI targets file below if it helps. Each of the 6 datasets contains
>> 16 time points, which I interrogated for transcripts that fit a sine
>> curve and several other criteria, thus defining a list of 'rhythmic
>> genes' for each of the 6 datasets.
>>
>> So an obvious question is what rhythmic transcripts are common between
>> various combinations of the 6 datasets. Combinations being-
>>
>> Venn 1: Overlapping the 3 datasets of the 3 levels of treatment for
>> tissue 'R'
>> Venn 2: As above for tissue 'C'
>> Venn 3: Overlapping 'R' and 'C' for treatment level 1 only.
>> Venn 4: As for 3. for treatment level 2 only.
>> Venn 5: As for 3. for treatment level 3 only.
>>
>> So what I meant by "non-independent gene lists" I think might apply to
>> Venn 3, 4 and 5 given the fact that tissues 'R' & 'C' are obtained from
>> the same animals, albeit 16 of them, and as time courses. But still,
>> they cannot strictly speaking be considered independent, right? Which I
>> thought some tests, including Fisher's, depend on.
>>
>> Knowing this, would you think the phyper function is the right one for
>> my purpose? If so I'll plough on with the vindication of at least the
>> confidence that...someone with a lot more experience on this than me
>> recommends it!
>>
>> Again my thanks for engaging my query,
>>
>> Karl
>>
>> "RNA_Targets.txt"-
>>
>> FileName    Tissue  Pperiod  Time  Animal
>> 01file.CEL  R       S        1     1
>> 02file.CEL  C       S        1     1
>> 03file.CEL  R       S        2     2
>> 04file.CEL  C       S        2     2
>> 05file.CEL  R       S        3     3
>> 06file.CEL  C       S        3     3
>> 07file.CEL  R       S        4     4
>> 08file.CEL  C       S        4     4
>> 09file.CEL  R       S        5     5
>> 10file.CEL  C       S        5     5
>> 11file.CEL  R       S        6     6
>> 12file.CEL  C       S        6     6
>> 13file.CEL  R       S        7     7
>> 14file.CEL  C       S        7     7
>> 15file.CEL  R       S        8     8
>> 16file.CEL  C       S        8     8
>> 17file.CEL  R       S        9     9
>> 18file.CEL  C       S        9     9
>> 19file.CEL  R       S        10    10
>> 20file.CEL  C       S        10    10
>> 21file.CEL  R       S        11    11
>> 22file.CEL  C       S        11    11
>> 23file.CEL  R       S        12    12
>> 24file.CEL  C       S        12    12
>> 25file.CEL  R       S        13    13
>> 26file.CEL  C       S        13    13
>> 27file.CEL  R       S        14    14
>> 28file.CEL  C       S        14    14
>> 29file.CEL  R       S        15    15
>> 30file.CEL  C       S        15    15
>> 31file.CEL  R       S        16    16
>> 32file.CEL  C       S        16    16
>> 33file.CEL  R       E        1     17
>> 34file.CEL  C       E        1     17
>> 35file.CEL  R       E        2     18
>> 36file.CEL  C       E        2     18
>> 37file.CEL  R       E        3     19
>> 38file.CEL  C       E        3     19
>> 39file.CEL  R       E        4     20
>> 40file.CEL  C       E        4     20
>> 41file.CEL  R       E        5     21
>> 42file.CEL  C       E        5     21
>> 43file.CEL  R       E        6     22
>> 44file.CEL  C       E        6     22
>> 45file.CEL  R       E        7     23
>> 46file.CEL  C       E        7     23
>> 47file.CEL  R       E        8     24
>> 48file.CEL  C       E        8     24
>> 49file.CEL  R       E        9     25
>> 50file.CEL  C       E        9     25
>> 51file.CEL  R       E        10    26
>> 52file.CEL  C       E        10    26
>> 53file.CEL  R       E        11    27
>> 54file.CEL  C       E        11    27
>> 55file.CEL  R       E        12    28
>> 56file.CEL  C       E        12    28
>> 57file.CEL  R       E        13    29
>> 58file.CEL  C       E        13    29
>> 59file.CEL  R       E        14    30
>> 60file.CEL  C       E        14    30
>> 61file.CEL  R       E        15    31
>> 62file.CEL  C       E        15    31
>> 63file.CEL  R       E        16    32
>> 64file.CEL  C       E        16    32
>> 65file.CEL  R       L        1     33
>> 66file.CEL  C       L        1     33
>> 67file.CEL  R       L        2     34
>> 68file.CEL  C       L        2     34
>> 69file.CEL  R       L        3     35
>> 70file.CEL  C       L        3     35
>> 71file.CEL  R       L        4     36
>> 72file.CEL  C       L        4     36
>> 73file.CEL  R       L        5     37
>> 74file.CEL  C       L        5     37
>> 75file.CEL  R       L        6     38
>> 76file.CEL  C       L        6     38
>> 77file.CEL  R       L        7     39
>> 78file.CEL  C       L        7     39
>> 79file.CEL  R       L        8     40
>> 80file.CEL  C       L        8     40
>> 81file.CEL  R       L        9     41
>> 82file.CEL  C       L        9     41
>> 83file.CEL  R       L        10    42
>> 84file.CEL  C       L        10    42
>> 85file.CEL  R       L        11    43
>> 86file.CEL  C       L        11    43
>> 87file.CEL  R       L        12    44
>> 88file.CEL  C       L        12    44
>> 89file.CEL  R       L        13    45
>> 90file.CEL  C       L        13    45
>> 91file.CEL  R       L        14    46
>> 92file.CEL  C       L        14    46
>> 93file.CEL  R       L        15    47
>> 94file.CEL  C       L        15    47
>> 95file.CEL  R       L        16    48
>> 96file.CEL  C       L        16    48
>>
>> On 3/17/2010 4:16 PM, Wolfgang Huber wrote:
>>> Dear Karl
>>>
>>> [reposting to list]
>>>
>>> The bioinformatician was quicker, and provided a hack that "works", but
>>> a statistician might have pointed out that the simulation scheme you
>>> propose below is a needlessly poor and slow approximation of what the
>>> hypergeometric distribution or the Fisher test would do faster and more
>>> exactly.
>>>
>>> "Poor" because the distribution of count variables is (typically and in
>>> particular in your case) not symmetric and using a standard deviation to
>>> define a confidence interval and significance thresholds would ignore
>>> that - i.e. give suboptimal results.
>>>
>>> Don't get me wrong - I think it's great when people are capable of
>>> reinventing the wheel, but to get stuff done, using existing wheel designs
>>> tends to be more productive.
>>>
>>> PS I am not sure what you mean by "non-independent gene lists". If you
>>> already know that the lists are dependent, what exactly do you gain by
>>> showing that their overlap is higher than if they were independent?
>>> Isn't that tautological?
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>> Karl Brand scripsit 17/03/10 15:45:
>>>> Cheers Wolfgang,
>>>>
>>>> Unfortunately waiting on my local statistician also takes longer than
>>>> using the calculator :(
>>>>
>>>> Discussion with a much more responsive bioinformatician yielded the
>>>> plan to employ a bootstrap/randomisation (terminology?!) approach. i.e.:
>>>>
>>>> By using the same numbers of the chip-background probes (c. 45,000)
>>>> and my short-list of probes of interest (c.
>>>> 500), randomly selected
>>>> and checking the overlap, performed say 10,000 times, an estimate of
>>>> chance overlap could be obtained, along with a standard deviation
>>>> against which I could compare my actual results for an estimate of
>>>> significance, or p-value.
>>>>
>>>> Correct me if we're wrong but this seemed acceptable for Venns of
>>>> non-independent gene lists.
>>>>
>>>> Coding this was what I was appealing for help on since my experience
>>>> here is limited. But I'm definitely up for a crack at it. I'll start
>>>> by having a look at phyper in the "stats" package.
>>>>
>>>> Again with appreciation for your prompt, thoughtful response,
>>>>
>>>> Karl
>>>>
>>>> On 3/17/2010 2:48 PM, Wolfgang Huber wrote:
>>>>> Dear Karl,
>>>>>
>>>>> I don't think what you need here is necessarily a package - the
>>>>> required computations, if possible, are one or a few lines of R using
>>>>> standard functions e.g. in the "stats" package such as phyper.
>>>>>
>>>>> Perhaps the more important thing to do is to precisely define the
>>>>> questions you want to be asking. For this, discussion with a local
>>>>> statistician might be helpful. Once you have that, the answer will
>>>>> probably be fairly obvious from a basic text book on combinatorics
>>>>> (probability theory on discrete variables).
>>>>>
>>>>> Best wishes
>>>>> Wolfgang
>>>>>
>>>>> Karl Brand scripsit 17/03/10 12:26:
>>>>>> Dear BioCers,
>>>>>>
>>>>>> I've got six lists of genes, and I'm focused on the overlaps
>>>>>> between them.
>>>>>>
>>>>>> What I'm searching for is a package or code to quantify the
>>>>>> significance of the overlap between both a pair of gene lists, and
>>>>>> also between three gene lists. Six might be interesting, but not
>>>>>> necessary.
>>>>>>
>>>>>> Specifically, what would the overlap be expected by chance, and how
>>>>>> many standard deviations my actual overlap is from the estimated
>>>>>> chance overlap?
>>>>>>
>>>>>> Whilst some of my lists are independent, others are not, being
>>>>>> derived from tissues of the same origin. I understand this would
>>>>>> exclude tests like Fisher's exact test which assume independence.
>>>>>>
>>>>>> By using the same numbers of chip-background probes and short-listed
>>>>>> probes of interest, randomly selected and checking the overlap,
>>>>>> performed say 10,000 times, I think I could obtain the estimates I'm
>>>>>> looking for in a 'statistically acceptable' manner.
>>>>>>
>>>>>> Does anyone know of a package or code written for this purpose? I
>>>>>> failed to find anything in BioConductor or in the BioC lists. As
>>>>>> simple as coding it no doubt is, my lack of R knowledge would make
>>>>>> doing it with a calculator the faster option :)
>>>>>>
>>>>>> Look forward to any recommendations or suggestions with appreciation,
>>>>>>
>>>>>> Karl

--
Karl Brand
k.brand-asperand-erasmusmc.nl
Department of Genetics
Erasmus MC
Dr Molewaterplein 50
3015 GE Rotterdam
lab +31 (0)10 704 3409
fax +31 (0)10 704 4743
mob +31 (0)642 777 268
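
For the three-list question, one possible approach is sketched below. It assumes each list is an independent random draw from the same background of N probes, which is the same null model the two-list phyper call relies on; it does not address the dependence between tissues taken from the same animals that is raised in the thread. Conditioning on the size j of the A-B overlap, the overlap of list C with those j genes is again hypergeometric, so the three-way tail probability is a weighted sum of phyper calls. All sizes, variable names and the function name `p_three_way` are hypothetical and chosen only for illustration.

```r
# Hypothetical numbers for illustration only (not the poster's data).
N      <- 45000   # probes on the chip (background)
size_a <- 500     # genes in list A
size_b <- 500     # genes in list B
size_c <- 500     # genes in list C
k2     <- 20      # observed overlap of A and B
k3     <- 3       # observed overlap of A, B and C

# Two lists: P(overlap >= k2) when B is a random draw of size_b probes.
# With lower.tail = FALSE, phyper(q, ...) returns P(X > q), so q = k2 - 1
# is needed for the observed value itself to be included in the tail.
p_two <- phyper(k2 - 1, size_a, N - size_a, size_b, lower.tail = FALSE)

# Three lists: P(|A n B n C| >= k) under independent random draws.
# Sum over every possible size j of the A-B overlap:
#   P(|A n B| = j) * P(C shares at least k genes with those j genes)
p_three_way <- function(k, size_a, size_b, size_c, N) {
  j <- 0:min(size_a, size_b)
  sum(dhyper(j, size_a, N - size_a, size_b) *
      phyper(k - 1, j, N - j, size_c, lower.tail = FALSE))
}

p_three <- p_three_way(k3, size_a, size_b, size_c, N)

p_two
p_three
```

The sum has at most min(size_a, size_b) + 1 terms, so it is cheap even for large lists, and it avoids summarising a skewed count distribution by a mean and standard deviation as a simulation-based z-score would.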

14.6 years ago k. brand ▴ 420