help with multiple testing


Efthimios MOTAKIS ▴ 20

@efthimios-motakis-4986


Hi all,

My name is Mike and I am a post-doctoral fellow in Bioinformatics. I have a question about p-value adjustment for multiple testing and would be grateful for some advice.

I have 8,256 gene pairs, formed from all possible combinations of 129 genes. For each pair A-B (A different from B) four values are recorded: the number of tumors found in both A and B (TT), the number of tumors only in A (TF), the number of tumors only in B (FT), and the number of tumors found in neither A nor B (FF). The data are in the form of 2x2 contingency tables, e.g.

Gene 1  Gene 2  TT  TF  FT  FF
g1      g2       5   1   1  27
g1      g3       4   1   1  28
g2      g3       4   2   0  28
...

Notice that each gene is paired with all others and is therefore represented 128 times in this list. I want to find which of the 8,256 gene pairs (tests) show a significant association between rows (in A, not in A) and columns (in B, not in B), using Fisher's or Barnard's test. Subsequently I have to adjust the p-values for multiple testing.

At the 5% level I find approximately 500 significant gene pairs but, naturally, all p-value adjustment procedures I tried (for independent tests: BH, q-value; for dependent tests: BY, adaptiveBH and BlaRoq from package "multtest") produce adjusted p-values > 0.3. I think the problem is that the highly dependent nature of the data (50% of the genes have a very small number of mutations, which gives high p-values for every pair they form) dramatically affects the adjustment procedure.

Is there a better method for adjusting the p-values?

Would it be an appropriate solution to create multiple lists of gene pairs, in which each gene is represented only once, and then estimate q-values (multiple q-values for each pair)?

Thank you,
Mike
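For readers who want to reproduce the setup described in the question, here is a minimal sketch in Python using only the standard library (the thread itself refers to R packages such as "multtest"). It computes a two-sided Fisher exact p-value for one TT/TF/FT/FF table, applies the Benjamini-Hochberg (BH) step-up adjustment, and reports the smallest p-value a pair can reach given its margins. The function names are illustrative assumptions, not part of any package, and the two-sided rule used is the common "tables no more likely than the observed one" convention. The three tables are the ones quoted in the post.

```python
from math import comb


def fisher_exact_two_sided(tt, tf, ft, ff):
    """Two-sided Fisher exact p-value for the 2x2 table [[tt, tf], [ft, ff]]."""
    row1 = tt + tf            # tumors counted for gene A
    col1 = tt + ft            # tumors counted for gene B
    n = tt + tf + ft + ff     # all tumors
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x tumors in the TT cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo = max(0, row1 + col1 - n)
    hi = min(row1, col1)
    p_obs = prob(tt)
    # Sum over all tables that are no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def min_achievable_p(row1, col1, n):
    """Smallest two-sided Fisher p-value possible for the given margins."""
    lo = max(0, row1 + col1 - n)
    hi = min(row1, col1)
    return min(
        fisher_exact_two_sided(x, row1 - x, col1 - x, n - row1 - col1 + x)
        for x in range(lo, hi + 1)
    )


# The three example tables from the post (TT, TF, FT, FF).
tables = {("g1", "g2"): (5, 1, 1, 27),
          ("g1", "g3"): (4, 1, 1, 28),
          ("g2", "g3"): (4, 2, 0, 28)}
pairs = list(tables)
raw = [fisher_exact_two_sided(*tables[p]) for p in pairs]
for pair, p_raw, p_adj in zip(pairs, raw, benjamini_hochberg(raw)):
    print(pair, p_raw, p_adj)
```

As an illustration of the sparsity point raised in the question: with 34 tumors, `min_achievable_p(1, 6, 34)` is 6/34, about 0.18, so a pair formed by a gene seen in one tumor and a gene seen in six can never be significant at 5%, whatever the data. Such pairs still count toward the number of tests being adjusted for.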

Updated 12.8 years ago by Wolfgang Huber ★ 13k • written 12.8 years ago by Efthimios MOTAKIS ▴ 20


yao chen ▴ 210

@yao-chen-5205


Hi Mike,

I think another reason is the combination of a small sample size and many gene pairs, so some pairs are expected to be significant purely by chance, which leads to a high FDR. I don't know if there is a better solution. I would choose the top-ranking genes with a large fold change and a small p-value.

Jack
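The point about chance findings can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that null p-values are uniformly distributed, which is only approximate here because Fisher's test is discrete (and therefore conservative) and the tests are dependent. The function names are hypothetical; the numbers 8,256, 5% and roughly 500 come from the original question.

```python
def expected_null_hits(n_tests, alpha):
    """Expected number of p-values below alpha if every null hypothesis were true
    and null p-values were uniform on [0, 1] (an idealising assumption)."""
    return n_tests * alpha


def rough_fdr(n_tests, alpha, n_significant):
    """Crude FDR estimate at threshold alpha: expected chance hits / observed hits."""
    if n_significant == 0:
        return 0.0
    return min(1.0, expected_null_hits(n_tests, alpha) / n_significant)


n_tests = 8256        # all pairs of 129 genes
alpha = 0.05          # nominal significance level
n_significant = 500   # approximate count reported in the question

print(expected_null_hits(n_tests, alpha))        # about 413 pairs expected by chance
print(rough_fdr(n_tests, alpha, n_significant))  # about 0.83
```

Under these assumptions roughly 413 of the 500 nominally significant pairs could be chance findings, which is broadly in line with adjusted p-values that stay well above conventional thresholds.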

Written 12.8 years ago by yao chen ▴ 210


Wolfgang Huber ★ 13k

@wolfgang-huber-3550


EMBL European Molecular Biology Laborat…

Dear Mike,

I'd be surprised if this problem were cracked by a brute-force, purely 'statistical' approach. You could try to reduce the number of tests by first grouping the genes into 'pathways' or functional modules. With a lot of luck, the data may then just be large enough.

Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
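One possible reading of the grouping suggestion is sketched below. It rests on assumptions the thread does not state: a module is treated as hit in a tumor if any of its member genes is hit there, and the gene-to-module assignment, the gene and module names, and the toy data are all hypothetical. The sketch only builds the reduced set of 2x2 tables; any exact test could then be applied to them.

```python
from itertools import combinations
from math import comb


def collapse_to_modules(mutations, modules):
    """Map each module to the set of tumors in which any member gene is hit.

    mutations: dict gene -> set of tumor ids
    modules:   dict module name -> list of gene names (hypothetical grouping)
    """
    return {
        name: set().union(*(mutations.get(g, set()) for g in genes))
        for name, genes in modules.items()
    }


def contingency_table(tumors_a, tumors_b, all_tumors):
    """Return (TT, TF, FT, FF) for two sets of tumors."""
    tt = len(tumors_a & tumors_b)
    tf = len(tumors_a - tumors_b)
    ft = len(tumors_b - tumors_a)
    ff = len(all_tumors) - tt - tf - ft
    return tt, tf, ft, ff


def module_pair_tables(mutations, modules, all_tumors):
    """One 2x2 table per unordered pair of modules."""
    collapsed = collapse_to_modules(mutations, modules)
    return {
        (a, b): contingency_table(collapsed[a], collapsed[b], all_tumors)
        for a, b in combinations(sorted(collapsed), 2)
    }


# Toy data, invented purely for illustration.
all_tumors = set(range(1, 9))
mutations = {"g1": {1, 2}, "g2": {2, 3}, "g3": {1, 4}, "g4": {5}}
modules = {"modA": ["g1", "g2"], "modB": ["g3", "g4"]}
print(module_pair_tables(mutations, modules, all_tumors))

# Effect on the number of tests: 129 genes give comb(129, 2) = 8256 pairs,
# whereas, say, 20 modules would give comb(20, 2) = 190 pairs.
print(comb(129, 2), comb(20, 2))
```

Collapsing also raises the count of affected tumors per unit, which may help with sparsely hit genes, at the cost of no longer identifying individual gene pairs.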

Written 12.8 years ago by Wolfgang Huber ★ 13k