Selecting p-value cutoffs for differential expression

Matthew Hannah ▴ 940

@matthew-hannah-621

Last seen 9.6 years ago

Hi,

Before the main question, a minor point from the limma guide (as I'm using it to compute the p-values). In the swirl example there is the following sentence after the toptable is produced. Are the stats not independent because there are duplicate spots, or is there another reason that I should be aware of?

"Beware that the Benjamini and Hochberg method used to control the false discovery rate assumes independent statistics which we do not have here (see help(p.adjust))."

Anyway, this aside, I'm looking to canvass opinion on how to select a p-value cutoff for genes that are differentially expressed, hopefully also allowing an assessment of false positive and false negative rates. I've been playing around with the following, but none seems satisfactory. Does anyone have any input or experience on this topic?

1. Look at p-values for genes that are not called present in any of the arrays. I suspect some are slipping through, as there is still a peak of low p-values.

2. Look at p-values for genes that have not been previously reported as regulated by the treatment. But most previous work is poorly replicated and uses arbitrary cutoffs such as 2-fold, so there is a big peak of low p-values, though not as big as for the genes that have been previously reported. Any ideas how to use this difference?

3. Use a set of control or house-keeping genes to define a lower cut-off. Unfortunately some do respond to the treatment (also confirmed in previous work), so how to select appropriate genes?

4. As it seems that gcrma values have a bimodal distribution, any ideas on how to utilise the lower peak (which presumably represents 'absent' genes) to calculate a threshold?

5. Choose an FDR-adjusted p-value cutoff of 0.01, 0.001 or 0.0001, assuming these give approximately the corresponding false positive rates?

6. 'Decide' how many genes you want to be differentially expressed, and then select one of the above criteria appropriately. This obviously works as you'd like ;-) but is tricky to justify!

Cheers,
Matt
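For option 5, it may help to see what an FDR cutoff actually does to a list of raw p-values. The sketch below is a plain-Python illustration of the Benjamini and Hochberg adjustment, the same adjustment that `p.adjust(p, method = "BH")` computes in R. The function names and the eight raw p-values are made up for the example. Note that calling genes with adjusted p <= q aims to keep the expected proportion of false positives among the called genes at q, which is not the same thing as a per-gene false positive rate.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_below(adjusted, cutoffs):
    """Number of genes whose adjusted p-value is at or below each cutoff."""
    return {c: sum(a <= c for a in adjusted) for c in cutoffs}


# Made-up raw p-values for eight hypothetical genes.
raw = [0.00001, 0.0004, 0.003, 0.02, 0.04, 0.2, 0.5, 0.9]
adj = bh_adjust(raw)
calls = count_below(adj, [0.01, 0.001, 0.0001])
print(calls)
```

Tabulating the number of calls at several cutoffs, as `count_below` does, is one simple way to see how sensitive the gene list is to the choice between 0.01, 0.001 and 0.0001.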

limma gcrma • 3.7k views

updated 19.6 years ago by Suresh Gopalan ▴ 20 • written 19.6 years ago by Matthew Hannah ▴ 940

Ramon Diaz ★ 1.1k

@ramon-diaz-159

Last seen 9.6 years ago

On Tuesday 07 September 2004 17:20, Matthew Hannah wrote:
> Hi,
>
> Before the main question a minor point from the limma guide (as I'm
> using it to compute the p-values). In the swirl example there is the
> following sentence after the toptable is produced, are the stats not
> independent because there are duplicate spots, or is there another
> reason that I should be aware of?
>
> "Beware that the Benjamini and Hochberg method used to control the false
> discovery rate assumes independent statistics which we do not have here
> (see help(p.adjust))."

Actually, the control also holds for "positive regression dependency" (details in the Benjamini and Yekutieli 2001 paper), which some people argue is actually what is common with microarray data, because of the "(...) tendency of measurement errors of gene expressions to be positively correlated (...)" (p. 370 in Reiner, Yekutieli and Benjamini, 2003, Bioinformatics, 19 (3): 368--375). Anyway, the results in the paper by Reiner et al. show (convincingly to me, at least) that using the BH procedure we do control the FDR at the desired level. Briefly, then, I do not worry a lot about the non-independence of the statistics when I use BH with microarray data.

> Anyway, this aside. I'm looking to canvas opinion on how to select a
> p-value cutoff for genes that are differentially expressed, hopefully
> also allowing an assessment of false positive and negative rates aswell.
> I've been playing around with the following, but none seems
> satisfactory. Anyone have any input/experience on this topic?

This is probably a useless answer, but I think a crucial issue is the objective of the study. If those p-values are used to select a set of 50 genes for RT-PCR, where you have to spend a pre-allocated budget on exactly 50 genes, well, choose the top 50.

But if you will only continue with a follow-up if the evidence is "strong enough", then you will want to weigh, somehow, what "strong" is against the cost of not following up on some hidden gem with a not-low-enough p. And I think we have to ponder those issues in relation to other sources of error (e.g., is your statistical model, the one that leads to the unadjusted p-values, reasonable?), or to the representativeness issue (what are we willing to say when our adjusted p < 10^-9 comes from an observational study with 3 schizophrenic patients and 4 bipolar patients?).

Best,

R.

--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://ligarto.org/rdiaz
PGP KeyID: 0xE89B3462 (http://ligarto.org/rdiaz/0xE89B3462.asc)
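As a rough illustration of the two points in this reply, here is a small plain-Python sketch. It compares the BH adjustment with the more conservative Benjamini and Yekutieli variant (`method = "BY"` in R's `p.adjust`), which multiplies the BH values by the harmonic sum 1 + 1/2 + ... + 1/m and does not rely on independence or positive dependence. It also shows the fixed-budget case: pick the top k genes and report the largest adjusted p-value among them. The function names, the six p-values and k = 3 are assumptions made for the example, not anything from the thread.

```python
def fdr_adjust(pvalues, arbitrary_dependence=False):
    """Step-up FDR adjustment: BH by default, BY if arbitrary_dependence."""
    m = len(pvalues)
    # BY inflates the BH values by the harmonic sum 1 + 1/2 + ... + 1/m.
    factor = sum(1.0 / i for i in range(1, m + 1)) if arbitrary_dependence else 1.0
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for position, idx in enumerate(order):
        rank = m - position  # rank in ascending order
        running_min = min(running_min, pvalues[idx] * m * factor / rank)
        adjusted[idx] = running_min
    return adjusted


def top_k(pvalues, k):
    """Indices of the k smallest p-values, smallest first."""
    return sorted(range(len(pvalues)), key=lambda i: pvalues[i])[:k]


# Made-up raw p-values for six hypothetical genes; budget for k = 3 genes.
raw = [0.0002, 0.9, 0.004, 0.03, 0.0001, 0.5]
chosen = top_k(raw, 3)
bh = fdr_adjust(raw)
by = fdr_adjust(raw, arbitrary_dependence=True)
worst_bh = max(bh[i] for i in chosen)
worst_by = max(by[i] for i in chosen)
print(chosen, worst_bh, worst_by)
```

Reporting `worst_bh` (or `worst_by`) alongside a fixed-size list gives readers the FDR level at which that whole list would have been declared, even when the list size was set by budget rather than by a cutoff.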

19.6 years ago Ramon Diaz ★ 1.1k

Suresh Gopalan ▴ 60

@suresh-gopalan-932

Last seen 9.6 years ago

Hi,

While the answer for selecting an optimal threshold is not specifically addressed for a unified index case, I have proposed and tested a 'data scaling' approach to see the effect of a selected threshold on the identification of differentials of different magnitudes. Basically, the approach is to scale the whole dataset to different ratios of interest and apply the statistical threshold being used to see what proportion of false negatives one would expect.

This approach is used to generate Fig. 3A in the article "ResurfP: a response surface aided parametric test for identifying differentials in GeneChip based oligonucleotide array experiments" in the deposited research section of Genome Biology: http://genomebiology.com/preprint/. The full article, which partially addresses this question and that of selecting a threshold in probe-level analysis of GeneChip arrays (or identification of differentials when studying other features using multiple independent measurements), can be found at http://genomebiology.com/2004/5/11/P14 .

Hope this helps.

Suresh
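To make the 'data scaling' idea concrete, here is a loose plain-Python sketch. It is not the ResurfP method from the article. It only illustrates the general recipe described above under some strong assumptions: the data are on a log2 scale, so scaling by a ratio becomes adding log2(ratio); the "statistical threshold being used" is a fixed cutoff on the absolute Welch t-statistic; and the input is simulated null data. All names, the cutoff of 4.0 and the simulated values are hypothetical.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch t-statistic for two groups of replicate log2 values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


def false_negative_fraction(control, treated, ratio, t_cutoff):
    """Scale every gene by `ratio`, then report the fraction not called."""
    shift = math.log2(ratio)  # scaling by a ratio is a shift on the log2 scale
    missed = 0
    for a, b in zip(control, treated):
        scaled = [x + shift for x in b]
        if abs(welch_t(a, scaled)) < t_cutoff:
            missed += 1  # a gene with a known change that the threshold misses
    return missed / len(control)


# Hypothetical null data: 200 genes, 3 replicates per group, log2 scale.
rng = random.Random(1)
control = [[rng.gauss(8.0, 0.3) for _ in range(3)] for _ in range(200)]
treated = [[rng.gauss(8.0, 0.3) for _ in range(3)] for _ in range(200)]

for ratio in (1.5, 2.0, 4.0):
    fn = false_negative_fraction(control, treated, ratio, t_cutoff=4.0)
    print(f"{ratio}-fold: {fn:.2f} of genes missed")
```

Because every gene has been given the same known change, any gene that fails the threshold counts as a false negative, so the printed fractions give a rough feel for how much sensitivity a given cutoff costs at each fold change.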

19.6 years ago Suresh Gopalan ▴ 60
