(stupid) question about wilcoxon test and finding interesting genes


Dipl.-Ing. Johannes Rainer ▴ 430

@dipl-ing-johannes-rainer-846

Last seen 9.7 years ago

hi,
i must apologize for my question, but i'm not really good at statistics...

we have run Affymetrix GeneChips with samples from patients before and after treatment. until now i searched for genes that are influenced by the treatment using M values, but i also wanted to apply a statistical test to get some evidence that the genes i found are significant.

so i applied a paired Wilcoxon test to the expression values (one test per gene). my sample size is 13 (13 chips with samples before treatment and 13 afterwards). i subtracted the values after treatment from those before treatment:

p.vals <- apply((untreated-treated),MARGIN=1,wilcox.test)

untreated is a matrix with 13 columns and 54000 rows (genes), and treated has the same layout. according to the p values i got, nearly every gene is significant, even genes that are not regulated.

so my question: do i have to correct the p values, or was i totally wrong to assume that i could find significant (and regulated) genes this way?

thanks


updated 19.3 years ago by Naomi Altman ★ 6.0k • written 19.3 years ago by Dipl.-Ing. Johannes Rainer ▴ 430


James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 4 days ago

United States

Dipl.-Ing. Johannes Rainer wrote:

> so my question, do i have to correct the p values or was i totally wrong
> with the assumption to get significant (and regulated) genes in this way?

You have to correct the p-values to account for the fact that you have done 54,000 simultaneous tests. See e.g. ?p.adjust

Jim

--
James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
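As a rough sketch of the correction Jim points to, the snippet below applies p.adjust to a vector of per-gene p-values. The p-values are simulated placeholders (an assumption, not the poster's data), and the "BH" and "bonferroni" methods are chosen only for illustration.

```r
# Placeholder p-values (assumption): 1,000 null genes plus 50 genes with small p-values.
set.seed(42)
p.vals <- c(runif(1000), runif(50, min = 0, max = 1e-4))

# Benjamini-Hochberg adjustment, aimed at controlling the false discovery rate.
p.bh <- p.adjust(p.vals, method = "BH")

# Bonferroni adjustment, aimed at the family-wise error rate (more conservative).
p.bonf <- p.adjust(p.vals, method = "bonferroni")

# Number of genes called significant at 0.05 before and after adjustment.
n.raw  <- sum(p.vals < 0.05)
n.bh   <- sum(p.bh < 0.05)
n.bonf <- sum(p.bonf < 0.05)
print(c(raw = n.raw, BH = n.bh, bonferroni = n.bonf))
```

The adjusted values are never smaller than the raw p-values, so the list of significant genes can only shrink after adjustment.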

19.3 years ago by James W. MacDonald 65k


Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.1 years ago

United States

If I understand what you did, you should have only 1 column of p-values - 1 per gene. So, I think your apply command did not work as you expected (although I think it should have).

My understanding is that you have 2 arrays per patient and took the 13 M values. Applying a Wilcoxon test to each row should test that the median difference is 0.

Try doing the test on a couple of rows and then compare with the output you obtained.

After you get 1 p-value per gene, you should apply a multiple comparisons adjustment. FDR is popular and can be computed using the "qvalue" library in Bioconductor.

--Naomi

At 09:44 AM 2/11/2005, Dipl.-Ing. Johannes Rainer wrote:
>p.vals <- apply((untreated-treated),MARGIN=1,wilcox.test) , untreated is a
>matrix with 13 columns and 54000 rows (genes) and the same is treated).

Naomi S. Altman, 814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics, 814-863-7114 (fax)
Penn State University, 814-865-1348 (Statistics)
University Park, PA 16802-2111
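The row-by-row check suggested above might look like the following sketch. The matrices are simulated stand-ins for untreated and treated (an assumption; the real data have 54000 rows). The point is only that wilcox.test returns a test object, so the p-value has to be extracted from it.

```r
# Simulated stand-ins (assumption): 100 genes x 13 patients, log2 scale.
set.seed(1)
untreated <- matrix(rnorm(100 * 13, mean = 8), nrow = 100)
treated   <- matrix(rnorm(100 * 13, mean = 8), nrow = 100)
M <- untreated - treated  # per-patient differences, one row per gene

# Applying wilcox.test directly yields a list of "htest" objects, not p-values.
res <- apply(M, MARGIN = 1, wilcox.test)
print(class(res[[1]]))

# Extract one p-value per gene.
p.vals <- sapply(res, function(x) x$p.value)

# Spot-check a couple of rows by hand against the extracted vector.
p.row1 <- wilcox.test(M[1, ])$p.value
p.row2 <- wilcox.test(M[2, ])$p.value
print(c(p.vals[1], p.row1, p.vals[2], p.row2))
```

If the hand-computed values match the extracted vector, the apply call did what was intended and the vector can be passed on to a multiple-testing adjustment.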

19.3 years ago by Naomi Altman ★ 6.0k


yes, you are right, i applied a wilcoxon test to the M values (which in this case is the same as the paired wilcoxon test on the log2 expression values). i got a vector of p values, one p value for each gene. the p values were a little surprising to me, because some genes came out significant although they did not differ much between the sample and the control group. about 6000 genes have a p value less than 0.05, so this might be ok (i was a little too quick in saying that every gene is significantly different :) ).

so the next step is to correct the p values... i thought correcting p values is only necessary when i do multiple testing? sorry for my question, but i am more used to programming and working with databases than to doing statistics...

thanks for all your answers, you help me very much!
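The equivalence mentioned in this reply (a one-sample test on the differences versus a paired test on the two sets of values) can be checked on a single gene. The sketch below uses simulated values for one gene in 13 patients (an assumption), and also shows the smallest p-value an exact two-sided signed-rank test can return with 13 pairs.

```r
# Simulated log2 expression for one gene in 13 patients (assumption).
set.seed(7)
before <- rnorm(13, mean = 8)
after  <- rnorm(13, mean = 8.5)

# One-sample signed-rank test on the differences (the M values) ...
p.diff <- wilcox.test(before - after)$p.value

# ... gives the same p-value as the paired two-sample call.
p.paired <- wilcox.test(before, after, paired = TRUE)$p.value

# With 13 pairs and no ties, the smallest possible two-sided exact p-value
# is 2 / 2^13, reached when all differences have the same sign.
p.min <- wilcox.test(1:13)$p.value
print(c(p.diff = p.diff, p.paired = p.paired, p.min = p.min, bound = 2 / 2^13))
```

One consequence worth keeping in mind: 2 / 2^13 is about 0.00024, while a Bonferroni threshold for 54000 tests at 0.05 is about 0.0000009, so with this test and sample size no gene could pass a Bonferroni correction.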

19.3 years ago by Dipl.-Ing. Johannes Rainer ▴ 430


When there is no differential expression (and if the genes were independent) then the p-values should be uniformly distributed. So, if you test at level alpha and you have N genes, you should expect about alpha*N genes to have significant results (and all are false positives).

The FDR correction does 2 things simultaneously - it estimates the percentage of genes that are differentially expressed (using departures of the p-values from the uniform distribution) - and then estimates the false discovery rate for any observed p-value.

I guess we have to ask what "necessary" and "multiple testing" mean. There are 2 kinds of error - false detects and false non-detects. We do not do this type of correction if we worry more about false non-detects. If false detects are a bigger problem, then the FDR estimate allows us to estimate when we have an acceptable rate.

If you are really testing only a few genes on your arrays, I would not use FDR. If you are really testing all the genes, then I think you have a "highly multiple" testing situation.

I don't really like the term "adjusted p-value" for FDR estimates. They are not probabilities, they are estimated error rates. But that issue was discussed a few weeks ago on this list.

--Naomi
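The alpha*N argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are pure noise with independent genes, and N is set to 2,000 instead of 54000 to keep the run short.

```r
# Simulated null data (assumption): 2,000 genes x 13 patients, no treatment effect.
set.seed(123)
N <- 2000
alpha <- 0.05
M <- matrix(rnorm(N * 13), nrow = N)

# One signed-rank p-value per gene.
p.null <- apply(M, MARGIN = 1, function(d) wilcox.test(d)$p.value)

# Under the null, roughly alpha * N genes fall below alpha by chance alone.
n.false <- sum(p.null < alpha)
print(c(observed = n.false, expected = alpha * N))

# The same arithmetic for the array size in this thread: 0.05 * 54000 = 2700.
expected.thread <- 0.05 * 54000
print(expected.thread)
```

Because the exact signed-rank p-values are discrete, the observed count tends to sit slightly below alpha*N. For the thread's data, about 2700 genes would be expected below 0.05 by chance if nothing were changing; the roughly 6000 reported would be consistent with some real effects, though the count alone does not say which genes they are.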

19.3 years ago by Naomi Altman ★ 6.0k
