Invalid fold-filter

Bornman, Daniel M ▴ 110

@bornman-daniel-m-1391

Dear BioC Folks,

As a bioinformatician within a Statistics department, I often consult with real statisticians about the most appropriate test to apply to our microarray experiments. One issue being debated among our statisticians is whether some types of fold-filtering may be invalid or biased. The types of fold-filtering in question are those that are NOT non-specific.

Some filtering of a 54K-probe Affy chip is useful before making decisions on differential expression, and there are many examples in the Bioconductor documentation (particularly in the {genefilter} package) of how to do so. A popular method of non-specific filtering for reducing your probe set before applying statistics is to filter out low-expressed probes and then filter out probes that do not show a minimum difference between quartiles. These two steps are non-specific in that they do not take into consideration the identity (group membership) of the actual samples/arrays.

On the other hand, if we had two groups of samples, say control versus treated, and we filtered out those probes that do not have a mean difference in expression of 2-fold between the control and treated groups, this filtering would be based on the actual samples. This is NOT a non-specific filter. The problem (or rather the debate here) arises when a t-test is calculated for each probe that passed the sample-specific fold-filter and the p-values are adjusted for multiple comparisons by, for example, the Benjamini & Hochberg method.

Is it valid to fold-filter using the sample identity as a criterion, followed by correcting for multiple comparisons using just those probes that made it through the fold-filter? When correcting for multiple comparisons you take a penalty for the number of comparisons you are correcting for. The larger the pool of comparisons, the larger the penalty, and thus the larger the adjusted p-value. Or, more importantly, the smaller the set, the less your adjusted p-value is increased relative to your raw p-value. The argument is that you used the very samples you are comparing to unfairly reduce the adjusted p-value penalty.

Has anyone considered this issue or heard of problems with using a specific type of filtering rather than a non-specific one?

Thank you for any responses.

Daniel Bornman
Research Scientist
Battelle Memorial Institute
505 King Ave
Columbus, OH 43201
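To make the point about the penalty concrete, here is a minimal sketch in plain Python of the Benjamini & Hochberg adjustment. The function name `bh_adjust` and the p-values are hypothetical, chosen only for illustration; in practice one would use an established implementation rather than this hand-written one.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values: three small ones plus 97 unremarkable ones.
small = [0.001, 0.002, 0.003]
rest = [0.2 + 0.008 * k for k in range(97)]

full_set = bh_adjust(small + rest)   # the three small p-values among 100 tests
reduced_set = bh_adjust(small)       # the same three p-values among 3 tests

print("adjusted within 100 tests:", full_set[:3])    # each about 0.1
print("adjusted within   3 tests:", reduced_set)     # each about 0.003
```

The raw p-values are identical in both calls; only the number of tests entering the correction changes, which is exactly the quantity a filter manipulates.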

ADD COMMENT • link updated 18.2 years ago by Stephen Henderson ★ 1.0k • written 18.2 years ago by Bornman, Daniel M ▴ 110

rgentleman ★ 5.5k

@rgentleman-7725

United States

Bornman, Daniel M wrote:

> [...]
> Is it valid to fold-filter using the sample identity as a criterion,
> followed by correcting for multiple comparisons using just those probes
> that made it through the fold-filter?
> [...]

It is not valid to use phenotype to compute t-statistics for a particular phenotype, filter based on those p-values, and then use p-value correction methods on the result. I don't think we need research; it seems pretty obvious that this is not a valid approach.

You can do non-specific filtering, but all you are really doing there is removing genes that are inherently uninteresting no matter what the phenotype of the corresponding sample (if there is no variation in expression for a particular gene across samples, then it has no information about the phenotype of the sample). Filtering on low values is probably a bad idea, although many do it (and I used to, and still do sometimes, depending on the task at hand).

Best wishes
Robert

> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org

ADD COMMENT • link 18.2 years ago rgentleman ★ 5.5k

Bornman, Daniel M ▴ 110

@bornman-daniel-m-1391

I of course agree that filtering on a variable (phenotype) that will be used later to calculate adjusted p-values is flawed, and therefore it is not a method I would implement; however, it seems that many who describe fold-filtering are doing just that.

Thank you for your response.

-----Original Message-----
From: Robert Gentleman [mailto:rgentlem@fhcrc.org]
Sent: Friday, February 17, 2006 2:15 PM
To: Bornman, Daniel M
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Invalid fold-filter

It is not valid to use phenotype to compute t-statistics for a particular phenotype and filter based on those p-values and to then use p-value correction methods on the result. [...]

ADD COMMENT • link 18.2 years ago Bornman, Daniel M ▴ 110

Wittner, Ben ▴ 290

@wittner-ben-1031

USA/Boston/Mass General Hospital

Robert,

Could you explain why you say below that filtering on low values is probably a bad idea in many cases? Also, in what cases do you filter low values and in what cases not?

I filter out probe-sets for which expression values for two or more classes of interest are low, on the theory that such values are dominated by noise and fold changes calculated between two such classes are not meaningful.

Thanks.
-Ben

> You can do non-specific filtering, but all you are really doing there
> is to remove genes that are inherently uninteresting no matter what the
> phenotype of the corresponding sample (if there is no variation in
> expression for a particular gene across samples then it has no
> information about the phenotype of the sample). Filtering on low values
> is probably a bad idea although many do it (and I used to, and still do
> sometimes depending on the task at hand).
>
> Best wishes
> Robert

------------------------------------------------------
Ben Wittner, 617-643-3166, wittner.ben at mgh.harvard.edu

ADD COMMENT • link 18.2 years ago Wittner, Ben ▴ 290

Hi Ben,

Basically, the problem is that the actual observed intensity only indirectly relates to the copy number of the mRNA species being assayed. It is valid to do within-probe-set (or within-probe), between-array comparisons: if on array 1 the probe has value x, and on array 2 it has value y, with y > x, then we are pretty sure there is more mRNA for that species in the second sample than in the first. But within-array, between-spot comparisons are not valid, in the sense that just because one spot on array 1 is brighter than another spot on array 1 does not mean that the underlying abundance of mRNA is ordered in the same way. There are lots of opportunities for attenuation (if the samples are amplified, which I think all are now, I don't believe that the amplification procedure has equal efficacy on all mRNAs; I don't believe all mRNAs label with equal efficiency or hybridize with equal efficiency).

A corollary of the observation that within-array, between-probe comparisons are not valid is that one should not filter on level. Just because a spot is low in intensity does not necessarily mean very much. It also seems to be the case (although I have not checked recently) that transcription factors are fairly low in abundance. I won't disagree that spots with low intensities are more likely to be noise, and so are candidates for suspicion; I am just not sure how you could be sure that the baby has not departed with the bath water.

The times when I still use level are when there are too many genes that show appropriate levels of variation and I need to further reduce the gene set, or when I am looking for a biomarker (that is a longer discussion). But basically, if what you are looking for are good, reliable signatures that differentiate one group from the other, then you don't much care about the low-expressing genes. In some of those cases you are not trying to understand the biology (and when comprehension is your objective, then keeping everything is a good idea, IMHO) but rather are most interested in an objective measure that can be used to classify samples.

I am sure there are other reasons as well.

Best wishes
Robert

ADD REPLY • link 18.2 years ago rgentleman ★ 5.5k

Bornman, Daniel M ▴ 110

@bornman-daniel-m-1391

Robert,

After reading your response to my initial question, I do not believe you addressed exactly what I attempted to describe. Please pardon me for not being clear. I think your response assumed I was filtering on unadjusted p-values and then applying a correction such as Benjamini & Hochberg to the reduced set.

My question was rather about the validity of first filtering each gene based on fold-change between two sample groups (i.e. controls vs treated), then calculating a test statistic, raw p-value and corrected p-value for each gene that passed the fold-change filter. I am worried that using the group phenotype description to filter, followed by applying a p-value correction, is unfairly reducing my multiple comparison penalty.

I propose that a less biased approach to fold-filtering would be to filter probes based on the mean of the lower half versus the mean of the upper half of expression values at each probe, regardless of the phenotype (non-specific). This would surely (except in some instances where a phenotype causes drastic expression changes) cause the fold-filtered set to be larger and thus not unfairly decrease the multiple comparison penalty when computing adjusted p-values.

Thank You,
Daniel

-----Original Message-----
From: Robert Gentleman [mailto:rgentlem@fhcrc.org]
Sent: Saturday, February 18, 2006 12:57 PM
To: Bornman, Daniel M
Subject: Re: [BioC] Invalid fold-filter

Hi Daniel,
I hope not, it is as you have noted a flawed approach.

best wishes
Robert
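The lower-half versus upper-half filter proposed in this message could look roughly like the following sketch. It assumes expression values are already on the log2 scale (so a difference of 1 corresponds to 2-fold); the probe names, values, and the helper `half_split_log_fold` are hypothetical. Note that no group labels enter the calculation at any point.

```python
def half_split_log_fold(values):
    """Mean of the upper half minus mean of the lower half of one probe's
    sorted log2 expression values (the middle value is ignored if n is odd)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return sum(upper) / half - sum(lower) / half


# Hypothetical log2 expression values, one list of six arrays per probe.
probes = {
    "probe_a": [7.0, 7.1, 6.9, 7.2, 7.0, 6.8],   # nearly flat across arrays
    "probe_b": [5.0, 5.2, 4.9, 7.1, 7.3, 6.9],   # clearly varies across arrays
}

min_log2_fold = 1.0  # assumed threshold: 2-fold on the log2 scale

kept = [name for name, values in probes.items()
        if half_split_log_fold(values) >= min_log2_fold]
print(kept)
```

Because the split is made on the sorted values rather than on control/treated membership, the statistic is the same however the arrays are labelled.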

ADD COMMENT • link 18.2 years ago Bornman, Daniel M ▴ 110

On 2/20/06 8:33, "Bornman, Daniel M" <bornmand at battelle.org> wrote:

> I propose that a less biased approach to fold-filtering would be to
> filter probes based on the mean of the lower half versus the mean of the
> upper half of expression values at each probe regardless of the
> phenotype (non-specific). This would surely (except in some instances
> where a phenotype causes drastic expression changes) cause the
> fold-filtered set to be larger and thus not unfairly decrease the
> multiple comparison penalty when computing adjusted p-values.

Daniel,

This is routinely done by many folks. You could use the interquartile range, the standard deviation, or many others. As you suggest, such filtering is done in an unsupervised manner (not using groups).

Sean
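As an illustration of the unsupervised filters mentioned here, the sketch below computes the interquartile range and the standard deviation per probe using only Python's standard `statistics` module. The data and both cutoffs are hypothetical, and `statistics.quantiles` uses its default ("exclusive") quartile definition, which may differ slightly from the definition used by other software.

```python
import statistics


def iqr(values):
    """Interquartile range using statistics.quantiles' default method."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical log2 expression values across six arrays; no group labels used.
expression = {
    "probe_a": [7.0, 7.1, 6.9, 7.2, 7.0, 6.8],
    "probe_b": [5.0, 5.2, 4.9, 7.1, 7.3, 6.9],
    "probe_c": [9.0, 9.6, 8.4, 10.1, 9.2, 8.3],
}

min_iqr = 0.5  # assumed cutoff, for illustration only
min_sd = 0.5   # assumed cutoff, for illustration only

kept_by_iqr = [name for name, v in expression.items() if iqr(v) >= min_iqr]
kept_by_sd = [name for name, v in expression.items()
              if statistics.stdev(v) >= min_sd]

print("kept by IQR:", kept_by_iqr)
print("kept by SD: ", kept_by_sd)
```

Either statistic only measures spread across all arrays, so the choice between them is a matter of robustness to outliers rather than of validity.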

ADD REPLY • link 18.2 years ago Sean Davis 21k

Hello all,

I have also pondered the issue of filtering genes to reduce the amount of multiple hypothesis correction, and what is or isn't statistically valid. I do routinely filter on some estimate of "presence": either Affy's P/M/A calls or, for spotted arrays, comparison to blanks, buffers and/or negative controls. However, I only filter out a gene if it is not deemed "present" on any of the arrays; my rationale is that if it's a whole-genome array, only a subset of those genes will be expressed in any particular tissue, developmental stage, etc. I keep a gene if it is "present" in at least one sample rather than, say, half the samples as I've seen in other analyses, because the possibility exists that a gene may be expressed in only one of the treatment groups.

On the other hand, I've never been comfortable with filtering on even a non-specific measure of variation across arrays. After reading Jim's response, I agree that if you're mainly interested in sample classification, then it could be reasonable to filter out genes that do not vary, but it still doesn't seem right to do this if you're mainly interested in determining differential expression between two or more known classes. My reasoning is that the p-values are based on the null F-distribution, and that by removing genes with little variance you are in effect removing the left side of the F-distribution, which would seem to invalidate the p-values because the area under the remaining distribution has changed. If you couldn't tell, my logic is not based on formal statistical theory but rather on my intuitive feel for the matter!

Cheers,
Jenny

Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801 USA
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu
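The "present in at least one sample" rule described above can be sketched as follows, contrasted with the stricter "present in at least half" rule. The call matrix is hypothetical, and treating only "P" (not "M", marginal) as present is an assumption made for this example, not something stated in the message.

```python
# Hypothetical Affymetrix-style detection calls (P = present, M = marginal,
# A = absent), one list of four arrays per gene.
calls = {
    "gene_1": ["A", "A", "A", "A"],   # never detected
    "gene_2": ["A", "A", "A", "P"],   # detected on a single array only
    "gene_3": ["P", "P", "P", "P"],   # detected everywhere
    "gene_4": ["M", "A", "A", "A"],   # marginal once, otherwise absent
}


def present_in_any(gene_calls):
    """True if the gene is called present on at least one array.
    Assumption: only 'P' counts as present; 'M' does not."""
    return any(call == "P" for call in gene_calls)


def present_in_fraction(gene_calls, min_fraction):
    """True if the gene is called present on at least min_fraction of arrays."""
    n_present = sum(call == "P" for call in gene_calls)
    return n_present / len(gene_calls) >= min_fraction


kept_any = [g for g, c in calls.items() if present_in_any(c)]
kept_half = [g for g, c in calls.items() if present_in_fraction(c, 0.5)]

print("kept if present on any array:", kept_any)
print("kept if present on half:     ", kept_half)
```

Here `gene_2` survives the "any array" rule but not the "half" rule, which is the situation the message is concerned about: a gene expressed in only a small part of the experiment.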

ADD REPLY • link 18.2 years ago Jenny Drnevich ★ 2.2k

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060220/ 1c9488a9/attachment.pl

ADD REPLY • link 18.2 years ago Sharon Anbu ▴ 480

Bornman, Daniel M wrote:

> My question was rather on the validity of first filtering each gene
> based on fold-change between two sample groups (i.e. controls vs
> treated) then calculating a test-statistic, raw p-value and corrected
> p-value on each gene that passed the fold-change filter. I am worried
> that using the group phenotype description to filter followed by
> applying a p-value correction is unfairly reducing my multiple
> comparison penalty.

It does, and it does not matter how you get there. Using one test (fold change) to filter and a different test (t-test) for p-value correction does not change the fact that, if both tests define the sample groups in the same way, there are problems with the interpretation.

> I propose that a less biased approach to fold-filtering would be to
> filter probes based on the mean of the lower half versus the mean of the
> upper half of expression values at each probe regardless of the
> phenotype (non-specific). This would surely (except in some instances
> where a phenotype causes drastic expression changes) cause the
> fold-filtered set to be larger and thus not unfairly decrease the
> multiple comparison penalty when computing adjusted p-values.

Well, that is one thing, but really, IMHO, you are better off filtering on variance than on some rather arbitrary division into two groups (why 1/2? Many of the classification problems I deal with are very unbalanced, and 1/2 would be a pretty bad choice). And it is variation that is important (whence ANOVA, ANalysis Of VAriance).

Best wishes,
Robert
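A small simulation can illustrate why the group-based fold filter is a problem. The sketch below is a deliberate simplification, not the analysis anyone in this thread ran: every gene is null (no true difference), the per-gene variance is assumed known so that a z-test stands in for the t-test, and all settings (`n_genes`, `n_per_group`, the 2-fold cutoff, the seed) are hypothetical.

```python
import math
import random


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


random.seed(1)
n_genes = 2000        # hypothetical number of probes, all truly null
n_per_group = 8       # arrays per group
sigma = 1.0           # assumed known per-array SD on the log2 scale
min_log2_fold = 1.0   # 2-fold cutoff on the log2 scale

diffs, pvals = [], []
for _ in range(n_genes):
    control = [random.gauss(0.0, sigma) for _ in range(n_per_group)]
    treated = [random.gauss(0.0, sigma) for _ in range(n_per_group)]
    diff = sum(treated) / n_per_group - sum(control) / n_per_group
    z = diff / (sigma * math.sqrt(2.0 / n_per_group))
    diffs.append(diff)
    pvals.append(math.erfc(abs(z) / math.sqrt(2.0)))  # two-sided z-test

# Phenotype-specific filter: keep genes with at least 2-fold group difference.
survivors = [p for d, p in zip(diffs, pvals) if abs(d) >= min_log2_fold]

n_sig_full = sum(a < 0.05 for a in bh_adjust(pvals))
n_sig_filtered = sum(a < 0.05 for a in bh_adjust(survivors))

print("genes passing the fold filter:", len(survivors))
print("BH < 0.05 using all genes:     ", n_sig_full)
print("BH < 0.05 using survivors only:", n_sig_filtered)
```

With these settings, passing the fold filter is equivalent to having |z| >= 2, so every surviving raw p-value is already below about 0.0455. Adjusting within the survivors therefore declares all of them significant at 0.05, even though every one is a false positive by construction.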

ADD REPLY • link 18.2 years ago rgentleman ★ 5.5k
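
The remark in the quoted question that a smaller set of probes means a smaller correction can be checked directly. Below is a minimal sketch of the Benjamini & Hochberg step-up adjustment in plain Python. The function name `bh_adjust` and the p-values are made up for illustration and do not come from any real experiment.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values: five small ones plus 95 unremarkable ones.
small = [0.001, 0.002, 0.003, 0.004, 0.005]
rest = [0.05 + 0.01 * k for k in range(95)]

full = bh_adjust(small + rest)         # corrected over all 100 tests
reduced = bh_adjust(small + rest[:5])  # corrected over only 10 tests

print("adjusted over 100 tests:", full[:5])
print("adjusted over  10 tests:", reduced[:5])
```

With these made-up numbers the same five raw p-values adjust to about 0.1 when corrected over 100 tests and to about 0.01 when corrected over 10. The adjustment itself is doing its job in both cases. The question in this thread is whether the choice of which 10 tests to keep was allowed to look at the group labels.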


I think it is unwise to filter based on observed data. This biases the results.

On the other hand, filtering on a priori considerations, such as lack of annotation, should not be a problem.

--Naomi

At 12:19 AM 2/21/2006, Robert Gentleman wrote:
> Bornman, Daniel M wrote:
> > Robert,
> >
> > After reading your response to my initial question, I do not believe you addressed exactly what I attempted to describe. Please pardon me for not being clear. I think your response assumed I was filtering on unadjusted p-values then applying a correction such as Benjamini & Hochberg to a reduced set.
> >
> > My question was rather on the validity of first filtering each gene based on fold-change between two sample groups (i.e. controls vs treated) then calculating a test-statistic, raw p-value and corrected p-value on each gene that passed the fold-change filter. I am worried that using the group phenotype description to filter followed by applying a p-value correction is unfairly reducing my multiple comparison penalty.
>
> It does, and it does not matter how you get there, using one test (fold change) to filter and a different test (t-test) for p-value correction does not really change the fact that if both tests make use of the same way to define samples, then there are problems with the interpretation.
>
> > I propose that a less biased approach to fold-filtering would be to filter probes based on the mean of the lower half versus the mean of the upper half of expression values at each probe regardless of the phenotype (non-specific). This would surely (except in some instances where a phenotype causes drastic expression changes) cause the fold-filtered set to be larger and thus not unfairly decrease the multiple comparison penalty when computing adjusted p-values.
>
> Well that is one thing, but really IMHO, you are better off filtering on variance than some rather arbitrary division into two groups (why 1/2 - many of the classification problems I deal with are very unbalanced and 1/2 would be a pretty bad choice). And, it is variation that is important (whence ANOVA - ANalysis Of VAriance).
>
> Best wishes,
> Robert
>
> > Thank You,
> > Daniel

Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348 (Statistics)
University Park, PA 16802-2111

ADD REPLY • link 18.2 years ago Naomi Altman ★ 6.0k
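
Robert's objection to the lower-half versus upper-half split (quoted above) is easier to see with numbers. The sketch below compares that split with a plain per-probe variance on two invented probes, with values assumed to be on a log2 scale. The helper name `half_split_difference` and both probes are hypothetical.

```python
from statistics import mean, variance


def half_split_difference(values):
    """Mean of the upper half minus mean of the lower half (labels ignored)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return mean(ordered[-half:]) - mean(ordered[:half])


# Balanced design: 5 of 10 arrays are one log2 unit (2-fold) higher.
balanced = [8.0] * 5 + [9.0] * 5
# Unbalanced design: only 2 of 10 arrays differ, but by two log2 units (4-fold).
unbalanced = [8.0] * 8 + [10.0] * 2

for name, probe in [("balanced", balanced), ("unbalanced", unbalanced)]:
    print(name,
          "half-split difference:", round(half_split_difference(probe), 3),
          "variance:", round(variance(probe), 3))
```

In this toy case the unbalanced probe falls short of a 2-fold cutoff under the half split (a difference of about 0.8 log2 units) even though two arrays differ 4-fold, while its variance (about 0.71) is well above that of the balanced probe (about 0.28). Neither measure uses the sample labels, so both are non-specific in the sense used in this thread.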


Hi,

In substance I agree with Naomi, but I do want to suggest that there are likely to be biases (statistical sense) introduced by filtering on a lack of annotation and I personally would want to deal with that at the end of the analysis, not at the beginning.

Not all molecular systems are equally studied, or published on, and if your experiment has intersected with one of these, then pre-filtering will hide that information from you. In some cases this is not a concern, but in others it may be.

Of course you can do little with the data if there is no annotation - but even there, you can get the sequence and do some reasonable stuff with that much information these days.

On the approach of filtering on variation, I did some simulation studies to convince myself it was not a big problem (with respect to bias), when I first started doing it. You should do your own simulations if you wonder about the effect of different procedures (it is pretty simple).

best wishes
Robert

-- Robert Gentleman, PhD Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 PO Box 19024 Seattle, Washington 98109-1024 206-667-7700 rgentlem at fhcrc.org

ADD REPLY • link 18.2 years ago rgentleman ★ 5.5k
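
A simulation of the kind Robert suggests might look like the sketch below. It is not his simulation. It assumes a global null (no probe differs between groups), 4 arrays per group, log2 values with a known standard deviation of 0.5, and a known-variance z-test in place of the t-test so that only the Python standard library is needed. All names and parameter values are invented for illustration.

```python
import random
from statistics import NormalDist, mean, variance


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def simulate(n_probes=5000, n_per_group=4, sd=0.5, fold_cutoff=1.0,
             alpha=0.05, seed=1):
    """Return {filter name: (probes kept, false discoveries)} under a global null."""
    rng = random.Random(seed)
    norm = NormalDist()
    se = sd * (2.0 / n_per_group) ** 0.5  # known standard error of the difference
    diffs, pvals, variances = [], [], []
    for _ in range(n_probes):
        control = [rng.gauss(8.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(8.0, sd) for _ in range(n_per_group)]
        diff = mean(treated) - mean(control)      # log2 fold change, uses labels
        diffs.append(diff)
        pvals.append(2.0 * (1.0 - norm.cdf(abs(diff) / se)))
        variances.append(variance(control + treated))  # ignores labels

    def false_discoveries(kept):
        # Every call made under a global null is a false discovery.
        return sum(a <= alpha for a in bh_adjust([pvals[i] for i in kept]))

    everything = list(range(n_probes))
    fold_kept = [i for i in everything if abs(diffs[i]) >= fold_cutoff]
    var_kept = sorted(everything, key=lambda i: variances[i],
                      reverse=True)[: n_probes // 2]
    return {
        "no filter": (n_probes, false_discoveries(everything)),
        "fold filter": (len(fold_kept), false_discoveries(fold_kept)),
        "variance filter": (len(var_kept), false_discoveries(var_kept)),
    }


result = simulate()
for name, (kept, false_hits) in result.items():
    print(f"{name:16s} kept={kept:5d} false discoveries={false_hits}")
```

Under these assumptions every probe that passes the 2-fold filter already has a raw p-value below 0.05, so the correction applied to the survivors alone declares all of them significant although nothing is truly changed. This is the extreme case: with a known variance the fold change and the test statistic carry the same information. With a t-test the link is looser, so it is worth rerunning the comparison with the statistic and sample sizes you actually use. The counts for the unfiltered and variance-filtered sets are left to the printed output rather than stated here.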


On 2/21/06 12:17, "Robert Gentleman" <rgentlem at fhcrc.org> wrote:

> Hi,
>
> In substance I agree with Naomi, but I do want to suggest that there are likely to be biases (statistical sense) introduced by filtering on a lack of annotation and I personally would want to deal with that at the end of the analysis, not at the beginning.
>
> Not all molecular systems are equally studied, or published on, and if your experiment has intersected with one of these, then pre-filtering will hide that information from you. In some cases this is not a concern, but in others it may be.
>
> Of course you can do little with the data if there is no annotation - but even there, you can get the sequence and do some reasonable stuff with that much information these days.

I would second this sentiment. Genome annotation is a moving target. A probe or probeset that represents a gene one day may not represent it the next, or may represent a different one. In addition, microarray manufacturers typically focus on one or two sets of genomic annotation (for example, the RefSeq set from NCBI); there are multiple other sets of genome annotation that may be more inclusive or have different coverage of the "full" set of transcripts.

While it may be prohibitively complicated and time-consuming for many labs to BLAST every probe or consensus sequence against all known transcripts (and even GenBank) a priori (although that is what our lab does, in practice), once a gene list is available that includes an "EST" or anonymous sequence, it is often quite enlightening to look at BLAST results against various genome annotations. Often, these probes represent a gene family or some highly conserved domain; while this isn't necessarily useful information in and of itself, in a particular biologic context or gene set, having such information may be just as hypothesis-generating as probes that "cleanly" represent a gene.

Sean

ADD REPLY • link 18.2 years ago Sean Davis 21k


Stephen Henderson ★ 1.0k

@stephen-henderson-71

Last seen 7.0 years ago

Yes, removing genes like that is wrong. The second method you mention splits the data and will probably also let bias slip in through the back door.

People often filter on the coefficient of variation (sd/mean), on the grounds that the sd of the data should increase linearly with the mean. It doesn't with array data, so I sometimes take the moderated sd (sigma, if I remember rightly) from limma eBayes(fit) and use that to calculate a cv filter.

Stephen Henderson
Wolfson Inst. for Biomedical Research
Cruciform Bldg., Gower Street
University College London
United Kingdom, WC1E 6BT
+44 (0)207 679 6827

ADD COMMENT • link 18.2 years ago Stephen Henderson ★ 1.0k
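
For reference, a plain coefficient-of-variation filter of the kind Stephen describes can be sketched as follows. This uses the ordinary sample sd, not the moderated sd from limma, which is not reproduced here. The probe names, intensities and the 0.1 cutoff are invented, and the values are assumed to be on the raw (unlogged) intensity scale so that the mean is positive.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Ordinary sample sd divided by the mean (no moderation)."""
    return stdev(values) / mean(values)


def cv_filter(expr, min_cv):
    """Return the ids of probes whose CV across all arrays is at least min_cv.

    expr maps a probe id to its intensities, one per array. Sample labels
    are never consulted, so the filter is non-specific.
    """
    return {probe for probe, values in expr.items()
            if coefficient_of_variation(values) >= min_cv}


# Hypothetical raw-scale intensities for two probes across four arrays.
expr = {
    "probe_flat": [100.0, 101.0, 99.0, 100.0],
    "probe_variable": [50.0, 200.0, 80.0, 150.0],
}

print(cv_filter(expr, min_cv=0.1))
```

Here only `probe_variable` is kept. Swapping in a moderated sd would change the numerator only; the filtering logic would stay the same.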
