Filtering before differential expression analysis of microarrays

Daniel Brewer ★ 1.9k

Hi,

There is a new paper out in BMC Bioinformatics that seems to justify filtering before differential expression analysis is performed (Hackstadt & Hess, BMC Bioinformatics 2009, 10:11, http://www.biomedcentral.com/1471-2105/10/11/abstract), specifically filtering by variance and by detection call. I have the impression from this list that the general opinion is that one should only filter out the control genes before testing. I was wondering if anyone had any opinions on this paper and on the topic in general.

Many thanks

Dan

--
Daniel Brewer, Ph.D.
Institute of Cancer Research
Molecular Carcinogenesis
Email: daniel.brewer at icr.ac.uk

Updated 15.3 years ago by Gordon Smyth 50k • written 15.3 years ago by Daniel Brewer ★ 1.9k

James W. MacDonald 65k

Hi Dan,

I'm sure people do have opinions about this topic ;-D

The reason people have so many opinions is that it isn't a simple question, and the answer depends on what you consider important.

If you are just trying to limit the number of multiple comparisons to increase power, then filtering first is probably the way to go.

If you are concerned with the accuracy of the FDR estimates, then filtering first may not be ideal.

If you are using limma (Hackstadt and Hess used multtest), then you should filter after the eBayes step but before the FDR step, as an assumption of the eBayes step is that all of the data from the chip are available.

Unless of course you are concerned about the accuracy of the FDR estimates, in which case... well, you see the point.

With microarray data analysis, the arguments for and against a particular way of doing things can generate more heat than light, as nobody really knows the underlying truth, and the measures we use are far removed from the actual phenomenon we are testing.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-5646
734-936-8662
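As a rough illustration of the ordering described above (compute a p-value for every probe, then filter, then adjust for multiple testing), here is a minimal sketch in plain Python rather than limma. The p-values and the `keep` mask are invented for the example, and `bh_adjust` is a hypothetical helper that applies the standard Benjamini-Hochberg step-up adjustment; none of these names come from the paper or from limma.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for eight probes (assumption: computed on the full data set).
pvalues = [0.001, 0.008, 0.012, 0.2, 0.5, 0.7, 0.9, 0.95]
# Made-up filter result: True means the probe survives the filter.
keep = [True, True, True, True, False, False, False, False]

adjusted_all = bh_adjust(pvalues)
adjusted_filtered = bh_adjust([p for p, k in zip(pvalues, keep) if k])

print("adjusted, no filtering:   ", [round(p, 4) for p in adjusted_all])
print("adjusted, after filtering:", [round(p, 4) for p in adjusted_filtered])
```

In this toy case the surviving probes receive smaller adjusted p-values because fewer tests enter the adjustment. Whether the resulting FDR estimates remain accurate is exactly the concern raised above, and the sketch does not settle it.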

James W. MacDonald 65k • 15.3 years ago

Gordon Smyth 50k

Dear Dan,

It's very common practice to keep all the probes for normalization, then to filter control probes and consistently non-expressed probes before differential expression analysis. I recommend this and do it myself. It's such common practice that it's surprising to see a paper on it at this stage.

It is in the spirit of normalization methods that all probes should be retained for normalization, except in unusual cases in which some probes are obviously of poor quality for reasons other than expression level.

At the differential expression step, probes can be usefully filtered out if they are not of any potential interest. This means control probes, or probes which appear to be non-expressed across all conditions in the experiment, i.e., on all arrays. I have frequently complained on this mailing list about the practice of filtering individual low intensity probes on individual arrays, which IMO is a very destructive practice. If you filter a probe on the basis of expression, it must be filtered on all arrays.

Filtering non-expressed probes tends not to be emphasised on this list because users of this list are often sophisticated enough to use variance stabilizing normalization methods such as rma, vsn, normexp or vst. This means that low-expression filtering is done more for multiplicity issues than for variance stabilization, and therefore often doesn't make a huge difference. When using earlier normalization methods such as MAS for Affy or local background correction for two-color arrays, expression filtering is absolutely essential, because the normalized expression values are so unstable at low intensity levels.

To James: it is not necessary to retain all the probes on the array for eBayes(). The only requirement is that eBayes() sees all the probes which are under consideration for differential expression. So filtering out consistently non-expressed probes before linear modelling is generally a good idea.

In fact, filtering often improves the eBayes() assumptions. eBayes() assumes that the residual variances are not intensity-dependent. However, very lowly expressed probes often follow a mean-variance relationship which is somewhat different from that of the other probes, even after variance stabilization, in which case filtering will improve the constancy of variance assumption. This tends not to be a big issue with rma-Affy data, but it is an important issue with vst-Illumina data, for example.

Best wishes

Gordon
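The rule that a probe is filtered on all arrays or on none can be sketched as follows. This is a minimal plain-Python illustration, not limma code: the probe names, the log2 intensities, the threshold of 6.0 and the helpers `keep_probe` and `filter_probes` are all assumptions made up for the example, and the thread does not say how the threshold should be chosen.

```python
def keep_probe(values, threshold, min_arrays=1):
    """True if the probe exceeds `threshold` on at least `min_arrays` arrays."""
    return sum(v > threshold for v in values) >= min_arrays


def filter_probes(expr, controls, threshold, min_arrays=1):
    """Drop control probes and consistently low probes; rows are kept whole."""
    return {
        probe: values
        for probe, values in expr.items()
        if probe not in controls and keep_probe(values, threshold, min_arrays)
    }


# Made-up normalized log2 intensities: one row per probe, one value per array.
expr = {
    "probe_A": [9.1, 8.7, 9.4, 9.0],        # expressed on every array
    "probe_B": [4.1, 3.9, 4.3, 4.0],        # low on every array
    "probe_C": [4.2, 4.0, 8.8, 9.1],        # low in one condition only
    "control_1": [12.0, 12.1, 11.9, 12.2],  # control probe
}
controls = {"control_1"}

filtered = filter_probes(expr, controls, threshold=6.0)
print(sorted(filtered))
```

Here probe_C is kept with all four of its values, including the two low ones, because the decision is made per probe rather than per probe and array. Only probe_B, which is low everywhere, and the control probe are removed.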

Gordon Smyth 50k • 15.3 years ago

Hi Gordon,

As someone who has been dealing more and more with raw data, I always appreciate detailed answers from the masters, such as the one you just wrote. Even after reading several of the published articles regarding these normalization practices, I always find these less formal emails quite helpful.

That said, one point you mention isn't exactly clear to me, and I'm wondering if you could elaborate just a bit here:

> Filtering non-expressed probes tends not to be emphasised on this list
> because users of this list are often sophisticated enough to use
> variance stabilizing normalization methods such as rma, vsn, normexp
> or vst. This means that low-expression filtering is done more for
> multiplicity issues than for variance stabilization, and therefore
> often doesn't make a huge difference. When using earlier
> normalization methods such as MAS for Affy or local background
> correction for two-color arrays, expression-filtering is absolutely
> essential, because the normalized expression values are so unstable
> at low intensity levels.

When you say "... low-expression filtering is done more for multiplicity issues than for variance stabilization", what exactly do you mean by "multiplicity issues"?

Thanks,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
http://cbio.mskcc.org/~lianos

Steve Lianoglou ★ 13k • 15.3 years ago

Hi Steve,

The question wasn't really asked of me, but Gordon is likely in bed right now ;-D

By multiplicity issues Gordon was referring to the multiple comparisons problem. A p-value is the probability, assuming there is really no difference, of seeing a result at least as extreme as the one observed. If we reject the null hypothesis at an alpha level of 0.05, we are in essence accepting a 5% chance of a type 1 error, in which we say there is a difference when in fact there isn't (a false positive).

For one test this isn't a problem, but as you make more and more tests simultaneously, you expect to see more and more false positives (e.g., if you do 20 tests at an alpha of 0.05, and there are really no differences for any of the tests, you still expect about one of them to appear significant even though none are).

There are lots of ways to adjust for multiple comparisons, but one of the best things you can do is not make so many comparisons in the first place, by filtering out data based on one or more criteria.

Best,

Jim
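The arithmetic behind the 20-test example can be written out as a short sketch. The function names are hypothetical, the probe counts of 20000 and 10000 are invented to show the effect of filtering, and the second function additionally assumes that the tests are independent and that no probe is truly different.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


alpha = 0.05
# 1 and 20 tests match the example in the text; the larger counts are
# made-up probe numbers before and after a hypothetical filter.
for n_tests in (1, 20, 20000, 10000):
    expected = expected_false_positives(n_tests, alpha)
    at_least_one = prob_at_least_one_false_positive(n_tests, alpha)
    print(f"{n_tests:>6} tests: expect {expected:8.1f} false positives, "
          f"P(at least one) = {at_least_one:.3f}")
```

With 20 tests the expected count is about one, as stated above, and halving the number of probes tested halves the expected number of false positives at a fixed alpha.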

James W. MacDonald 65k • 15.3 years ago

Thanks for the brilliant answer. Very interesting stuff. The only other question I would like to ask concerning this is: how do you define a probe as non-expressed? Is this done by inspecting some kind of plot (e.g. an MA plot), by removing a fixed percentage of probes, or by using some absolute value known from experience? For Affy arrays you can use the DABG results, but I am not sure what the correct approach would be with two-colour microarrays.

Many thanks

Dan

Daniel Brewer ★ 1.9k • 15.3 years ago

Hi,

My preference is not to get into the discussion of expressed or non-expressed, as those terms relate to the mRNA, and all we see is whether something stuck to a particular spot (whose sequence we think we know). That distinction is also somewhat irrelevant to most questions (though not all).

A somewhat simpler criterion is whether or not the spot/gene varies across samples. If it does not, then it carries no information about any phenotype, and it is not clear that you would be interested in modeling with it. Certainly all non-expressed genes (modulo artifacts) will have relatively constant expression.

Best wishes

Robert

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem@fhcrc.org
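A filter based on variation across samples, as suggested above, might look like the following minimal sketch. It uses only the Python standard library; the probe names, the log2 values, the cutoff of 0.5 and the helper `filter_by_variance` are assumptions for illustration, and the thread does not prescribe a particular cutoff or variability measure.

```python
from statistics import variance


def filter_by_variance(expr, min_variance):
    """Keep probes whose sample variance across arrays exceeds `min_variance`."""
    return {
        probe: values
        for probe, values in expr.items()
        if variance(values) > min_variance
    }


# Made-up normalized log2 intensities: one row per probe, one value per array.
expr = {
    "probe_A": [9.1, 8.7, 9.4, 9.0],  # high but nearly constant
    "probe_B": [4.1, 3.9, 4.3, 4.0],  # low and nearly constant
    "probe_C": [4.2, 4.0, 8.8, 9.1],  # changes between conditions
}

varying = filter_by_variance(expr, min_variance=0.5)
print(sorted(varying))
```

Note that this criterion also removes probe_A, which is clearly above background but flat across the arrays, so it answers a different question from an intensity-based filter.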

rgentleman ★ 5.5k • 15.3 years ago
