Genefilter parameters for mouse 430 2
@richard-friedman-513
Last seen 9.6 years ago
Dear Bioconductor Users,

I am using genefilter to filter an ExpressionSet of 4 Mouse 430 2 chips preprocessed with gcrma, prior to analysis with limma.

Here is a description of the ExpressionSet:

    > xen2dataeset
    ExpressionSet (storageMode: lockedEnvironment)
    assayData: 45101 features, 4 samples
      element names: exprs
    phenoData
      sampleNames: A_xen_1_21.cel, A_xen_2_22.cel, D_nodal_1_27.cel, D_nodal_2_28.cel
      varLabels and varMetadata description:
        sample: arbitrary numbering
    featureData
      featureNames: 1415670_at, 1415671_at, ..., AFFX-r2-P1-cre-5_at (45101 total)
      fvarLabels and fvarMetadata description: none
    experimentData: use 'experimentData(object)'
    Annotation: mouse4302

Here is my session information:

    > sessionInfo()
    R version 2.6.1 (2007-11-26)
    i386-apple-darwin8.10.1

    locale:
    en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] splines stats graphics grDevices utils datasets methods
    [8] base

    other attached packages:
     [1] mouse4302probe_2.0.0 mouse4302cdf_2.0.0 mouse4302.db_2.0.2
     [4] limma_2.12.0 geneplotter_1.16.0 lattice_0.17-2
     [7] annotate_1.16.1 AnnotationDbi_1.0.6 RSQLite_0.6-3
    [10] DBI_0.2-3 RColorBrewer_1.0-1 affyPLM_1.14.0
    [13] xtable_1.5-2 simpleaffy_2.14.05 gcrma_2.10.0
    [16] matchprobes_1.10.0 genefilter_1.16.0 survival_2.34
    [19] annaffy_1.10.1 KEGG_2.0.1 GO_2.0.1
    [22] affy_1.16.0 preprocessCore_1.0.0 affyio_1.6.1
    [25] Biobase_1.16.3

    loaded via a namespace (and not attached):
    [1] KernSmooth_2.22-21 grid_2.6.1 tools_2.6.1

I have tried the filtering parameters in the article by Scholtens and Heydebreck on p. 233 of the book by Gentleman et al.:

    > f1 <- pOverA(0.25, log2(100))
    > f2 <- function(x) (IQR(x) > 0.5)
    > ff <- filterfun(f1, f2)
    > selected <- genefilter(xen2dataeset, ff)
    > sum(selected)
    [1] 289

This seemed a bit small, so I tried the effect of each of the parameters individually:

    > selectedp025A <- genefilter(xen2dataeset, f1)
    > sum(selectedp025A)
    [1] 9681
    > selectedIQRgtp5 <- genefilter(xen2dataeset, f2)
    > sum(selectedIQRgtp5)
    [1] 731

My questions:

1. Is the log2(100) intensity cutoff good for all chips? If not, can someone recommend a good intensity cutoff for mouse 4302?

2. Is the only effect of filtering to reduce the multiplier in the false discovery analysis, OR does it reduce false positives in other ways:
   A. In the case of intensity filters, by reducing the number of large fold changes resulting from the ratios of small numbers.
   B. In the case of IQR filters, by eliminating large t-statistics resulting from genes with small variation across samples but fortuitously low standard deviations.

Up until this time I have not filtered, because the filtering parameters looked arbitrary and I thought that it was cheating to reduce the number of tests used to compute the FDR. From reading and further reflection I now believe otherwise. But whereas I now believe I should filter, I am not at all sure what parameters to use, or how sensitive my final list of differentially expressed genes will be to the choice of those parameters. In particular, I wonder if the intensity filter cutoff should vary with chip type and preprocessing method (e.g. GCRMA).

Any thoughts and guidance would be appreciated.

Thanks as always,
Rich

------------------------------------------------------------
Richard A. Friedman, PhD
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer
Department of Biomedical Informatics (DBMI)
Educational Coordinator
Center for Computational Biology and Bioinformatics (C2B2)
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Box 95, Room 130BB or P&S 1-420C
Columbia University Medical Center
630 W. 168th St.
New York, NY 10032
(212)305-6901 (5-6901) (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

"Sure I am willing to stop watching television to get a better education."
-Rose Friedman, age 11
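To make the drop from 9681 and 731 to 289 easier to picture, here is a minimal base-R sketch of what the two filters test and why their intersection is smaller than either alone. The five-probeset matrix is made up for illustration, and `p_over_a` and `iqr_over` are hypothetical stand-ins written to mimic `pOverA()` and the `f2` function above, not the genefilter code itself.

```r
# Toy log2 expression matrix: 5 hypothetical probesets x 4 samples.
expr <- rbind(
  p1 = c(3.0, 3.1, 3.0, 3.2),   # dim, flat
  p2 = c(9.0, 9.1, 9.0, 9.1),   # bright, flat
  p3 = c(3.0, 3.2, 5.5, 5.8),   # dim, variable
  p4 = c(7.0, 7.2, 9.5, 9.9),   # bright, variable
  p5 = c(6.0, 6.1, 6.9, 6.2)    # one sample above the cutoff, low spread
)

# Stand-in for pOverA(p, A): at least a proportion p of samples exceed A.
p_over_a <- function(x, p, A) mean(x > A) >= p
# Stand-in for f2: interquartile range above a cutoff.
iqr_over <- function(x, cutoff) IQR(x) > cutoff

cutoff <- log2(100)
pass_intensity <- apply(expr, 1, p_over_a, p = 0.25, A = cutoff)
pass_iqr       <- apply(expr, 1, iqr_over, cutoff = 0.5)
pass_both      <- pass_intensity & pass_iqr   # what filterfun(f1, f2) amounts to

counts <- c(intensity = sum(pass_intensity),
            iqr       = sum(pass_iqr),
            both      = sum(pass_both))
print(counts)
```

Only the probeset that is both bright and variable survives the combined filter, so a small intersection is expected whenever brightness and variability are not strongly associated.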
Cancer mouse4302 genefilter limma gcrma • 1.3k views
@james-w-macdonald-5106
Last seen 12 hours ago
United States
Hi Richard,

Richard Friedman wrote:
> 1. Is the log2(100) intensity cutoff good for all chips?
> If not can someone recommend a good intensity cutoff for mouse 4302.

That depends. If you are using rma(), then no ;-P

Seriously, this depends on the data in hand. If you have some really dim chips then maybe it is too high. The problem with filtering is that it can be pretty ad hoc, so it's difficult to come up with a hard and fast rule.

You might try something like

    eset2 <- nsFilter(eset)$eset

and see how many probesets you end up with.

> 2, Is the only effect of filtering to reduce the multiplier in the false discovery
> analysis OR does it reduce false positives in other ways by
> A. In the case of intensity filters by reducing the number of large fold changes resulting
> from the ratios of small numbers.
> B. In the case of IQR filters eliminating large t-statistics resulting for genes with small variation
> across samples but fortuitously low standard deviations,

Yes and yes, to a certain extent. If you are just doing fold changes, you might consider filtering on each fold change rather than overall. For instance you could create a filter

    filt <- filterfun(kOverA(1, 100))

that you would then use for each fold change comparison to ensure that at least one of the samples had an expression > 100. Shameless plug - see foldFilt() in affycoretools.

If you are doing t-stats with a very small number of replicates (like 2 vs 2), then you should be using limma, in which case over-filtering the data can be detrimental as well. The reason is that the prior will be estimated on all the probesets that remain, and if all you have are highly variable probesets then the prior will be larger than you might want. I have seen cases with very small numbers of replicates where using all the data on the chip resulted in many more significant probesets than if I did what I thought was a reasonable filter.

Of course the question remains: is more better? And if more is better, does that mean the ideal would be to find all probesets differentially expressed? Probably not, so we are back to the usual prescriptions: check your data carefully. Make sure your results are sensible. Do EDA to ensure that you don't have some wacky chip messing things up. Check your code to be sure that you haven't made the kind of errors that I like to make. Consult with the experimenter to see if very few genes should be changing (or be expressed at all).

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
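The "multiplier" part of question 2 can be seen with plain arithmetic. The sketch below uses base R's `p.adjust()` with Benjamini-Hochberg adjustment on made-up p-values; the 1000 and 100 test counts are arbitrary, and the filter is assumed to be nonspecific (chosen without looking at the test results). It only shows the multiplicity effect, not the effect of filtering on limma's prior discussed above.

```r
# Hypothetical p-values: one small p-value among 999 unremarkable ones.
p_all <- c(0.0004, seq(0.2, 1, length.out = 999))

# Suppose a nonspecific filter keeps 100 probesets, including the first one.
p_kept <- p_all[1:100]

adj_all  <- p.adjust(p_all,  method = "BH")
adj_kept <- p.adjust(p_kept, method = "BH")

# The same raw p-value gets a smaller adjusted value when fewer tests remain.
print(c(unfiltered = adj_all[1], filtered = adj_kept[1]))
```

With these numbers the same raw p-value goes from well above to below a 0.05 FDR cutoff purely because the number of tests dropped tenfold.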
Jim,

Thank you for your detailed and helpful reply.

On Mar 19, 2008, at 4:52 PM, James W. MacDonald wrote:
> That depends. If you are using rma(), then no ;-P

What about gcrma?

> You might try something like
>
> eset2 <- nsFilter(eset)$eset
>
> and see how many probesets you end up with.

I have tried

    > xen2nsSUB <- nsFilter(xen2dataeset)$xen2dataeset
    > sum(xen2nsSUB)
    [1] 0
    > xen2nsSUB
    NULL

I then tried the following (which I think is the correct way to do it):

    > xen2nsSUB2 <- nsFilter(xen2dataeset)
    > xen2nsSUB2
    $eset
    ExpressionSet (storageMode: lockedEnvironment)
    assayData: 623 features, 4 samples
      element names: exprs
    phenoData
      sampleNames: A_xen_1_21.cel, A_xen_2_22.cel, D_nodal_1_27.cel, D_nodal_2_28.cel
      varLabels and varMetadata description:
        sample: arbitrary numbering
    featureData
      featureNames: 1428670_at, 1457283_at, ..., 1426849_at (623 total)
      fvarLabels and fvarMetadata description: none
    experimentData: use 'experimentData(object)'
    Annotation: mouse4302

    $filter.log
    $filter.log$numDupsRemoved
    [1] 77

    $filter.log$numLowVar
    [1] 41507

    $filter.log$feature.exclude
    [1] 13

    $filter.log$numRemoved.ENTREZID
    [1] 2881

> If you are just doing fold changes, you might consider filtering on each fold change rather
> than overall. For instance you could create a filter
>
> filt <- filterfun(kOverA(1, 100))
>
> that you would then use for each fold change comparison to ensure that at least one of the
> samples had an expression > 100. Shameless plug - see foldFilt() in affycoretools.

I think that that is basically what I did with the genefilter pOverA(0.25, log2(100)) described in my first note (0.25 of 4 = 1). Or am I getting something wrong?

> If you are doing t-stats with a very small number of replicates (like 2 vs 2), then you should
> be using limma, and in which case over-filtering the data can be detrimental as well.

I am using limma. I asked for at least 3 replicates from the experimentalist but she only gave me 2 (story of my life).

I got 731 with just the variance filter, 289 with the log2 filter, and 619 with nsFilter.

How many probesets do I need for limma to function properly? Do the above numbers seem too small?

> Do EDA to ensure that you don't have some wacky chip messing things up.

What is EDA? I did all of the quality measures in simpleaffy and in affyPLM and the chips look fine.

> Check your code to be sure that you haven't made the kind of errors that I like to make.
> Consult with the experimenter to see if very few genes should be changing (or be expressed at all).

I have done both of these things.

Here is my present understanding of the situation following your note:

1. Filtering by variance and the other nsFilter parameters is good unless it leads to too few probesets.

Q1. How many are too few?

Q2. Is it advisable to use an intensity filter, but log2(100) on at least one chip is too high on dim chips?
Q2B. Is there a way to quantify dimness so I know how to adjust the cutoff?
Q2C. Or am I better off filtering only on variance, or variance plus the nsFilter defaults?

THANKS!
Rich
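The `NULL` and the `0` in the first attempt follow from how `$` works on an R list. The sketch below is base R only: `res` is a hypothetical stand-in built to resemble the printed `nsFilter()` result (a list with `eset` and `filter.log` components), with a plain matrix in place of the real ExpressionSet.

```r
# Stand-in for the list returned by nsFilter(); the real 'eset' component
# is an ExpressionSet, here replaced by a 623 x 4 matrix for illustration.
res <- list(
  eset       = matrix(0, nrow = 623, ncol = 4),
  filter.log = list(numLowVar = 41507)
)

wrong <- res$xen2dataeset   # no component has this name, so $ returns NULL
right <- res$eset           # the component is named 'eset', whatever the input was called

print(is.null(wrong))   # the "NULL" seen at the console
print(sum(wrong))       # sum(NULL) is 0, which explains the "[1] 0"
print(dim(right))
```

The component name is fixed by the function, so it does not change with the name of the object passed in.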
Hi Rich,

Richard Friedman wrote:
> what about gcrma.

Same diff. The maximum with either will be ~14, so filtering on 100 will remove everything.

> I have tried
>
> > xen2nsSUB<-nsFilter(xen2dataeset)$xen2dataeset
> > sum(xen2nsSUB)
> [1] 0
> > xen2nsSUB
> NULL

Yup. That should be

    xen2nsSUB <- nsFilter(xen2dataeset)$eset

if you just want the resulting ExpressionSet.

> I think that that is basically what I did with genefilter pOverA(0.25,log2(100)
> described in my first note (.25 of 4 =1). Or am I getting somehing wrong.

Well, that isn't what you did (or maybe it is what you did, but you didn't do what I am suggesting). If you are doing fold change calculations then you (IMO) only care about the two things under consideration. So if you have something like this:

    Samples   1    2    3     4
    expval    30   85   1500  2500

Then what you did will nuke that probeset. However, the comparisons for 1v3, 1v4, 2v3, 2v4 and 3v4 are probably quite useful. The only one you don't care about is 1v2, which will give a high fold change but it is probably not meaningful.

> I am using Limma. I asked for at least 3 sreplicates from the
> experimentalist but she only gave me 2 (story of my life).

LOL. Sounds like the story of my life as well. Unfortunately, I am not the arbiter of scientific integrity, so I just analyze what I get...

> I got 731 with just the variance filter, 289 with the log2 filter, and 619 with nsFilter.
>
> How many probesets do I need for limma to function properly? Do the above numbers seem to small.

I think this is one of those double-edged sword type things. Fewer probesets are better for multiplicity corrections, but if you just have the variable ones left maybe the prior will be bigger than you want. I don't think I can give you a hard number. I'm just pointing out that there are two considerations to be made and that it might not be ideal to go to one or the other extreme.

> What is EDA? I did all of the quality measures in simpleaffy and in
> AffyPLM and the chips look fine.

Exploratory Data Analysis. Stats 101: look at the raw data and see if there is some kind of wackiness going on. The above is usually what I do as well, plus a PCA shot to make sure my replicates are all friendly-like.

The limma package now allows you to weight chips (?arrayWeights), so that is really nice if you have some sketchy chips in there that you can't really afford to nuke.

Best,

Jim
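As a rough illustration of the "PCA shot" mentioned above, here is a base-R sketch using `prcomp()` on simulated data. Everything about the data is an assumption made for the example (200 probesets, 20 of them shifted between the two groups, small replicate noise); the sample names A1, A2, D1, D2 are hypothetical and merely echo the 2 vs 2 design.

```r
set.seed(1)
n_probes <- 200

# Simulated log2 expression: a shared baseline plus a group shift in 20 probesets.
base  <- rnorm(n_probes, mean = 7, sd = 2)
shift <- c(rep(2, 20), rep(0, n_probes - 20))
expr <- cbind(
  A1 = base + rnorm(n_probes, sd = 0.1),
  A2 = base + rnorm(n_probes, sd = 0.1),
  D1 = base + shift + rnorm(n_probes, sd = 0.1),
  D2 = base + shift + rnorm(n_probes, sd = 0.1)
)

# prcomp() expects observations in rows, so transpose to put samples in rows.
pca <- prcomp(t(expr))
pc1 <- pca$x[, "PC1"]
print(round(pc1, 2))

# Replicates are "friendly" if they sit on the same side of PC1 as each other.
replicates_agree <- sign(pc1[["A1"]]) == sign(pc1[["A2"]]) &&
                    sign(pc1[["D1"]]) == sign(pc1[["D2"]])
print(replicates_agree)
```

On real data one would usually plot the first two components rather than only inspect signs; a replicate that lands far from its partner is the kind of chip the array weights mentioned above are meant to downweight.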
Jim,

Thanks again for your quick and helpful reply. I have some disagreements and a further question.

On Mar 19, 2008, at 11:27 PM, James W. MacDonald wrote:
> Same diff. The maximum with either will be ~14, so filtering on 100
> will remove everything.

I filtered on log2(100) = 6.64, which is well under 14. Based upon this filter alone I got 9681 probesets.

This is about 25% of the probesets. I guess I am still wondering if there is a way of taking the intensity curve into account in setting the cutoff.

> Yup. That should be
>
> xen2nsSUB <- nsFilter(xen2dataeset)$eset
>
> if you just want the resulting ExpressionSet.

Most helpful!

> Well, that isn't what you did (or maybe it is what you did, but you
> didn't do what I am suggesting). If you are doing fold change
> calculations then you (IMO) only care about the two things under
> consideration. So if you have something like this:
>
> Samples   1    2    3     4
> expval    30   85   1500  2500
>
> Then what you did will nuke that probeset. However, the comparisons
> for 1v3, 1v4, 2v3, 2v4 and 3v4 are probably quite useful. The only
> one you don't care about is 1v2, which will give a high fold change
> but it is probably not meaningful.

I fear that I don't understand filterfun. When I used kOverA(1, log2(100)) instead of pOverA above, I got the same number of probesets as I did with pOverA(0.25, log2(100)) (9681).

As I understand it, pOverA(0.25, 100) would not eliminate this probeset, because at least 25% of the samples are above 100.

Best wishes,
Rich
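The identical counts are what one would expect with four samples, since "at least 25% of 4" and "at least 1 of 4" are the same condition. A quick base-R check on random data is sketched below; `k_over_a` and `p_over_a` are hypothetical stand-ins written to mimic `kOverA()` and `pOverA()`, and the uniform values between 2 and 14 are an arbitrary choice for a log2 scale.

```r
set.seed(42)

# 1000 hypothetical probesets x 4 samples on a log2-like scale.
expr <- matrix(runif(4000, min = 2, max = 14), ncol = 4)
A <- log2(100)   # about 6.64

# Stand-ins for the genefilter helpers.
k_over_a <- function(x, k, A) sum(x > A) >= k    # at least k samples above A
p_over_a <- function(x, p, A) mean(x > A) >= p   # at least proportion p above A

by_k <- apply(expr, 1, k_over_a, k = 1, A = A)
by_p <- apply(expr, 1, p_over_a, p = 0.25, A = A)

print(identical(by_k, by_p))   # same probesets pass either way with 4 samples
print(sum(by_k))
```

With a different number of samples the two would diverge, for example 0.25 of 8 samples requires two values above the cutoff.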
Hi Rich,

Richard Friedman wrote:
> I filtered on log2(100)=6.64, which is well under 14. Based upon this
> filter alone I got 9681 probesets.
>
> This as about 25% of the probesets. I guess I still am wondering if
> there is a way of taking the intensity curve into account in setting the cutoff.

Ah. I missed the log2() part, and assumed you were using MAS5 numbers. My bad.

> I fear that I don't understand filterfun. when I used kOverA(1,log2(100)
> instead of pOverA above, I get the same # of probesets as I did
> with pOverA(.25,log2(100)) (9681).
>
> As I understand pOverA(.25, 100) it would not elminate this probeset
> because at least 25% is above 100).

Again, my bad. You are correct that the pOverA() and kOverA() filters will be the same. But my main contention (that overall filtering is less useful if you are doing fold change analyses) remains.

In your case this is a moot point since you do have duplicates. However, if you just had single samples (say a control and three treatments), then filtering using either kOverA(1, log2(100)) or pOverA(0.25, log2(100)) can still end up giving bad results. Say the data were slightly different:

    Samples   1   2   3   4
    expval    2   4   6   10

Note these are log2 data. Your filter would keep this probeset, and in each of the 1v2, 1v3 and 2v3 comparisons you would get at least four-fold differences even though you really didn't want to see this sort of thing. However, if you did a kOverA(1, log2(100)) as a filter for each fold change, you would only end up with comparisons that involved sample 4, which is what you would like (hence my shameless plug for foldFilt()).

Best,

Jim
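The per-comparison idea can be worked through on the four log2 values from the example. This is a base-R sketch only, not the affycoretools `foldFilt()` implementation; the sample labels `s1` to `s4` and the data frame layout are choices made for the illustration.

```r
# The four log2 expression values from the example, one probeset.
x <- c(s1 = 2, s2 = 4, s3 = 6, s4 = 10)
A <- log2(100)

# An overall "at least one sample above A" filter keeps the probeset.
overall_keep <- any(x > A)

# Per-comparison filter: keep a pair only if at least one of its two samples exceeds A.
pairs <- t(combn(names(x), 2))
res <- data.frame(
  a       = pairs[, 1],
  b       = pairs[, 2],
  log2_fc = unname(x[pairs[, 2]] - x[pairs[, 1]]),
  keep    = unname(x[pairs[, 1]] > A | x[pairs[, 2]] > A)
)
print(res)

kept <- paste(res$a, res$b, sep = "v")[res$keep]
print(kept)
```

The comparisons among the three dim samples show log2 differences of 2 to 4 (four-fold to sixteen-fold) yet are dropped, while every comparison that survives involves sample 4.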