advice on absent present filtering needed


Kimpel, Mark W ▴ 890

@kimpel-mark-w-727


I have a question about how to properly apply the MAS5 absent/present filtering technique. Within my group, I am advocating setting a cutoff on the absent/present ratio across phenotypes (i.e. over all samples), whereas a colleague is advocating applying the filter within phenotype and passing through any probeset with an A/P ratio of >0.5 within any of the phenotypes (we have 3).

The argument my colleague makes is that some probesets may be expressed in only one phenotype, and we want to keep these in while being stringent within phenotype. This makes some biological sense, but I am concerned that filtering within phenotype will introduce bias at low expression levels, since it would seem, at least in some cases, to act like a fold-change filter at expression levels near the limit of reliable detection.

Advice?

Mark

Mark W. Kimpel MD
Department of Psychiatry, Indiana University School of Medicine
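To make the two proposals concrete, here is a minimal sketch in base R on a toy call matrix. The probeset names, the three group labels, and the reading of "A/P ratio of >0.5" as "more than half of the samples called present" are all assumptions made for illustration; they are not taken from the poster's data.

```r
# Toy detection calls: rows are probesets, columns are samples.
# "P" = present, "A" = absent (marginal calls are left out for simplicity).
calls <- rbind(
  all_present = c("P", "P", "P",  "P", "P", "P",  "P", "P", "P"),
  one_group   = c("P", "P", "P",  "A", "A", "A",  "A", "A", "A"),
  scattered   = c("P", "A", "A",  "P", "A", "A",  "P", "A", "A"),
  all_absent  = c("A", "A", "A",  "A", "A", "A",  "A", "A", "A")
)
pheno <- factor(rep(c("g1", "g2", "g3"), each = 3))  # hypothetical phenotypes

present <- calls == "P"

# Filter 1: fraction present over all samples, ignoring phenotype.
keep_overall <- rowMeans(present) > 0.5

# Filter 2: fraction present within each phenotype; keep if any group exceeds 0.5.
frac_by_group <- sapply(levels(pheno), function(g) {
  rowMeans(present[, pheno == g, drop = FALSE])
})
keep_within <- apply(frac_by_group > 0.5, 1, any)

print(cbind(keep_overall, keep_within))
```

The `one_group` row is the case the colleague is worried about, and the `scattered` row has the same total number of present calls but no phenotype in which it is consistently detected.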


ADD COMMENT • link updated 17.5 years ago by Naomi Altman ★ 6.0k • written 17.5 years ago by Kimpel, Mark W ▴ 890


James W. MacDonald 65k

@james-w-macdonald-5106


United States

Hi Mark,

Personally, I am conflicted about pre-filtering data, both about the need for it and about the methods used to do it. When I do filter, I generally try to stick with 'agnostic' methods that don't account for the sample types.

If one is selecting genes based on a standard t-test, then clearly there needs to be some pre-filtering because (usually) there isn't much replication, and one would want to guard against selecting genes based on the very poor variance estimates that result. However, if you use one of the available shrinkage estimators (the eBayes() method in limma, for one), then the shrinkage estimator is based on _all_ the probesets on the chip. If you remove the probesets that don't vary much, then you are biasing the shrinkage estimator that you will use in the subsequent eBayes() step.

I am also not convinced that a comparison between PM and MM expression levels is a reasonable measure of transcript presence. Since 30-40% of the MM probes on a given chip have larger intensity values than the corresponding PM probe, I worry that one might end up throwing out probesets based on bad MM probes rather than lack of information. I do realize that MAS5 uses the ideal mismatch (IM) rather than the MM intensity, but the algorithm used to come up with the IM is a bit ad hoc for my tastes.

In the past, I tended to use the kOverA() method available in genefilter. This is agnostic in that it doesn't require any particular subset of samples to have higher expression, but does require that _some_ samples do. One could argue that this isn't that reasonable because of the cutoff imposed, which presupposes that a probeset with an expression below X isn't interesting.

Lately, if I do filter, I have been filtering probesets based on the variance over all samples. If the variance isn't greater than some ad hoc value (usually 0.1 for rma numbers), then it's outta there. This is probably a bit more defensible because I am not directly specifying an expression cutoff, but using variance instead of, say, standard deviation does tend to favor probesets with a larger average expression. However, a plot of mean expression vs variance indicates to me that this effect is not overwhelming.

Anyway, after all that rambling, I would say that you are probably advocating the better of the two filtering procedures. Although your colleague has a point, I think that method might bias your results. You could split the difference and require 33.33333% of the samples to be present ;-D

Best,

Jim

--
James W. MacDonald
University of Michigan
Affymetrix and cDNA Microarray Core

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
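The two label-agnostic filters described in this reply can be sketched in base R as follows. The expression matrix is simulated, the cutoffs `k` and `A` are hypothetical, and the k-over-A test is written out by hand rather than through genefilter so that the sketch has no package dependencies; only the 0.1 variance threshold comes from the reply itself.

```r
set.seed(1)

# Simulated log2-scale expression matrix: 1000 probesets x 9 samples (made-up data).
expr <- matrix(rnorm(1000 * 9, mean = 7, sd = 0.2), nrow = 1000)
# Give the first 50 probesets real spread across samples.
expr[1:50, ] <- expr[1:50, ] + matrix(rnorm(50 * 9, sd = 1.5), nrow = 50)

# k-over-A: keep a probeset if at least k samples exceed A,
# whichever samples those happen to be.
k <- 3  # hypothetical
A <- 8  # hypothetical
keep_koa <- rowSums(expr > A) >= k

# Variance filter: keep a probeset if its variance over all samples
# exceeds an ad hoc threshold (0.1 is the value quoted for RMA-scale data).
v <- apply(expr, 1, var)
keep_var <- v > 0.1

# Neither filter looks at the sample labels.
print(table(k_over_A = keep_koa, variance = keep_var))
```

Plotting `rowMeans(expr)` against `v` on real data is one way to check how strongly the variance filter tracks average intensity, which is the concern raised about using variance rather than standard deviation.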

ADD COMMENT • link 17.5 years ago James W. MacDonald 65k


Naomi Altman ★ 6.0k

@naomi-altman-380


United States

Your colleague is right. Surely it is important to know if some genes express only in certain phenotypes. Your method loses this information.

--Naomi

Naomi S. Altman
Associate Professor, Dept. of Statistics
Penn State University

ADD COMMENT • link 17.5 years ago Naomi Altman ★ 6.0k


I concur. You do not want to throw out genes that are expressed in only one phenotype. If you plot a histogram of the number of present calls for each gene, you will see that the vast majority of genes are either present in all samples or absent in all samples. Your choice of filter will affect only the small number of genes in between. To be conservative, I keep a gene even if it is present in only 1 sample, so I don't even consider phenotype. The difference really will only affect a few hundred genes, which won't matter too much in terms of FDR correction, so I say be conservative so you don't throw out a gene that is expressed in only one phenotype.

To check the histogram:

    library(affy)  # provides mas5calls(); exprs() comes from Biobase, which affy loads
    calls.eset <- mas5calls(abatch)  # abatch is your AffyBatch of raw data
    # number of "P" (present) calls per probeset across all samples
    hist(apply(exprs(calls.eset), 1, function(x) sum(x == "P")))

Cheers,
Jenny

Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
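As a rough illustration of the point about the in-between genes, the sketch below counts how many probesets could be affected by the choice of cutoff at all. The call matrix is simulated with made-up proportions (45% always present, 45% never present, 10% mixed), so the counts say nothing about real arrays; on real data `calls` would come from `exprs(calls.eset)`.

```r
set.seed(42)
n_samples <- 9

# Made-up call matrix: most probesets are always present or always absent,
# and a minority are mixed. The proportions are illustrative only.
kind <- sample(c("always", "never", "mixed"), 2000, replace = TRUE,
               prob = c(0.45, 0.45, 0.10))
p_present <- c(always = 1, never = 0, mixed = 0.5)[kind]
calls <- t(sapply(p_present, function(p) ifelse(runif(n_samples) < p, "P", "A")))

n_present <- rowSums(calls == "P")

# Distribution of present counts (the numbers behind the histogram).
print(table(factor(n_present, levels = 0:n_samples)))

# Most permissive filter: keep anything called present at least once.
keep_any <- n_present >= 1

# Probesets whose fate depends on the cutoff: neither all-present nor all-absent.
in_between <- n_present > 0 & n_present < n_samples
cat(sum(keep_any), "kept;", sum(in_between), "depend on the cutoff\n")
```

Only the `in_between` probesets can be treated differently by an across-sample filter and a within-phenotype filter, so their count bounds how much the two approaches can disagree.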

ADD REPLY • link 17.5 years ago Jenny Drnevich ★ 2.2k


Jenny and Naomi,

Thank you for your replies and code (Jenny). I did not mean to imply that I would throw out probesets that are present in only one phenotype, but rather that I would keep any probeset with the same number of present calls counted across all samples.

Say, for example, we have 10 samples with 5 in each phenotype, and we decide to pass through our filter any probeset that is present in at least 80% of the samples (4 of 5) within a phenotype. I would argue that, to reduce bias, we should instead construct the filter so that it passes probesets that are present in 4 out of 10 samples, regardless of phenotype. This is actually a more generous filter, but it would seem to better preserve the underlying statistical distribution of the data, which the BH FDR method depends on.

I do recognize that, by passing a few more genes through the filter, we will end up raising the calculated FDR of all probesets tested and may thereby end up with fewer significant probesets. But would we have more confidence in those?

I also recognize that these effects will be subtle and perhaps have little practical effect, but I do want to be rigorous in my approach.

Your responses?

Thanks,
Mark

Mark W. Kimpel MD
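The 10-sample example can be checked by brute force. The sketch below enumerates every possible present/absent pattern for a single probeset and compares the within-phenotype rule (at least 4 of 5 in either group) with the overall rule (at least 4 of 10). The group labels and the small p-value vectors at the end are invented purely to show how the BH adjustment responds to the number of tests carried forward.

```r
# 10 samples, 5 per phenotype, as in the example above.
pheno <- rep(c("a", "b"), each = 5)

# Every possible present/absent pattern for one probeset (2^10 rows).
patterns <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), 10)))

n_a <- rowSums(patterns[, pheno == "a"])
n_b <- rowSums(patterns[, pheno == "b"])

within_filter  <- n_a >= 4 | n_b >= 4  # present in >= 4 of 5 in either phenotype
overall_filter <- (n_a + n_b) >= 4     # present in >= 4 of 10 overall

# Every pattern that passes the within-phenotype rule also passes the overall rule.
stopifnot(all(overall_filter[within_filter]))

# Patterns only the overall rule lets through, e.g. 2 present in each phenotype.
extra <- sum(overall_filter & !within_filter)
print(extra)

# BH adjustment depends on how many tests are carried forward: the same small
# p-values are adjusted upward when more probesets pass the filter.
p_few  <- c(0.001, 0.01, 0.02)     # invented p-values
p_more <- c(p_few, 0.5, 0.7, 0.9)  # same, plus three extra unremarkable tests
adj_few  <- p.adjust(p_few,  method = "BH")
adj_more <- p.adjust(p_more, method = "BH")
print(rbind(adj_few, adj_more = adj_more[1:3]))
```

This only confirms the set relationship between the two rules and the direction of the FDR effect; whether the extra probesets help or hurt in practice depends on the data.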

ADD REPLY • link 17.5 years ago Kimpel, Mark W ▴ 890


An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20061026/5bc23c8b/attachment.pl

ADD REPLY • link 17.5 years ago Sharon Anbu ▴ 480
