removal of genes with low expression values
3
0
Entering edit mode
@hernando-martinez-4124
Last seen 9.6 years ago
Hi, I need to remove genes with low expression values from a expression matrix. I would like to remove those with an average of expression values less than a certain cut-off. I was thinking in computing the average for each row, create a list with the gene names for which their average is less than the cut-off, and remove those genes from the initial matrix. However, I have a couple of doubts that maybe you can help me with. Is there any package or function that makes this easier? And, does anyone know which cut-off to use for data normalized with RMA and for data normalized with MAS5? Thanks, -- Hernando Martínez Vergara [[alternative HTML version deleted]]
• 3.3k views
ADD COMMENT
0
Entering edit mode
@sean-davis-490
Last seen 12 weeks ago
United States
On Wed, Jul 7, 2010 at 6:28 AM, Hernando Martínez <hernybiotec@gmail.com>wrote: > Hi, I need to remove genes with low expression values from a expression > matrix. I would like to remove those with an average of expression values > less than a certain cut-off. I was thinking in computing the average for > each row, create a list with the gene names for which their average is less > than the cut-off, and remove those genes from the initial matrix. However, > I > have a couple of doubts that maybe you can help me with. Is there any > package or function that makes this easier? And, does anyone know which > cut-off to use for data normalized with RMA and for data normalized with > MAS5? Thanks, > > Take a look at the genefilter package. However, what you describe can be done easily with standard R. I don't think there is such a thing as a "standard" cutoff for microarray data. Sean [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Usually it can be filtered based on IQR and/or variance across samples, which might be worthy of thinking besides 'average' Yuan On 7 Jul 2010, at 11:47, Sean Davis wrote: > On Wed, Jul 7, 2010 at 6:28 AM, Hernando Mart?nez <hernybiotec at="" gmail.com=""> >wrote: > >> Hi, I need to remove genes with low expression values from a >> expression >> matrix. I would like to remove those with an average of expression >> values >> less than a certain cut-off. I was thinking in computing the >> average for >> each row, create a list with the gene names for which their average >> is less >> than the cut-off, and remove those genes from the initial matrix. >> However, >> I >> have a couple of doubts that maybe you can help me with. Is there any >> package or function that makes this easier? And, does anyone know >> which >> cut-off to use for data normalized with RMA and for data normalized >> with >> MAS5? Thanks, >> >> > Take a look at the genefilter package. However, what you describe > can be > done easily with standard R. > > I don't think there is such a thing as a "standard" cutoff for > microarray > data. > > Sean > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Isn't filtering on spread like pretesting for differential expression? Maybe not such a good idea. When using MAS5, it is traditional to use the Affy presence/absence score. (Not necessarily optimal ...) --Naomi At 06:53 AM 7/7/2010, Yuan Hao wrote: >Usually it can be filtered based on IQR and/or variance across >samples, which might be worthy of thinking besides 'average' > >Yuan > >On 7 Jul 2010, at 11:47, Sean Davis wrote: > >>On Wed, Jul 7, 2010 at 6:28 AM, Hernando >>Mart?nez <hernybiotec at="" gmail.com="">wrote: >> >>>Hi, I need to remove genes with low expression values from a >>>expression >>>matrix. I would like to remove those with an average of expression >>>values >>>less than a certain cut-off. I was thinking in computing the >>>average for >>>each row, create a list with the gene names for which their average >>>is less >>>than the cut-off, and remove those genes from the initial matrix. >>>However, >>>I >>>have a couple of doubts that maybe you can help me with. Is there any >>>package or function that makes this easier? And, does anyone know >>>which >>>cut-off to use for data normalized with RMA and for data normalized >>>with >>>MAS5? Thanks, >>> >>Take a look at the genefilter package. However, what you describe >>can be >>done easily with standard R. >> >>I don't think there is such a thing as a "standard" cutoff for >>microarray >>data. >> >>Sean >> >> [[alternative HTML version deleted]] >> >>_______________________________________________ >>Bioconductor mailing list >>Bioconductor at stat.math.ethz.ch >>https://stat.ethz.ch/mailman/listinfo/bioconductor >>Search the archives: >>http://news.gmane.org/gmane.science.biology.informatics.conductor > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111
ADD REPLY
0
Entering edit mode
hi Hernando, I suggest you have a read of: J. N. McClintick & H. J. Edenberg. Effects of filtering by Present call on analysis of microarray experiments.. BMC Bioinformatics, 2006 (7): 49. there's no standard cutoff, but if you plot histograms of normalised data, then can help you choose a cutoff cheers, mark On 07/07/2010, at 11:22 PM, Naomi Altman wrote: > Isn't filtering on spread like pretesting for differential expression? Maybe not such a good idea. > > When using MAS5, it is traditional to use the Affy presence/absence score. (Not necessarily optimal ...) > > --Naomi > > At 06:53 AM 7/7/2010, Yuan Hao wrote: >> Usually it can be filtered based on IQR and/or variance across >> samples, which might be worthy of thinking besides 'average' >> >> Yuan >> >> On 7 Jul 2010, at 11:47, Sean Davis wrote: >> >>> On Wed, Jul 7, 2010 at 6:28 AM, Hernando Mart?nez <hernybiotec at="" gmail.com="">wrote: >>> >>>> Hi, I need to remove genes with low expression values from a >>>> expression >>>> matrix. I would like to remove those with an average of expression >>>> values >>>> less than a certain cut-off. I was thinking in computing the >>>> average for >>>> each row, create a list with the gene names for which their average >>>> is less >>>> than the cut-off, and remove those genes from the initial matrix. >>>> However, >>>> I >>>> have a couple of doubts that maybe you can help me with. Is there any >>>> package or function that makes this easier? And, does anyone know >>>> which >>>> cut-off to use for data normalized with RMA and for data normalized >>>> with >>>> MAS5? Thanks, >>>> >>> Take a look at the genefilter package. However, what you describe >>> can be >>> done easily with standard R. >>> >>> I don't think there is such a thing as a "standard" cutoff for >>> microarray >>> data. >>> >>> Sean >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > Naomi S. Altman 814-865-3791 (voice) > Associate Professor > Dept. of Statistics 814-863-7114 (fax) > Penn State University 814-865-1348 (Statistics) > University Park, PA 16802-2111 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Hi, On Wed, Jul 7, 2010 at 6:22 AM, Naomi Altman <naomi at="" stat.psu.edu=""> wrote: > Isn't filtering on spread like pretesting for differential expression? > ?Maybe not such a good idea. No it isn't like pretesting and it is quite often a good idea. Please have a look at: Independent filtering increases detection power for high-throughput experiments. Bourgon R, Gentleman R, Huber W. Proc Natl Acad Sci U S A. 2010 May 25;107(21):9546-51. Epub 2010 May 11. Where the reasons why it is not like a pretest are given and the potential benefits are laid out. > > When using MAS5, it is traditional to use the Affy presence/absence score. > ?(Not necessarily optimal ...) > > --Naomi > > At 06:53 AM 7/7/2010, Yuan Hao wrote: >> >> Usually it can be filtered based on IQR and/or variance across >> samples, which might be worthy of thinking besides 'average' >> >> Yuan >> >> On 7 Jul 2010, at 11:47, Sean Davis wrote: >> >>> On Wed, Jul 7, 2010 at 6:28 AM, Hernando Mart?nez <hernybiotec at="" gmail.com="">>> >wrote: >>> >>>> Hi, I need to remove genes with low expression values from a >>>> expression >>>> matrix. I would like to remove those with an average of expression >>>> values >>>> less than a certain cut-off. I was thinking in computing the >>>> average for >>>> each row, create a list with the gene names for which their average >>>> is less >>>> than the cut-off, and remove those genes from the initial matrix. >>>> However, >>>> I >>>> have a couple of doubts that maybe you can help me with. Is there any >>>> package or function that makes this easier? And, does anyone know >>>> which >>>> cut-off to use for data normalized with RMA and for data normalized >>>> with >>>> MAS5? Thanks, >>>> >>> Take a look at the genefilter package. ?However, what you describe >>> can be >>> done easily with standard R. >>> >>> I don't think there is such a thing as a "standard" cutoff for >>> microarray >>> data. >>> >>> Sean >>> >>> ? ? ? ?[[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > Naomi S. Altman ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?814-865-3791 (voice) > Associate Professor > Dept. of Statistics ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?814-863-7114 (fax) > Penn State University ? ? ? ? ? ? ? ? ? ? ? ? 814-865-1348 (Statistics) > University Park, PA 16802-2111 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Robert Gentleman rgentlem at gmail.com
ADD REPLY
0
Entering edit mode
Hernando The article that Robert mentions below also demonstrates that for Affymetrix data, filtering by overall variance is preferable to filtering by average level when your goal is to detect differentially expressed genes. In a nutshell, you want a filter criterion that is independent of your test statistic *under the null*, but correlated under the alternatives. Wolfgang On Jul/8/10 7:32 PM, Robert Gentleman wrote: > Hi, > > > On Wed, Jul 7, 2010 at 6:22 AM, Naomi Altman<naomi at="" stat.psu.edu=""> wrote: >> Isn't filtering on spread like pretesting for differential expression? >> Maybe not such a good idea. > > No it isn't like pretesting and it is quite often a good idea. > Please have a look at: > > Independent filtering increases detection power for high-throughput experiments. > Bourgon R, Gentleman R, Huber W. > Proc Natl Acad Sci U S A. 2010 May 25;107(21):9546-51. Epub 2010 May 11. > Where the reasons why it is not like a pretest are given and the > potential benefits are laid out. > > > >> >> When using MAS5, it is traditional to use the Affy presence/absence score. >> (Not necessarily optimal ...) >> >> --Naomi >> >> At 06:53 AM 7/7/2010, Yuan Hao wrote: >>> >>> Usually it can be filtered based on IQR and/or variance across >>> samples, which might be worthy of thinking besides 'average' >>> >>> Yuan >>> >>> On 7 Jul 2010, at 11:47, Sean Davis wrote: >>> >>>> On Wed, Jul 7, 2010 at 6:28 AM, Hernando Mart?nez<hernybiotec at="" gmail.com="">>>>> wrote: >>>> >>>>> Hi, I need to remove genes with low expression values from a >>>>> expression >>>>> matrix. I would like to remove those with an average of expression >>>>> values >>>>> less than a certain cut-off. I was thinking in computing the >>>>> average for >>>>> each row, create a list with the gene names for which their average >>>>> is less >>>>> than the cut-off, and remove those genes from the initial matrix. >>>>> However, >>>>> I >>>>> have a couple of doubts that maybe you can help me with. Is there any >>>>> package or function that makes this easier? And, does anyone know >>>>> which >>>>> cut-off to use for data normalized with RMA and for data normalized >>>>> with >>>>> MAS5? Thanks, >>>>> >>>> Take a look at the genefilter package. However, what you describe >>>> can be >>>> done easily with standard R. >>>> >>>> I don't think there is such a thing as a "standard" cutoff for >>>> microarray >>>> data. >>>> >>>> Sean >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> Naomi S. Altman 814-865-3791 (voice) >> Associate Professor >> Dept. of Statistics 814-863-7114 (fax) >> Penn State University 814-865-1348 (Statistics) >> University Park, PA 16802-2111 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
ADD REPLY
0
Entering edit mode
@michael-imbeault-3593
Last seen 9.6 years ago
If you know that a group of genes are not expressed (for example, Y chromosome genes in a female sample, or cell-type / tissue specific genes), you can use the value of those genes to determine your cutoff. It's not really algorithm (MAS or RMA) dependent. Michael On 07/07/2010 6:28 AM, Hernando Martínez wrote: > Hi, I need to remove genes with low expression values from a expression > matrix. I would like to remove those with an average of expression values > less than a certain cut-off. I was thinking in computing the average for > each row, create a list with the gene names for which their average is less > than the cut-off, and remove those genes from the initial matrix. However, I > have a couple of doubts that maybe you can help me with. Is there any > package or function that makes this easier? And, does anyone know which > cut-off to use for data normalized with RMA and for data normalized with > MAS5? Thanks, > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Robert Castelo ★ 3.1k
@rcastelo
Last seen 21 hours ago
Barcelona/Universitat Pompeu Fabra
Hernando, if by removing genes with low expression values you mean removing genes that are not expressed or have an undetectable level of expression with the microarray instrument employed, the issue is not as easy as it may look like if you need to do it very accurately because different genes may behave differently due to complicated probe-specific effects. i would encourage you to take a look at this publication: Zilliox MJ, Irizarry RA A gene expression bar code for microarray data. Nat Methods. 2007 Nov;4(11):911-3. http://www.ncbi.nlm.nih.gov/pubmed/17906632 http://rafalab.jhsph.edu/barcode i'm not the authority about this, and the authors of the work could give you a more specific advice, but i'd say that the take-home message is that if you have a sufficiently large panel of biological replicates for a variety of cellular states, try to use them to find a gene-dependent threshold of expression under the assumption that only a small fraction of genes is expressed under a specific cellular state. cheers, robert. On Wed, 2010-07-07 at 11:28 +0100, Hernando Mart?nez wrote: > Hi, I need to remove genes with low expression values from a expression > matrix. I would like to remove those with an average of expression values > less than a certain cut-off. I was thinking in computing the average for > each row, create a list with the gene names for which their average is less > than the cut-off, and remove those genes from the initial matrix. However, I > have a couple of doubts that maybe you can help me with. Is there any > package or function that makes this easier? And, does anyone know which > cut-off to use for data normalized with RMA and for data normalized with > MAS5? Thanks, > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Hi there, I have a question on Affymetrix Yeast 2.0 arrays. These arrays contain about 5000 probesets for budding and fission yeast, respectively. We hybridized budding yeast sample and now I'd like to get rid of all the fission yeast probes before normalization. There is a mask file outlining which probesets belong to which organism. However, when I use ReadAffy(..., rm.mask=T) the number of probes does not change (s. below). Is there a way to specify a mask file during the affybatch import? Any hints are welcome. Mike > d1 <- ReadAffy(celfile.path=celfile, filenames=filenames, rm.mask=F) > d2 <- ReadAffy(celfile.path=celfile, filenames=filenames, rm.mask=T) > dim(pm(d1)) [1] 120855 5 > dim(pm(d2)) [1] 120855 5 > sessionInfo() R version 2.10.1 (2009-12-14) i386-pc-mingw32 locale: [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C [5] LC_TIME=German_Germany.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] yeast2cdf_2.5.0 affyQCReport_1.24.0 lattice_0.18-3 [4] RColorBrewer_1.0-2 affyPLM_1.22.0 preprocessCore_1.8.0 [7] xtable_1.5-6 simpleaffy_2.22.0 gcrma_2.18.1 [10] genefilter_1.28.2 affy_1.24.2 Biobase_2.6.1 loaded via a namespace (and not attached): [1] affyio_1.14.0 annotate_1.24.1 AnnotationDbi_1.8.2 [4] Biostrings_2.14.12 DBI_0.2-5 grid_2.10.1 [7] IRanges_1.4.16 RSQLite_0.9-1 splines_2.10.1 [10] survival_2.35-8 tools_2.10.1 > -- MFT?Services University?of?T?bingen Calwerstr.?7 72076??T?bingen/GERMANY Tel.:?+49 7071?29?83210 Fax.?+?49 7071?29?5228 web: www.mft-services.de Confidentiality?Note: This?message?is?intended?only?for?the?use?of?the?named?recipient(s)?an d?may contain?confidential?and/or?proprietary?information.?If?you?are?not?th e?intended recipient,?please?contact?the?sender?and?delete?the?message.?Any?unaut horized use?of?the?information?contained?in?this?message?is?prohibited. -----Urspr?ngliche Nachricht----- Von: Robert Castelo <robert.castelo at="" upf.edu=""> Gesendet: 08.07.2010 13:16:16 An: "Hernando Mart?nez" <hernybiotec at="" gmail.com=""> Betreff: Re: [BioC] removal of genes with low expression values >Hernando, > >if by removing genes with low expression values you mean removing genes >that are not expressed or have an undetectable level of expression with >the microarray instrument employed, the issue is not as easy as it may >look like if you need to do it very accurately because different genes >may behave differently due to complicated probe-specific effects. > >i would encourage you to take a look at this publication: > >Zilliox MJ, Irizarry RA >A gene expression bar code for microarray data. >Nat Methods. 2007 Nov;4(11):911-3. >http://www.ncbi.nlm.nih.gov/pubmed/17906632 >http://rafalab.jhsph.edu/barcode > > >i'm not the authority about this, and the authors of the work could give >you a more specific advice, but i'd say that the take-home message is >that if you have a sufficiently large panel of biological replicates for >a variety of cellular states, try to use them to find a gene- dependent >threshold of expression under the assumption that only a small fraction >of genes is expressed under a specific cellular state. > >cheers, >robert. > > > >On Wed, 2010-07-07 at 11:28 +0100, Hernando Mart?nez wrote: >> Hi, I need to remove genes with low expression values from a expression >> matrix. I would like to remove those with an average of expression values >> less than a certain cut-off. I was thinking in computing the average for >> each row, create a list with the gene names for which their average is less >> than the cut-off, and remove those genes from the initial matrix. However, I >> have a couple of doubts that maybe you can help me with. Is there any >> package or function that makes this easier? And, does anyone know which >> cut-off to use for data normalized with RMA and for data normalized with >> MAS5? Thanks, >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Hi Mike, On 7/8/2010 8:40 AM, Mike Walter wrote: > Hi there, > > I have a question on Affymetrix Yeast 2.0 arrays. These arrays > contain about 5000 probesets for budding and fission yeast, > respectively. We hybridized budding yeast sample and now I'd like to > get rid of all the fission yeast probes before normalization. There > is a mask file outlining which probesets belong to which organism. > However, when I use ReadAffy(..., rm.mask=T) the number of probes > does not change (s. below). Is there a way to specify a mask file > during the affybatch import? I assume you are using the mask file when scanning the chip? If so, the probes for fission yeast should all be NA, which your test below will not detect (having a bunch of NA probes won't affect the dimension of your pm matrices). Something like pms <- pm(d1, LISTRUE = TRUE) length(pms) - sum(sapply(pms, function(x) allis.na(x))) or now that I think about it, apply(pm(d1), 2, function(x) sum(!is.na(x))) should give you a measure of how many non NA probesets remain. Best, Jim > > Any hints are welcome. > > Mike > >> d1<- ReadAffy(celfile.path=celfile, filenames=filenames, >> rm.mask=F) d2<- ReadAffy(celfile.path=celfile, filenames=filenames, >> rm.mask=T) dim(pm(d1)) > [1] 120855 5 >> dim(pm(d2)) > [1] 120855 5 >> sessionInfo() > R version 2.10.1 (2009-12-14) i386-pc-mingw32 > > locale: [1] LC_COLLATE=German_Germany.1252 > LC_CTYPE=German_Germany.1252 [3] LC_MONETARY=German_Germany.1252 > LC_NUMERIC=C [5] LC_TIME=German_Germany.1252 > > attached base packages: [1] stats graphics grDevices utils > datasets methods base > > other attached packages: [1] yeast2cdf_2.5.0 affyQCReport_1.24.0 > lattice_0.18-3 [4] RColorBrewer_1.0-2 affyPLM_1.22.0 > preprocessCore_1.8.0 [7] xtable_1.5-6 simpleaffy_2.22.0 > gcrma_2.18.1 [10] genefilter_1.28.2 affy_1.24.2 > Biobase_2.6.1 > > loaded via a namespace (and not attached): [1] affyio_1.14.0 > annotate_1.24.1 AnnotationDbi_1.8.2 [4] Biostrings_2.14.12 > DBI_0.2-5 grid_2.10.1 [7] IRanges_1.4.16 RSQLite_0.9-1 > splines_2.10.1 [10] survival_2.35-8 tools_2.10.1 >> > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
ADD REPLY
0
Entering edit mode
Hi Jim, Unfortunately, we did not use the mask file when scanning the chip. So there is not a single NA in the PM values. So I will go back to the dat file and try to generate a new cel file. Thanks a lot and kind regards, Mike -----Urspr?ngliche Nachricht----- Von: "James W. MacDonald" <jmacdon at="" med.umich.edu=""> Gesendet: 08.07.2010 15:43:26 An: Mike Walter <michael_walter at="" email.de=""> Betreff: Re: [BioC] Yeast 2.0 Arrays >Hi Mike, > >On 7/8/2010 8:40 AM, Mike Walter wrote: >> Hi there, >> >> I have a question on Affymetrix Yeast 2.0 arrays. These arrays >> contain about 5000 probesets for budding and fission yeast, >> respectively. We hybridized budding yeast sample and now I'd like to >> get rid of all the fission yeast probes before normalization. There >> is a mask file outlining which probesets belong to which organism. >> However, when I use ReadAffy(..., rm.mask=T) the number of probes >> does not change (s. below). Is there a way to specify a mask file >> during the affybatch import? > >I assume you are using the mask file when scanning the chip? > >If so, the probes for fission yeast should all be NA, which your test >below will not detect (having a bunch of NA probes won't affect the >dimension of your pm matrices). Something like > >pms <- pm(d1, LISTRUE = TRUE) > >length(pms) - sum(sapply(pms, function(x) allis.na(x))) > >or now that I think about it, > >apply(pm(d1), 2, function(x) sum(!is.na(x))) > >should give you a measure of how many non NA probesets remain. > >Best, > >Jim > > >> >> Any hints are welcome. >> >> Mike >> >>> d1<- ReadAffy(celfile.path=celfile, filenames=filenames, >>> rm.mask=F) d2<- ReadAffy(celfile.path=celfile, filenames=filenames, >>> rm.mask=T) dim(pm(d1)) >> [1] 120855 5 >>> dim(pm(d2)) >> [1] 120855 5 >>> sessionInfo() >> R version 2.10.1 (2009-12-14) i386-pc-mingw32 >> >> locale: [1] LC_COLLATE=German_Germany.1252 >> LC_CTYPE=German_Germany.1252 [3] LC_MONETARY=German_Germany.1252 >> LC_NUMERIC=C [5] LC_TIME=German_Germany.1252 >> >> attached base packages: [1] stats graphics grDevices utils >> datasets methods base >> >> other attached packages: [1] yeast2cdf_2.5.0 affyQCReport_1.24.0 >> lattice_0.18-3 [4] RColorBrewer_1.0-2 affyPLM_1.22.0 >> preprocessCore_1.8.0 [7] xtable_1.5-6 simpleaffy_2.22.0 >> gcrma_2.18.1 [10] genefilter_1.28.2 affy_1.24.2 >> Biobase_2.6.1 >> >> loaded via a namespace (and not attached): [1] affyio_1.14.0 >> annotate_1.24.1 AnnotationDbi_1.8.2 [4] Biostrings_2.14.12 >> DBI_0.2-5 grid_2.10.1 [7] IRanges_1.4.16 RSQLite_0.9-1 >> splines_2.10.1 [10] survival_2.35-8 tools_2.10.1 >>> >> > >-- >James W. MacDonald, M.S. >Biostatistician >Douglas Lab >University of Michigan >Department of Human Genetics >5912 Buhl >1241 E. Catherine St. >Ann Arbor MI 48109-5618 >734-615-7826 >********************************************************** >Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
ADD REPLY
0
Entering edit mode
Hi Jim, I thought of two workarounds to mask the S pombe probes with the code attached below. However, in both attempt R crashes, when I call the rma() function either with the NAs introduced in the Affybatch or the subset option. I guess this would also happen, when I use the mask file to generate the cel file and use rm.mask=T in the ReadAffy() call? Do you (or anyone else) have any explanation for the breakdown of R? Btw, mas5() also gives an error on option 1 due to NAs. Kind regards, Mike Code for 1st Option: library(affy) celfile = "X:\\affy\\array\\2010\\E10R027" filenames = list.files(path=celfile) filenames = filenames[grep(".CEL", filenames)] filenames d1 = ReadAffy(celfile.path=celfile, filenames=filenames) msk=read.table("F:/Auswertung/Array Annotationen/Affy/s_cerevisiae.msk", header=F, skip=2, row.names=1) pombe = c(as.vector(unlist(pmindex(d1)[rownames(msk)])), as.vector(unlist(mmindex(d1)[rownames(msk)]))) exprs(d1)[pombe,]=NA apply(pm(d1), 2, function(x) sum(!is.na(x))) #drops from 120855 to 65552 features d1n = rma(d1) #R crashes completely with following error information: AppName: rgui.exe AppVer: 2.100.50208.0 ModName: preprocesscore.dll ModVer: 0.0.0.0?? ? Offset: 00017253 Code for 2nd Option: d2 = ReadAffy(celfile.path=celfile, filenames=filenames) msk.sp=read.table("F:/Auswertung/Array Annotationen/Affy/s_pombe.msk", header=F, skip=2, row.names=1) head(msk.sp) d2n = rma(d2, subset=rownames(msk.sp)) #again R crashes completely!!! AppName: rgui.exe AppVer: 2.101.50720.0 ModName: preprocesscore.dll ModVer: 0.0.0.0 Offset: 00017253 > sessionInfo() R version 2.10.1 (2009-12-14) i386-pc-mingw32 locale: [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C [5] LC_TIME=German_Germany.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] yeast2cdf_2.5.0 affy_1.24.2 Biobase_2.6.1 loaded via a namespace (and not attached): [1] affyio_1.14.0 preprocessCore_1.8.0 tools_2.10.1 -- MFT?Services University?of?T?bingen Calwerstr.?7 72076??T?bingen/GERMANY Tel.:?+49 7071?29?83210 Fax.?+?49 7071?29?5228 web: www.mft-services.de Confidentiality?Note: This?message?is?intended?only?for?the?use?of?the?named?recipient(s)?an d?may contain?confidential?and/or?proprietary?information.?If?you?are?not?th e?intended recipient,?please?contact?the?sender?and?delete?the?message.?Any?unaut horized use?of?the?information?contained?in?this?message?is?prohibited. -----Urspr?ngliche Nachricht----- Von: "James W. MacDonald" <jmacdon at="" med.umich.edu=""> Gesendet: 08.07.2010 15:43:26 An: Mike Walter <michael_walter at="" email.de=""> Betreff: Re: [BioC] Yeast 2.0 Arrays >Hi Mike, > >On 7/8/2010 8:40 AM, Mike Walter wrote: >> Hi there, >> >> I have a question on Affymetrix Yeast 2.0 arrays. These arrays >> contain about 5000 probesets for budding and fission yeast, >> respectively. We hybridized budding yeast sample and now I'd like to >> get rid of all the fission yeast probes before normalization. There >> is a mask file outlining which probesets belong to which organism. >> However, when I use ReadAffy(..., rm.mask=T) the number of probes >> does not change (s. below). Is there a way to specify a mask file >> during the affybatch import? > >I assume you are using the mask file when scanning the chip? > >If so, the probes for fission yeast should all be NA, which your test >below will not detect (having a bunch of NA probes won't affect the >dimension of your pm matrices). Something like > >pms <- pm(d1, LISTRUE = TRUE) > >length(pms) - sum(sapply(pms, function(x) allis.na(x))) > >or now that I think about it, > >apply(pm(d1), 2, function(x) sum(!is.na(x))) > >should give you a measure of how many non NA probesets remain. > >Best, > >Jim > > >> >> Any hints are welcome. >> >> Mike >> >>> d1<- ReadAffy(celfile.path=celfile, filenames=filenames, >>> rm.mask=F) d2<- ReadAffy(celfile.path=celfile, filenames=filenames, >>> rm.mask=T) dim(pm(d1)) >> [1] 120855 5 >>> dim(pm(d2)) >> [1] 120855 5 >>> sessionInfo() >> R version 2.10.1 (2009-12-14) i386-pc-mingw32 >> >> locale: [1] LC_COLLATE=German_Germany.1252 >> LC_CTYPE=German_Germany.1252 [3] LC_MONETARY=German_Germany.1252 >> LC_NUMERIC=C [5] LC_TIME=German_Germany.1252 >> >> attached base packages: [1] stats graphics grDevices utils >> datasets methods base >> >> other attached packages: [1] yeast2cdf_2.5.0 affyQCReport_1.24.0 >> lattice_0.18-3 [4] RColorBrewer_1.0-2 affyPLM_1.22.0 >> preprocessCore_1.8.0 [7] xtable_1.5-6 simpleaffy_2.22.0 >> gcrma_2.18.1 [10] genefilter_1.28.2 affy_1.24.2 >> Biobase_2.6.1 >> >> loaded via a namespace (and not attached): [1] affyio_1.14.0 >> annotate_1.24.1 AnnotationDbi_1.8.2 [4] Biostrings_2.14.12 >> DBI_0.2-5 grid_2.10.1 [7] IRanges_1.4.16 RSQLite_0.9-1 >> splines_2.10.1 [10] survival_2.35-8 tools_2.10.1 >>> >> > >-- >James W. MacDonald, M.S. >Biostatistician >Douglas Lab >University of Michigan >Department of Human Genetics >5912 Buhl >1241 E. Catherine St. >Ann Arbor MI 48109-5618 >734-615-7826 >********************************************************** >Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
ADD REPLY
0
Entering edit mode
Hi Mike, On 7/9/2010 8:20 AM, Mike Walter wrote: > Hi Jim, > > I thought of two workarounds to mask the S pombe probes with the code > attached below. However, in both attempt R crashes, when I call the > rma() function either with the NAs introduced in the Affybatch or the > subset option. I guess this would also happen, when I use the mask > file to generate the cel file and use rm.mask=T in the ReadAffy() > call? Do you (or anyone else) have any explanation for the breakdown > of R? Btw, mas5() also gives an error on option 1 due to NAs. > > Kind regards, > > Mike > > Code for 1st Option: > > library(affy) celfile = "X:\\affy\\array\\2010\\E10R027" filenames = > list.files(path=celfile) filenames = filenames[grep(".CEL", > filenames)] filenames > > d1 = ReadAffy(celfile.path=celfile, filenames=filenames) > msk=read.table("F:/Auswertung/Array > Annotationen/Affy/s_cerevisiae.msk", header=F, skip=2, row.names=1) > pombe = c(as.vector(unlist(pmindex(d1)[rownames(msk)])), > as.vector(unlist(mmindex(d1)[rownames(msk)]))) exprs(d1)[pombe,]=NA > apply(pm(d1), 2, function(x) sum(!is.na(x))) #drops from 120855 to > 65552 features d1n = rma(d1) > > #R crashes completely with following error information: > > AppName: rgui.exe AppVer: 2.100.50208.0 ModName: > preprocesscore.dll ModVer: 0.0.0.0 Offset: 00017253 Yeah, that won't work. You can't have NA values, which is why the rm.mask argument exists for ReadAffy, to remove the NA values prior to running rma(). > > Code for 2nd Option: > > d2 = ReadAffy(celfile.path=celfile, filenames=filenames) > msk.sp=read.table("F:/Auswertung/Array > Annotationen/Affy/s_pombe.msk", header=F, skip=2, row.names=1) > head(msk.sp) > > d2n = rma(d2, subset=rownames(msk.sp)) I wonder if some of the things in the mask file don't actually exist on the chip? If I try this using the Dilution data, there are no problems: > rma(Dilution, subset = featureNames(Dilution)[1:250]) Background correcting Normalizing Calculating Expression ExpressionSet (storageMode: lockedEnvironment) assayData: 250 features, 4 samples element names: exprs protocolData: none phenoData sampleNames: 20A, 20B, 10A, 10B varLabels and varMetadata description: liver: amount of liver RNA hybridized to array in micrograms sn19: amount of central nervous system RNA hybridized to array in micrograms scanner: ID number of scanner used featureData: none experimentData: use 'experimentData(object)' Annotation: hgu95av2 However, if I add some random fake probeset ID, I get a segfault as well: > rma(Dilution, subset = c(featureNames(Dilution)[1:250], "10432_s_at")) Background correcting Process C:/R-patched/bin/rterm.exe exited abnormally with code 5 at Fri Jul 09 10:40:12 2010 So I would first check that all the row names of your msk.sp data.frame are actually featureNames of your AffyBatch: all(rownames(msk.sp) %in% featureNames(d2)) and if not, subset the offending rownames first. Best, Jim > > #again R crashes completely!!! > > AppName: rgui.exe AppVer: 2.101.50720.0 ModName: > preprocesscore.dll ModVer: 0.0.0.0 Offset: 00017253 > >> sessionInfo() > R version 2.10.1 (2009-12-14) i386-pc-mingw32 > > locale: [1] LC_COLLATE=German_Germany.1252 > LC_CTYPE=German_Germany.1252 [3] LC_MONETARY=German_Germany.1252 > LC_NUMERIC=C [5] LC_TIME=German_Germany.1252 > > attached base packages: [1] stats graphics grDevices utils > datasets methods base > > other attached packages: [1] yeast2cdf_2.5.0 affy_1.24.2 > Biobase_2.6.1 > > loaded via a namespace (and not attached): [1] affyio_1.14.0 > preprocessCore_1.8.0 tools_2.10.1 > > > > > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
ADD REPLY
0
Entering edit mode
Hi Jim, >I wonder if some of the things in the mask file don't actually exist on >the chip? If I try this using the Dilution data, there are no problems: > > > rma(Dilution, subset = featureNames(Dilution)[1:250]) >Background correcting >Normalizing >Calculating Expression >ExpressionSet (storageMode: lockedEnvironment) >assayData: 250 features, 4 samples > element names: exprs >protocolData: none >phenoData > sampleNames: 20A, 20B, 10A, 10B > varLabels and varMetadata description: > liver: amount of liver RNA hybridized to array in micrograms > sn19: amount of central nervous system RNA hybridized to array in >micrograms > > scanner: ID number of scanner used >featureData: none >experimentData: use 'experimentData(object)' >Annotation: hgu95av2 > >However, if I add some random fake probeset ID, I get a segfault as well: > > > rma(Dilution, subset = c(featureNames(Dilution)[1:250], "10432_s_at")) >Background correcting > >Process C:/R-patched/bin/rterm.exe exited abnormally with code 5 at Fri >Jul 09 10:40:12 2010 > >So I would first check that all the row names of your msk.sp data.frame >are actually featureNames of your AffyBatch: > >all(rownames(msk.sp) %in% featureNames(d2)) > >and if not, subset the offending rownames first. > >Best, > >Jim You are perfectly right. There is one probeset in the S. pombe mask file that does not exist in the affybatch feature names. By excluding this I can run RMA on the S. cerevisiae probes only. Thanks a lot and best wishes, Mike
ADD REPLY

Login before adding your answer.

Traffic: 835 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6