Non-Specific Filtering with "nsFilter" Question

zeynep özkeserli ▴ 160

@zeynep-ozkeserli-5250

Turkey

Hi All,

I am trying to apply non-specific filtering to Affymetrix GeneChip hgu133 plus2 data.

Since it has been shown that there are multiple probe sets mapping to the same gene in Affymetrix GeneChips (ref: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1784106/), I thought it was necessary to filter those. So I decided to use nsFilter {genefilter}.

First I preprocessed the data, obtained an ExpressionSet object, and then set my criteria as suggested in the example for nsFilter:

- used require.entrez=TRUE, which filters out features without Entrez Gene IDs.
- used remove.dupEntrez=TRUE, which filters features mapping to the same Entrez Gene ID. (I turned off the variance filter to see how many would be removed because of mapping to the same Entrez Gene ID.)

And,

- first filter removed 13009 features
- second filter removed 21629 features.

"Feature" here being genes, because this filter is under genefilter, which filters genes :). Am I wrong?

And here are my questions:

- If I did not perform the filtering wrongly, is it possible that there are this many duplicates? Or is it really too many? The hgu133 arrays data sheet says: "Analyzes the relative expression level of more than 47,000 transcripts and variants, including more than 38,500 well characterized genes and UniGenes." (ref: http://media.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf)
- Can anybody suggest a mind-map to follow while performing non-specific filtering? I think this must be done very carefully.

And another question regarding the filtering process.

To my understanding, we should not use features mapping to the same Entrez Gene ID, because they represent non-specific hybridization and thus give exaggerated signal intensities. So, does it affect preprocessing? If it does, is it meaningful to filter them out after the preprocessing step? Or am I doing it wrong from the first step? Should this filtering be done before the preprocessing?

I am a little puzzled here, so any help would be appreciated.

Thank you,

Zeynep Ozkeserli
Ankara University Biotechnology Institute
Genomics Unit
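For reference, the filtering described above might be sketched roughly as follows. This is only an illustration, not the code that was actually run: the wrapper name `count_nsfilter_removals` and the object `eset` are hypothetical, and it assumes `eset` is an ExpressionSet whose annotation slot names an installed chip annotation package (for example "hgu133plus2", with hgu133plus2.db installed).

```r
# Sketch only: `eset` is assumed to be an ExpressionSet produced by the
# preprocessing step, annotated with an installed chip package.
count_nsfilter_removals <- function(eset) {
  res <- genefilter::nsFilter(
    eset,
    require.entrez   = TRUE,   # drop probesets without an Entrez Gene ID
    remove.dupEntrez = TRUE,   # keep one probeset per Entrez Gene ID
    var.filter       = FALSE   # variance filter off, as in the question
  )
  # nsFilter returns the filtered ExpressionSet plus a log of how many
  # features each filtering step removed.
  list(filtered = res$eset, log = res$filter.log)
}
```

Calling `count_nsfilter_removals(eset)$log` should then show the per-step counts (entries such as `numRemoved.ENTREZID` and `numDupsRemoved`, if I recall the names correctly). Note that nsFilter also drops probesets matching `feature.exclude` (by default the "^AFFX" controls), so the counts are per probeset, not per gene.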

probe genefilter PROcess • 1.7k views

updated 11.9 years ago by James W. MacDonald 65k • written 11.9 years ago by zeynep özkeserli ▴ 160

James W. MacDonald 65k

@james-w-macdonald-5106

United States

Hi Zeynep,

On 6/20/2012 5:18 AM, zeynep özkeserli wrote:
> Hi All,
>
> I am trying to apply non-specific filtering to Affymetrix GeneChip hgu133
> plus2 data.
>
> Since it has been shown that there are multiple probe sets mapping to the
> same gene in Affymetrix GeneChips (ref:
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1784106/), I thought it was
> necessary to filter those. So I decided to use nsFilter {genefilter}.
>
> First I preprocessed the data, obtained an ExpressionSet object, and then
> set my criteria as suggested in the example for nsFilter:
>
> - used require.entrez=TRUE, which filters out features without Entrez Gene
> IDs.
> - used remove.dupEntrez=TRUE, which filters features mapping to the same
> Entrez Gene ID. (I turned off the variance filter to see how many would be
> removed because of mapping to the same Entrez Gene ID.)
>
> And,
>
> - first filter removed 13009 features
> - second filter removed 21629 features.
>
> "Feature" here being genes, because this filter is under genefilter, which
> filters genes :). Am I wrong?

Well, you are sort of wrong. In this context, feature means probeset, and each probeset is designed to interrogate either a gene transcript or a putative gene transcript.

> And here are my questions:
>
> - If I did not perform the filtering wrongly, is it possible that there are
> this many duplicates? Or is it really too many? The hgu133 arrays data
> sheet says: "Analyzes the relative expression level of more than 47,000
> transcripts and variants, including more than 38,500 well characterized
> genes and UniGenes."
> (ref:
> http://media.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf)

There is no telling if you did it right or wrong, as you neglected to show us your code. What you did and what you think you did may actually be different things.
I can tell you this:

    > library(hgu133plus2.db)
    > length(unique(Rkeys(hgu133plus2ENTREZID)))
    [1] 42094

So there are 42,094 unique Entrez Gene IDs represented on this array. Note carefully that Affy states '47,000 transcripts and variants', so they include transcript variants in that count, and these transcript variants will by definition have the same Entrez Gene ID.

> - Can anybody suggest a mind-map to follow while performing non-specific
> filtering? I think this must be done very carefully.

Agreed. I have never personally been fond of non-specific filtering, as to my mind it is a fairly blunt ax where a scalpel is required. Additionally, it is intended to 'fix' problems that I am not sure are either fixable or even exist.

For instance, removing duplicated genes assumes that any features with the same Entrez Gene ID are by definition intended to measure the same thing. If there were no transcript variants this would be true. But there are transcript variants, so you end up removing things that may well be measuring different things. Not much of a fix IMO.

In addition, one rationale for filtering genes is to reduce the number of multiple comparisons. This makes sense to a certain extent if you are simply computing a statistic of some sort and then ranking genes in a univariate manner. I say to a certain extent because things like FDR are monotonic transforms - you aren't changing the order, just moving the cutoff between 'interesting' and 'uninteresting'. That's sort of passé these days - instead of looking for individual genes, we have moved on to looking for perturbed pathways or gene sets, and for that I think removing data is a hindrance, not a help.

> And another question regarding the filtering process.
>
> To my understanding, we should not use features mapping to the same Entrez
> Gene ID, because they represent non-specific hybridization and thus give
> exaggerated signal intensities. So, does it affect preprocessing? If it
> does, is it meaningful to filter them out after the preprocessing step? Or
> am I doing it wrong from the first step? Should this filtering be done
> before the preprocessing?

I'm not sure where you got that idea, but I think it is wrong. Why would having more than one feature that purports to measure transcript from the same gene represent non-specific hybridization? It might represent duplicate measurement of the same thing, which would be bad because you are increasing the number of comparisons without actually comparing more things.

You might be talking about features that might measure more than one transcript, and these may well exist. In fact, the probeset IDs are supposed to alert you to this possibility:

http://www.affymetrix.com/support/help/faqs/hgu133_2/faq_7.jsp

The short version of that FAQ is that _a_at indicates the probeset may bind to multiple transcripts of the same gene, _s_at indicates that the probeset may bind to multiple transcripts from the same gene family, and _x_at indicates that the probeset may bind to multiple transcripts from unrelated genes.

For that you can either take these probesets with a grain of salt, or you might look at the MBNI remapped CDFs, which attempt to remove probes that behave poorly.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
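As a rough illustration of the suffix convention summarized above, probeset IDs can be tagged with a few regular expressions in base R. The function name `classify_probeset` and the category labels are made up for this sketch, and the rules only cover the three suffixes mentioned in the answer plus the AFFX control prefix; other suffixes are simply left as "unflagged".

```r
# Tag Affymetrix probeset IDs by the suffixes described in the FAQ.
# Labels are illustrative; anything without a listed suffix is "unflagged".
classify_probeset <- function(ids) {
  out <- rep("unflagged", length(ids))
  out[grepl("_a_at$", ids)] <- "same_gene"        # transcripts of one gene
  out[grepl("_s_at$", ids)] <- "same_gene_family" # transcripts in a family
  out[grepl("_x_at$", ids)] <- "unrelated_genes"  # possible cross-hybridization
  out[grepl("^AFFX", ids)]  <- "control"          # control probesets, set last
  out
}

# A handful of example IDs, used here only to exercise the function.
ids <- c("1007_s_at", "1053_at", "1552256_a_at", "200801_x_at", "AFFX-BioB-5_at")
flags <- classify_probeset(ids)
table(flags)
```

With an ExpressionSet in hand, something like `table(classify_probeset(featureNames(eset)))` would give a quick count of how many probesets carry each flag before deciding how much salt to take them with.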

written 11.9 years ago by James W. MacDonald 65k

Hi James,

Thank you for your detailed answer, which cleared up all the black holes I had on this subject.

In fact, the problem started with the control probes. When I performed a limma analysis without any filters, the control probes were at the top of the differentially expressed gene list. I couldn't find out why; it didn't seem to be an experimental defect (I concluded this from the QC reports). So while I was trying to find a solution for this, I also started to think about filtering to reduce the number of multiple comparisons (and my misunderstandings about probe design suddenly popped out; sorry for some of the unnecessary questions). Do you have any idea why control probes would appear to be significantly differentially expressed? Is it logical to just remove them?

And about getting rid of the "passé" analysis pipeline: does the search for interesting pathways start after deciding on an "important" gene set, or is it another approach that seeks those sets in the whole data set in a different manner? Can you please recommend any papers where I could learn this approach?

Thanks again for your help and comments. Very much appreciated.

Zeynep

On Wed, Jun 20, 2012 at 5:46 PM, James W. MacDonald <jmacdon@uw.edu> wrote:
> [...]

written 11.9 years ago by zeynep özkeserli ▴ 160

Hi Zeynep,

On 6/20/2012 11:47 AM, zeynep özkeserli wrote:
> Hi James,
>
> Thank you for your detailed answer, which cleared up all the black holes
> I had on this subject.
>
> In fact, the problem started with the control probes. When I performed a
> limma analysis without any filters, the control probes were at the top of
> the differentially expressed gene list. I couldn't find out why; it didn't
> seem to be an experimental defect (I concluded this from the QC reports).
> So while I was trying to find a solution for this, I also started to think
> about filtering to reduce the number of multiple comparisons (and my
> misunderstandings about probe design suddenly popped out; sorry for some
> of the unnecessary questions). Do you have any idea why control probes
> would appear to be significantly differentially expressed? Is it logical
> to just remove them?

Ugh. I hate when that happens. So, it depends on what you mean by control probes, as there are various types. If you are talking about the beta-actin or other 'housekeeping' genes, then it isn't clear to me if this is a problem or not. The general assumption is that these genes are constitutively up-regulated and never vary. But I have always wondered about that. It's sort of like the 'no two snowflakes are alike' hypothesis - in general circulation, but by definition untestable. So housekeeping genes make me wonder, but don't really cause much teeth gnashing.

The same is true for the 'normalizing control set' of 100 probesets that Affy claims are not differentially expressed in different tissues. I think that really depends. I had one study back in the day where they were comparing normal C. elegans to C. elegans that had some deadly mutation, and something like 95% of the genes were differentially expressed. It was just ridiculous. But the point to me was that you can't know if a gene or set of genes is never affected - it is too context dependent.
That said, I would recommend ensuring that everything is OK. I don't know what you mean by QC Reports - perhaps you used the affyQCReport package, or arrayQualityMetrics? I would certainly run these data through one of those packages. I would also do things like PCA plots of the expression values, and maybe image plots that you can generate using the affyPLM package.

Now if you have things like the Poly-A controls or the hybridization controls popping up, then you may have a real problem, as those are spiked in during the processing. This could indicate big technical variability between batches that may not be resolvable.

> And about getting rid of the "passé" analysis pipeline: does the search
> for interesting pathways start after deciding on an "important" gene set,
> or is it another approach that seeks those sets in the whole data set in
> a different manner? Can you please recommend any papers where I could
> learn this approach?

Well, the general idea started with Gene Ontology analyses where you take the 'top' genes, based on a cutoff, and try to find GO terms that are over- or under-represented in the set of significant genes. The underlying weakness there is that you are relying on a cutoff, which can be fairly arbitrarily set.

Another way to think about it is to just take your ranked list of genes (all genes on the chip, ranked by some statistic), and then see if a certain group of genes (where 'group' is defined as an existing gene set that somebody else already found, or a set of genes in a GO category, or what have you) is 'higher up' in the ranked list than would be expected by chance. For this approach you really need to filter down to a set of unique genes, but in general I don't think you filter further.

I'm no expert on the literature, but I think one of the seminal papers is by Tian:

http://www.pnas.org/content/102/38/13544.short

There are also several out of Robert Gentleman's group that I have found helpful. Do a Google Scholar search for "gsea gentleman", and they will be near the top.

Best,

Jim
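To make the ranked-list idea above concrete, here is a minimal base R sketch. It is not GSEA and not the method from the Tian paper; it simply uses a Wilcoxon rank-sum test to ask whether the statistics for a gene set sit higher than those for the rest of the genes. All data here are simulated, and the names `stat`, `gene_set` and `gene_set_rank_test` are invented for the example.

```r
# Toy data: per-gene statistics for 1000 hypothetical genes (simulated,
# not real results), e.g. moderated t-statistics from a linear model.
set.seed(1)
stat <- setNames(rnorm(1000), paste0("gene", 1:1000))

# A hypothetical predefined gene set; shift it so there is signal to find.
gene_set <- paste0("gene", 1:30)
stat[gene_set] <- stat[gene_set] + 1.5

# Is the set 'higher up' in the ranking than the remaining genes?
gene_set_rank_test <- function(stat, gene_set) {
  in_set <- names(stat) %in% gene_set
  wilcox.test(stat[in_set], stat[!in_set], alternative = "greater")$p.value
}

p <- gene_set_rank_test(stat, gene_set)
p
```

Because the test uses every gene's statistic, no significance cutoff is needed, which is the point made above. It does assume one statistic per gene, which is why filtering down to unique genes first matters for this kind of analysis.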

written 11.9 years ago by James W. MacDonald 65k
