Filtering out duplicate probes in Affy data
@hsharm03studentspolyedu-5225
Last seen 10.2 years ago
Dear List,

I have a set of mouse Affy data. The platform is the Affy mouse 430a2 chip. There are 8 samples, 4 for each condition. I normalized the data using rma. The array has 22090 probes originally.

Then, in order to filter out the genes which have no Entrez ID or are duplicates for the same gene, I used the following command:

    filter <- nsFilter(eset1, require.entrez=T, remove.dupEntrez=T, var.func=IQR, var.filter=T)

This leaves me with 6579 genes after filtering. I think I lose many of the genes here. Is there a better way to do the same?

The other problem I am facing is that after this step, I create an expression matrix of the remaining 6579 probes. Then, in order to annotate them, I use the library mouse4302.db. I select the ids from my list and use the following command:

    Symbol <- mouse4302SYMBOL[ids]

This gives me a smaller number of probes and genes, so I lose more data here. For example, I am interested in the gene DCK. I checked the original Affymetrix annotation file and there are 3 probes present for this gene, which means it should have annotation. But I do not find it in the final dataset.

Can anyone suggest a better method, or any corrections to the approach I am using? I eventually need to merge this data with other Affy data and check the expression values, but I figured out that I am not getting the right amount of genes.

Any help is much appreciated. I am a newbie to R and Bioconductor, so I am sorry if it is a basic question. Thank you all in advance for your help.

Thanks,
Himanshu
Annotation mouse4302 annotate affy
@james-w-macdonald-5106
Last seen 10 hours ago
United States
Hi Himanshu,

On 1/11/2013 4:57 PM, Himanshu Sharma wrote:
> Then, in order to filter out the genes which have no entrez id, are duplicates for the same gene, I used the following command.
>
> filter <- nsFilter(eset1,require.entrez=T,remove.dupEntrez=T,var.func=IQR,var.filter=T)

You are filtering on three things here. First you require that all probesets have an Entrez Gene ID, then you remove any duplicates, then you require that the interquartile range of the remaining data be greater than 0.5.

This is one way of doing things. Depending on your goals, there may be better or worse things you could do. If, for instance, you don't want to lose DCK regardless of possible low variation, you could skip the filter on variation.

But 'better' is a subjective term, and you are the only one who can decide what is better or worse in your particular situation.

> I eventually need to merge this data with other data from affy and check for the expression values but, i figured out that I am not getting the right amount of genes.

There is no such thing as 'the right amount of genes'. There are only assumptions and tradeoffs. You can make the assumption that genes with an IQR < 0.5 are not really changing enough to consider, and then filter them out. Or you can assume that smaller variation is still biologically meaningful, and reduce the IQR cutoff or eliminate it entirely. Or you can assume that duplicated genes on the Affy Mouse 430 chip are really measuring different splice variants or some such, and that you want to keep them all in the data set.

All these assumptions have tradeoffs, including the possibility that you are wrong and you are polluting your dataset with noise, or unnecessarily increasing the multiplicity of your comparisons. But in the end it is up to the analyst to decide what assumptions are to be made, and to be prepared to defend those assumptions to those higher up (your PI, your funding source, journal reviewers, whomever).

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
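The three filtering steps described above can be sketched on toy data in base R. This is only an illustration under stated assumptions, not genefilter's actual implementation: the matrix `expr`, the vector `entrez`, and the helper `filter_sketch` are all hypothetical, and duplicates are resolved here by keeping the probeset with the largest IQR. One assumption to flag: the sketch applies the cutoff as a quantile of the IQR values, which as far as I know is nsFilter's default behaviour (var.cutoff = 0.5 with filterByQuantile = TRUE) rather than an absolute IQR threshold; ?nsFilter for your installed version is the authority.

```r
# Toy sketch of the three filtering steps (NOT genefilter's implementation).
# All names here (expr, entrez, filter_sketch) are hypothetical.
set.seed(1)
expr <- matrix(rnorm(10 * 8), nrow = 10,
               dimnames = list(paste0("probe", 1:10), paste0("s", 1:8)))

# Hypothetical Entrez IDs: NA = unannotated, repeated IDs = duplicate probesets
entrez <- c("11", "11", "22", NA, "33", "33", "33", "44", NA, "55")
names(entrez) <- rownames(expr)

filter_sketch <- function(expr, entrez, var.filter = TRUE, cutoff = 0.5) {
  iqr <- apply(expr, 1, IQR)

  # Step 1: require an Entrez Gene ID
  keep <- names(entrez)[!is.na(entrez)]

  # Step 2: one probeset per Entrez ID, keeping the one with the largest IQR
  best <- tapply(keep, entrez[keep], function(p) p[which.max(iqr[p])])
  keep <- as.vector(best)

  # Step 3 (optional): drop probesets whose IQR is not above the given quantile
  if (var.filter) {
    keep <- keep[iqr[keep] > quantile(iqr[keep], cutoff)]
  }
  expr[keep, , drop = FALSE]
}

with_var    <- filter_sketch(expr, entrez)                      # all three steps
without_var <- filter_sketch(expr, entrez, var.filter = FALSE)  # steps 1 and 2 only
```

Calling the helper with `var.filter = FALSE` keeps one probeset per annotated gene regardless of variation, which corresponds to setting `var.filter=F` in the nsFilter call from the question. Comparing `nrow(with_var)` and `nrow(without_var)` shows how much of the loss comes from the variance filter alone.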
Thanks a lot James. I really appreciate your help. Also, when I annotate the ids, shouldn't there be the same number of probes as after filtering? How do I lose more when I annotate them?

Thanks,
Himanshu
Hi Himanshu,

On 1/12/2013 10:27 AM, Himanshu Sharma wrote:
> Thanks a lot James. I really appreciate your help. Also, when I annotate the ids, there should be equal number of probes as after filtering? How do I lose more when I annotate them?

Two possible reasons. First, having an Entrez Gene ID doesn't necessarily imply having a gene symbol. Second, and more likely, the moe430a.db package masks all probe -> symbol mappings where there are multiple symbols.

But exposing multiple probe -> symbol mappings adds an additional level of complexity. As an example:

    > library(moe430a.db)
    > x <- toggleProbes(moe430aSYMBOL, "all")
    > x <- as.list(x)
    > x[sapply(x, length) > 1][1:10]
    $`1415716_a_at`
    [1] "Rps27" "Gm9846"

    $`1415763_a_at`
    [1] "Tmem234" "LOC100505293"

    $`1415781_a_at`
    [1] "Sumo2" "Gm13430"

    $`1415788_at`
    [1] "Gm12663" "Ublcp1"

    $`1415789_a_at`
    [1] "Gm12663" "Ublcp1"

    $`1415790_at`
    [1] "Gm12663" "Ublcp1"

    $`1415825_s_at`
    [1] "Slc38a10" "Csnk1d"

    $`1415875_at`
    [1] "Fam60a" "3010003L21Rik"

    $`1415895_at`
    [1] "Snrpn" "Snurf"

    $`1415896_x_at`
    [1] "Snrpn" "Snurf"

So now which symbol do you use? The first one? Is 1415896_x_at Snrpn or Snurf? Are these symbols synonyms? Without checking each one you are left with ad hoc decisions. You could just use the first one, but then you choose Gm12663 over Ublcp1 for probeset 1415790_at, which looks like the opposite of what you should be doing.

As they say, ignorance is bliss. The more you know about this stuff, the messier it gets, and the less clear it is what the 'right' thing to do might be. Or maybe that should be the 'right' thing to do in light of limited time to spend chasing your tail.

Best,

Jim
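The effect of hidden multi-symbol probesets on the annotated count can be illustrated with a toy mapping in base R. This is a sketch only: the list `sym` is hypothetical and stands in for what `as.list()` returns after `toggleProbes(..., "all")`. Three of its entries are copied from the output above; `probeA`, `probeB`, `probeC`, `GeneA` and `GeneB` are made-up placeholders, and the `"; "` separator is an arbitrary choice.

```r
# Toy probe -> symbol mapping. probeA/probeB/probeC and GeneA/GeneB are
# hypothetical placeholders; the other three entries come from the thread.
sym <- list(
  `1415716_a_at` = c("Rps27", "Gm9846"),
  `1415790_at`   = c("Gm12663", "Ublcp1"),
  `1415896_x_at` = c("Snrpn", "Snurf"),
  probeA         = "GeneA",
  probeB         = "GeneB",
  probeC         = NA_character_   # annotated probeset with no symbol at all
)

# Number of non-missing symbols per probeset
n_sym <- sapply(sym, function(s) sum(!is.na(s)))

# Single-symbol view: only probesets with exactly one symbol stay annotated
single <- unlist(sym[n_sym == 1])

# Ad hoc option 1: take the first symbol (arbitrary, may pick the wrong gene)
first <- sapply(sym[n_sym >= 1], `[`, 1)

# Ad hoc option 2: keep every symbol, collapsed into one string per probeset
all_syms <- sapply(sym[n_sym >= 1], paste, collapse = "; ")
```

In this toy case six probesets go in but only two come out annotated under the single-symbol view, which is the kind of drop described in the question. Either ad hoc option recovers five of them, at the cost of an arbitrary label (`first`) or an ambiguous one (`all_syms`).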