Another question about normalization of data
@gustavo-fernandez-bayon-5300
Last seen 8.9 years ago
Spain
Hi everybody.

Not so long ago, I asked on this list about some normalization issues. The question and its very interesting replies, from which I have learned a lot, can be found here:

http://comments.gmane.org/gmane.science.biology.informatics.conductor/41812

It seems to me that the more I get into bioinformatics, the less I know about everything. I tend to doubt everything, and I am always asking, step by step, whether I am doing things correctly.

Now I want to test some ideas on a 450K methylation array data set. The main idea is to classify probesets into families according to their behavior with respect to some phenotype variables. I have several ideas I would like to try on this data, and the first step has been to import, review, visualize and try to understand the global structure of the beta values I have at hand.

Once the data were loaded, I made two box plots. One shows the distribution of beta values across the 40 samples, and the other shows the distribution across the first 100 probesets.

I have shared the plots from my Google Docs account:

https://docs.google.com/open?id=0Bw-_OWjrT9U4cTlZblR0UkVhWG8
https://docs.google.com/open?id=0Bw-_OWjrT9U4d3FTZTQtNWJFUVE

My question might sound stupid, but I want to understand in depth what is going on with these plots.

For the beta vs. probeset plot:

- I guess the variability is normal. Some probes are methylated most of the time, some are not, and there are a lot of differences in their behavior. This is the common behavior, isn't it?
- A box plot might not be the best choice here, because the distribution need not be unimodal, I think. Am I right?
- My intuition is that these values should be normalized if we were going to use something like SVM-RFE to do probeset selection. Again, is my intuition right?

For the beta vs. sample plot:

- The data distribution seems more regular than in the other plot. Is that an effect of the underlying normalization that GenomeStudio does? Or is it the way beta values across samples are supposed to behave?
- Although they seem regular, there are still small differences among the medians, which made me think. Would normalizing this data benefit the experiments that follow?

In general, I would like to know whether the plots show normal, expected behavior, or whether I should normalize them using a predefined or standard method.

Any help or hint will be greatly appreciated.

Regards,
Gustavo
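The two views described in the question (one summary per probe versus one summary per sample) can be sketched on simulated data. This is only an illustration under stated assumptions: the beta matrix is made up, the three methylation levels and the noise are arbitrary choices, and `simulate_betas` and `five_number_summary` are hypothetical helper names, not part of GenomeStudio or any Bioconductor package.

```python
import random
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the numbers a box plot draws."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))


def simulate_betas(n_probes=300, n_samples=40, seed=0):
    """Hypothetical beta matrix: one row per probe, one column per sample."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_probes):
        # Assumption: each probe has its own typical methylation level.
        center = rng.choice([0.05, 0.5, 0.9])
        rows.append(
            [min(1.0, max(0.0, rng.gauss(center, 0.03))) for _ in range(n_samples)]
        )
    return rows


betas = simulate_betas()

# "Beta vs. probeset": one summary per row, for the first 100 probes.
probe_medians = [five_number_summary(row)[2] for row in betas[:100]]

# "Beta vs. sample": one summary per column, each pooling all probes.
sample_medians = [five_number_summary(col)[2] for col in zip(*betas)]

probe_spread = max(probe_medians) - min(probe_medians)
sample_spread = max(sample_medians) - min(sample_medians)
print(f"spread of per-probe medians:  {probe_spread:.3f}")
print(f"spread of per-sample medians: {sample_spread:.3f}")
```

In this toy setup the per-probe medians differ widely while the per-sample medians are close to each other, simply because each sample pools many probes. It says nothing about whether real data need normalization.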
@kasper-daniel-hansen-2979
Last seen 18 months ago
United States
On Wed, Jul 4, 2012 at 4:54 AM, Gustavo Fernández Bayón <gbayon at gmail.com> wrote:

> Hi everybody.
>
> Not so long ago, I asked in this list about some normalization issues. The question and its very interesting replies, from which I have learned a lot, can be found here:
>
> http://comments.gmane.org/gmane.science.biology.informatics.conductor/41812
>
> It seems to me that, the more I am getting into Bioinformatics, the less I know about everything. I usually doubt about everything, and I am always asking, step by step, if I am doing things correctly.
>
> Now, I want to test some ideas on a 450K methylation array data base. Main idea is to try to classify probesets in families according to their behavior with respect to some phenotype variables. I have several ideas I would like to try on this data, and the first step has been to import, review, visualize and try to understand the global structure of the beta values I have at my hand.
>
> Once loaded, I have made two box plots. One shows the distribution of beta values among the 40 samples, and the other shows the distribution among the first 100 probesets.
>
> I have shared the plots at my Google Docs account:
>
> https://docs.google.com/open?id=0Bw-_OWjrT9U4cTlZblR0UkVhWG8
> https://docs.google.com/open?id=0Bw-_OWjrT9U4d3FTZTQtNWJFUVE
>
> My question might sound stupid, but I want to deeply understand what is going on with these plots.
>
> For the beta vs. probeset:

"Probesets" is a pretty bad word to use here, I think.

> - I guess the variability is normal. Some probes are methylated most of the time, some not, and there are a lot of differences in their behavior. This is the common behavior, isn't it?

Yes, of course.

> - Boxplot might not be the best solution here, because the distribution need not to be unimodal, I think. Am I right?

Probably.

> - My intuition is that these values should be normalized in case we were going to use something like SVM-RFE to do probeset selection. Again, is my intuition right?

Well, there is no way around the fact that different CpGs will have vastly different behavior. If your model selection procedure needs similar distributions for each feature, you will need to use a different selection tool.

> For the beta vs. sample:
>
> - Data distribution seems more regular than in the other plot. Is that an effect of the underlying normalization that GenomeStudio does? Or is the way beta values across samples are supposed to behave?

This is the consequence of now looking at the marginal behavior across the genome (or at least the part of the genome assayed by the 450k array).

> - Although they seem regular, there are still small differences among medians, which made me think. Would a normalization of this data benefit following experiments?

Perhaps. Note that, depending on the question and the samples, you cannot assume that these marginal distributions are the same. For example, in cancer they are known to be very different. See for example

Hansen, K. D. et al. Increased methylation variation in epigenetic domains across cancer types. Nat Genet 43, 768-775 (2011).

This means you probably have to be smart, like

Aryee, M. J. et al. Accurate genome-scale percentage DNA methylation estimates from microarray data. Biostatistics 12, 197-210 (2011).

Unfortunately, it is not clear that this trick (using many background probes) can be done on 450k. But perhaps you can assume that these marginal distributions are similar in your experiment.

> In general, I would like to know if the plots show a normal behavior, if it is the expected one, or if I should kind of normalize them using a predefined or standard method.

This is still an open problem. You should do whatever normalization helps you to improve signal to noise in the context of your analysis. Of course, this is a general observation that doesn't help you much.

Kasper
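To make the caveat about marginal distributions concrete, here is a minimal sketch of quantile normalization across samples. Note the assumptions: quantile normalization is chosen here purely as an illustration and is not a method recommended in this thread, the beta values are invented, and `quantile_normalize` is a hypothetical helper, not a function from any package. The point is that this kind of method forces every sample to share one marginal distribution, which is exactly the assumption that may fail (for example, cancer versus normal).

```python
def quantile_normalize(samples):
    """Force every sample to share the same marginal distribution.

    `samples` is a list of equal-length lists, one per sample. Ties are
    broken by position rather than averaged, to keep the sketch short.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value over samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # keep the rank, replace the value
        normalized.append(out)
    return normalized


# Hypothetical beta values for four probes in two samples.
sample_a = [0.10, 0.80, 0.50, 0.90]
sample_b = [0.20, 0.85, 0.60, 0.95]
norm_a, norm_b = quantile_normalize([sample_a, sample_b])
print(norm_a)
print(norm_b)
```

After the transformation both samples contain the same set of values, so any genuine global difference between them (such as a shift in overall methylation) is removed along with the technical variation. Whether that is acceptable depends on the samples and the question.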
Hi, Kasper.

First of all, thank you for your reply. Every hint or helping hand I get from this mailing list is truly appreciated. Replies inline below...

On Wednesday, July 4, 2012 at 16:33, Kasper Daniel Hansen wrote:

> [...]
> probesets is a pretty bad word to use here, I think.

Why? Should I use the term 'probe' instead? I see I usually use both of them to refer to the same thing, and maybe I am wrong.

>> - I guess the variability is normal. Some probes are methylated most of the time, some not, and there are a lot of differences in their behavior. This is the common behavior, isn't it?
>
> Yes, of course

Thank you. Obvious things like this tend to blur when trying to learn too many things at the same time.

>> - Boxplot might not be the best solution here, because the distribution need not to be unimodal, I think. Am I right?
>
> Probably
>
>> - My intuition is that these values should be normalized in case we were going to use something like SVM-RFE to do probeset selection. Again, is my intuition right?
>
> Well, there is no way around the fact that different CpGs will have
> vastly different behavior. If your model selection procedure needs
> similar distributions for each feature, you will need to use a
> different selection tool.

I agree. But, for example, if you are going to use a non-parametric method like k-nearest neighbours, which is based on distances, probes with greater ranges are going to dominate the distances and so have more influence than they should, won't they? The same applies if you want to classify control vs. case with an SVM or a neural network.

>> For the beta vs. sample:
>>
>> - Data distribution seems more regular than in the other plot. Is that an effect of the underlying normalization that GenomeStudio does? Or is the way beta values across samples are supposed to behave?
>
> This is the consequence of now looking at the marginal behavior across
> the genome (or at least the part of the genome assayed by 450k).

I think you have helped me clear my mind a bit here. I had not thought about it that way.

>> - Although they seem regular, there are still small differences among medians, which made me think. Would a normalization of this data benefit following experiments?
>
> Perhaps. Note that the - depending on the question and the samples -
> you cannot assume that these marginal distributions are the same. For
> example, in cancer they are know to be very different. See for example
> Hansen, K. D. et al. Increased methylation variation in epigenetic
> domains across cancer types. Nat Genet 43, 768-775 (2011).
> This means you probably have to be smart, like
> Aryee, M. J. et al. Accurate genome-scale percentage DNA methylation
> estimates from microarray data. Biostatistics 12, 197-210 (2011).
> Unfortunately, it is not clear that this trick (using many background
> probes) can be done on 450k. But perhaps you can assume that these
> marginal distributions are similar in your experiment.

Thank you for the references. I'll check them to see if I can get things clear.

>> In general, I would like to know if the plots show a normal behavior, if it is the expected one, or if I should kind of normalize them using a predefined or standard method.
>
> This is still an open problem. You should do whatever normalization
> helps you to improve signal to noise in the context of your analysis.
> Of course, this is a general observation that doesn't help you much

This time, I think you are wrong. ;) I think this general observation helps more than it seems. As a newbie in this field, there are times when I do not know whether I am facing an open problem or whether I should use a standard technique.

> Kasper

Regards,
Gus
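The concern about distance-based methods raised in the reply above can be illustrated with a small sketch. It assumes invented beta values for two probes, Euclidean distance, and per-probe z-scoring as one possible scaling. `standardize` and `share_of_first_probe` are hypothetical helper names. Whether such scaling is appropriate for real 450k data is a separate question that this sketch does not settle.

```python
import statistics


def standardize(rows):
    """Z-score each column (probe) across the rows (samples)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


def share_of_first_probe(a, b):
    """Fraction of the squared Euclidean distance contributed by probe 0."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared[0] / sum(squared)


# Hypothetical beta values: rows are samples, columns are two probes.
# Probe 0 spans a wide range; probe 1 varies only slightly.
samples = [
    [0.10, 0.48],
    [0.90, 0.49],
    [0.15, 0.52],
    [0.85, 0.51],
]
scaled = standardize(samples)

raw_share = share_of_first_probe(samples[0], samples[3])
scaled_share = share_of_first_probe(scaled[0], scaled[3])
print(f"probe 0 share of squared distance, raw:    {raw_share:.3f}")
print(f"probe 0 share of squared distance, scaled: {scaled_share:.3f}")
```

On the raw values the wide-range probe accounts for almost all of the distance between the first and last samples. After z-scoring, the two probes contribute roughly equally. Scaling also amplifies low-variance probes, noise included, so it is a trade-off rather than a free improvement.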