Question: Regression analysis problem
0
8.0 years ago by
Hi everyone, I am wondering if anyone can help me understand the following output from limma. The problem goes like this: Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b, T2a and T2b and T3a and T3b)                 T1a T1b T2a T2b T3a T3b "gene1"    16    4    18    4    19    5 "gene2"    3      1     4     5     2     6 "gene3"    2      1     4     3     4     5 "gene4"    3      1     2     3     3    4 I created a design matrix:    > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3")     >Classes<-c(2,1,2,1,2,1)     >BMI<-c(45,20,34,12,25,19)     >Pair<-factor(Pair)     >Classes<-factor(Classes)     >design <- model.matrix(~Pair+Classes) > design   (Intercept) PairTWIN2 PairTWIN3 Classes2 1           1         0         0        1 2           1         0         0        0 3           1         1         0        1 4           1         1         0        0 5           1         0         1        1 6           1         0         1        0 ********** ...and fitted a linear model to the data to get the following results: ************************************************************ ID X.Intercept. PairTWIN2 PairTWIN3      Classes2   AveExpr          F 1 gene1     3.333333       1.0       2.0  1.333333e+01 11.000000 106.298724 2 gene2     2.500000       2.5       2.0 -1.000000e+00  3.500000 8.745648 3 gene3     1.333333       2.0       3.0  3.333333e-01  3.166667 7.430245 4 gene4     2.000000       0.5       1.5  9.064933e-16  2.666667 4.799441        P.Value    adj.P.Val 1 9.993042e-91 3.997217e-90 2 4.683758e-07 9.367515e-07 3 5.578117e-06 7.437489e-06 4 7.186523e-04 7.186523e-04 ************************************************************* As you can see, all the genes have adjusted p values less than 0.05. But if I just compute the differnces between the pairs              T1 T2 T3 gene1   12  14  14 gene2    2  -1  -4 gene3    1   1  -1 gene4    2  -1  -1 ....and used the design matrix (1,1,1), gene 1 comes up as significantly different within the pairs.  ID      logFC          t      P.Value    adj.P.Val         B 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 3 gene3  0.3333333  0.2666511 7.964821e-01 1.000000e+00 -5.772418 4 gene4  0.0000000  0.0000000 1.000000e+00 1.000000e+00 -5.804806 So, my questions are: Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting  the output? What does the p value in the 1st approach signify? And most importantly, am I applying the right model? Any help on this would be most appreciated. Thanks in advance, Khadeeja [[alternative HTML version deleted]]
limma • 804 views
modified 8.0 years ago by Sean Davis21k • written 8.0 years ago by khadeeja ismail400
0
8.0 years ago by
Sean Davis21k
United States
Sean Davis21k wrote:
On Thu, Oct 6, 2011 at 7:15 AM, khadeeja ismail <hajjja at="" yahoo.com=""> wrote: > Hi everyone, > > I am wondering if anyone can help me understand the following output from limma. > > The problem goes like this: > Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b,? T2a and T2b and T3a and T3b) > > > ??? ??? ??? ??? T1a T1b T2a T2b T3a T3b > > "gene1"??? 16??? 4??? 18??? 4??? 19??? 5 > "gene2"??? 3????? 1?? ? 4? ?? 5???? 2???? 6 > "gene3"??? 2????? 1?? ? 4???? 3???? 4? ?? 5 > "gene4"??? 3????? 1?? ? 2???? 3???? 3??? 4 > > > I created a design matrix: > > ?? > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") > ??? >Classes<-c(2,1,2,1,2,1) > ??? >BMI<-c(45,20,34,12,25,19) > ??? >Pair<-factor(Pair) > ??? >Classes<-factor(Classes) > ??? >design <- model.matrix(~Pair+Classes) > >> design > ? (Intercept) PairTWIN2 PairTWIN3 Classes2 > 1?????????? 1???????? 0???????? 0??????? 1 > 2?????????? 1???????? 0???????? 0??????? 0 > 3?????????? 1???????? 1???????? 0??????? 1 > 4?????????? 1???????? 1???????? 0??????? 0 > 5?????????? 1???????? 0???????? 1??????? 1 > 6?????????? 1???????? 0???????? 1??????? 0 > > > ********** > > ...and fitted a linear model to the data to get the following results: > ************************************************************ > ID X.Intercept. PairTWIN2 PairTWIN3????? Classes2?? AveExpr????????? F > 1 gene1???? 3.333333?????? 1.0?????? 2.0? 1.333333e+01 11.000000 106.298724 > 2 gene2???? 2.500000?????? 2.5?????? 2.0 -1.000000e+00? 3.500000?? 8.745648 > 3 gene3???? 1.333333?????? 2.0?????? 3.0? 3.333333e-01? 3.166667?? 7.430245 > 4 gene4???? 2.000000?????? 0.5?????? 1.5? 9.064933e-16? 2.666667?? 4.799441 > ?????? P.Value??? adj.P.Val > 1 9.993042e-91 3.997217e-90 > 2 4.683758e-07 9.367515e-07 > 3 5.578117e-06 7.437489e-06 > 4 7.186523e-04 7.186523e-04 > ************************************************************* You didn't show us the rest of the code. In particular, I suspect that you are using something like: topTable(X) You might try: topTable(X,coef=4) Sean > > As you can see, all the genes have adjusted p values less than 0.05. > > But if I just compute the differnces between the pairs > > ???????????? T1 T2 T3 > gene1 ? 12? 14? 14 > gene2??? 2? -1? -4 > gene3 ?? 1?? 1? -1 > gene4??? 2? -1? -1 > > > ....and used the design matrix (1,1,1), > gene 1 comes up as significantly different within the pairs. > > ?ID????? logFC????????? t????? P.Value??? adj.P.Val???????? B > 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 > 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 > 3 gene3? 0.3333333? 0.2666511 7.964821e-01 1.000000e+00 -5.772418 > 4 gene4? 0.0000000? 0.0000000 1.000000e+00 1.000000e+00 -5.804806 > > > So, my questions are: > Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting? the output? > > What does the p value in the 1st approach signify? > > And most importantly, am I applying the right model? > > Any help on this would be most appreciated. > > Thanks in advance, > > Khadeeja > ? ? ? ?[[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
Got it. Thanks Sean. ________________________________ From: Sean Davis <sdavis2@mail.nih.gov> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org> Sent: Thursday, October 6, 2011 2:25 PM Subject: Re: [BioC] Regression analysis problem > Hi everyone, > > I am wondering if anyone can help me understand the following output from limma. > > The problem goes like this: > Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b,� T2a and T2b and T3a and T3b) > > > ��� ��� ��� ��� T1a T1b T2a T2b T3a T3b > > "gene1"��� 16��� 4��� 18��� 4��� 19��� 5 > "gene2"��� 3����� 1�� � 4� �� 5���� 2���� 6 > "gene3"��� 2����� 1�� � 4���� 3���� 4� �� 5 > "gene4"��� 3����� 1�� � 2���� 3���� 3��� 4 > > > I created a design matrix: > > �� > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") > ��� >Classes<-c(2,1,2,1,2,1) > ��� >BMI<-c(45,20,34,12,25,19) > ��� >Pair<-factor(Pair) > ��� >Classes<-factor(Classes) > ��� >design <- model.matrix(~Pair+Classes) > >> design > � (Intercept) PairTWIN2 PairTWIN3 Classes2 > 1���������� 1�������� 0�������� 0������� 1 > 2���������� 1�������� 0�������� 0������� 0 > 3���������� 1�������� 1�������� 0������� 1 > 4���������� 1�������� 1�������� 0������� 0 > 5���������� 1�������� 0�������� 1������� 1 > 6���������� 1�������� 0�������� 1������� 0 > > > ********** > > ...and fitted a linear model to the data to get the following results: > ************************************************************ > ID X.Intercept. PairTWIN2 PairTWIN3����� Classes2�� AveExpr��������� F > 1 gene1���� 3.333333������ 1.0������ 2.0� 1.333333e+01 11.000000 106.298724 > 2 gene2���� 2.500000������ 2.5������ 2.0 -1.000000e+00� 3.500000�� 8.745648 > 3 gene3���� 1.333333������ 2.0������ 3.0� 3.333333e-01� 3.166667�� 7.430245 > 4 gene4���� 2.000000������ 0.5������ 1.5� 9.064933e-16� 2.666667�� 4.799441 > ������ P.Value��� adj.P.Val > 1 9.993042e-91 3.997217e-90 > 2 4.683758e-07 9.367515e-07 > 3 5.578117e-06 7.437489e-06 > 4 7.186523e-04 7.186523e-04 > ************************************************************* You didn't show us the rest of the code.� In particular, I suspect that you are using something like: topTable(X) You might try: topTable(X,coef=4) Sean > > As you can see, all the genes have adjusted p values less than 0.05. > > But if I just compute the differnces between the pairs > > ������������ T1 T2 T3 > gene1 � 12� 14� 14 > gene2��� 2� -1� -4 > gene3 �� 1�� 1� -1 > gene4��� 2� -1� -1 > > > ....and used the design matrix (1,1,1), > gene 1 comes up as significantly different within the pairs. > > �ID����� logFC��������� t����� P.Value��� adj.P.Val�������� B > 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 > 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 > 3 gene3� 0.3333333� 0.2666511 7.964821e-01 1.000000e+00 -5.772418 > 4 gene4� 0.0000000� 0.0000000 1.000000e+00 1.000000e+00 -5.804806 > > > So, my questions are: > Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting� the output? > > What does the p value in the 1st approach signify? > > And most importantly, am I applying the right model? > > Any help on this would be most appreciated. > > Thanks in advance, > > Khadeeja > � � � �[[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
>In particular, I suspect >that you are using something like: >topTable(X) >You might try: >topTable(X,coef=4) It worked with the small dataset. When I tried the same with a larger set (11 pairs and and some 150k genes) I'm getting different results from the two approaches (even with using  topTable(X, coef=12) in approach 1). I'm finding no significant genes from the first approach, but using the second approach I'm finding some. Is it possible? I thought I was doing the same thing in two different ways. Have I overlooked something again? Thanks, Khadeeja Targets<-readTargets(file="TargetsSelBMI.csv", sep="\t") Pair <- factor(Targets$Twin) Classes <- factor(Targets$HL) design <- model.matrix(~Pair + Classes) fit<-lmFit(data,design) fit.pair<-eBayes(fit) topGenes<-topTable(fit.pair, coef=12,number=2000) ________________________________ From: Sean Davis <sdavis2@mail.nih.gov> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org> Sent: Thursday, October 6, 2011 2:25 PM Subject: Re: [BioC] Regression analysis problem > Hi everyone, > > I am wondering if anyone can help me understand the following output from limma. > > The problem goes like this: > Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b, T2a and T2b and T3a and T3b) > > >                 T1a T1b T2a T2b T3a T3b > > "gene1"    16    4    18    4    19    5 > "gene2"    3      1     4     5     2     6 > "gene3"    2      1     4     3     4     5 > "gene4"    3      1     2     3     3    4 > > > I created a design matrix: > >    > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") >     >Classes<-c(2,1,2,1,2,1) >     >BMI<-c(45,20,34,12,25,19) >     >Pair<-factor(Pair) >     >Classes<-factor(Classes) >     >design <- model.matrix(~Pair+Classes) > >> design >   (Intercept) PairTWIN2 PairTWIN3 Classes2 > 1           1         0         0        1 > 2           1         0         0        0 > 3           1         1         0        1 > 4           1         1         0        0 > 5           1         0         1        1 > 6           1         0         1        0 > > > ********** > > ...and fitted a linear model to the data to get the following results: > ************************************************************ > ID X.Intercept. PairTWIN2 PairTWIN3      Classes2   AveExpr F > 1 gene1     3.333333       1.0       2.0  1.333333e+01 11.000000 106.298724 > 2 gene2     2.500000       2.5       2.0 -1.000000e+00  3.500000 8.745648 > 3 gene3     1.333333       2.0       3.0  3.333333e-01  3.166667 7.430245 > 4 gene4     2.000000       0.5       1.5  9.064933e-16  2.666667 4.799441 >        P.Value    adj.P.Val > 1 9.993042e-91 3.997217e-90 > 2 4.683758e-07 9.367515e-07 > 3 5.578117e-06 7.437489e-06 > 4 7.186523e-04 7.186523e-04 > ************************************************************* You didn't show us the rest of the code.  In particular, I suspect that you are using something like: topTable(X) You might try: topTable(X,coef=4) Sean > > As you can see, all the genes have adjusted p values less than 0.05. > > But if I just compute the differnces between the pairs > >              T1 T2 T3 > gene1   12  14  14 > gene2    2  -1  -4 > gene3    1   1  -1 > gene4    2  -1  -1 > > > ....and used the design matrix (1,1,1), > gene 1 comes up as significantly different within the pairs. > >  ID      logFC          t      P.Value    adj.P.Val         B > 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 > 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 > 3 gene3  0.3333333  0.2666511 7.964821e-01 1.000000e+00 -5.772418 > 4 gene4  0.0000000  0.0000000 1.000000e+00 1.000000e+00 -5.804806 > > > So, my questions are: > Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting  the output? > > What does the p value in the 1st approach signify? > > And most importantly, am I applying the right model? > > Any help on this would be most appreciated. > > Thanks in advance, > > Khadeeja >        [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
Hi Khadeeja, On 10/7/2011 9:42 AM, khadeeja ismail wrote: >> In particular, I suspect >> that you are using something like: >> topTable(X) >> You might try: >> topTable(X,coef=4) > > It worked with the small dataset. When I tried the same with a larger set (11 pairs and and some 150k genes) I'm getting different results from the two approaches (even with using topTable(X, coef=12) in approach 1). > I'm finding no significant genes from the first approach, but using the > second approach I'm finding some. Is it possible? I thought I was doing > the same thing in two different ways. Have I overlooked something again? I don't think you have overlooked anything, but there may be some misunderstanding. In your dummy data below, you have very few genes, and fewer samples. In the example you cite above, you use 150k genes and more samples. This will increase your power to detect differences in two ways. First, by increasing the number of genes, you can more accurately estimate an overall variance, which is then used in the eBayes() step to 'shrink' your observed variance towards this overall variance. This is one of the reasons that limma is so popular - by using information from all genes, you can increase the power to detect differences in individual genes. Second, when you increase the number of pairs you are using, your power to detect differences increases as well. This has nothing to do with limma per se; it is just that the power of a t-test increases as you increase the number of observations. Best, Jim > > > Thanks, > Khadeeja > > > > > Targets<-readTargets(file="TargetsSelBMI.csv", sep="\t") > > Pair<- factor(Targets$Twin) > Classes<- factor(Targets$HL) > design<- model.matrix(~Pair + Classes) > > fit<-lmFit(data,design) > fit.pair<-eBayes(fit) > > topGenes<-topTable(fit.pair, coef=12,number=2000) > > > > > > > > > ________________________________ > From: Sean Davis<sdavis2@mail.nih.gov> > > Cc: "bioconductor@r-project.org"<bioconductor@r-project.org> > Sent: Thursday, October 6, 2011 2:25 PM > Subject: Re: [BioC] Regression analysis problem > > >> Hi everyone, >> >> I am wondering if anyone can help me understand the following output from limma. >> >> The problem goes like this: >> Say I have methylation values for four genes from three twin pairs. I created some dummy data so that values of gene1 is clearly different within twin pairs. My interest here is to find out the genes that are significantly different within the pairs (i.e. between T1a and T1b, T2a and T2b and T3a and T3b) >> >> >> T1a T1b T2a T2b T3a T3b >> >> "gene1" 16 4 18 4 19 5 >> "gene2" 3 1 4 5 2 6 >> "gene3" 2 1 4 3 4 5 >> "gene4" 3 1 2 3 3 4 >> >> >> I created a design matrix: >> >> > Pair<-c("TWIN1","TWIN1","TWIN2","TWIN2","TWIN3","TWIN3") >> >Classes<-c(2,1,2,1,2,1) >> >BMI<-c(45,20,34,12,25,19) >> >Pair<-factor(Pair) >> >Classes<-factor(Classes) >> >design<- model.matrix(~Pair+Classes) >> >>> design >> (Intercept) PairTWIN2 PairTWIN3 Classes2 >> 1 1 0 0 1 >> 2 1 0 0 0 >> 3 1 1 0 1 >> 4 1 1 0 0 >> 5 1 0 1 1 >> 6 1 0 1 0 >> >> >> ********** >> >> ...and fitted a linear model to the data to get the following results: >> ************************************************************ >> ID X.Intercept. PairTWIN2 PairTWIN3 Classes2 AveExpr F >> 1 gene1 3.333333 1.0 2.0 1.333333e+01 11.000000 106.298724 >> 2 gene2 2.500000 2.5 2.0 -1.000000e+00 3.500000 8.745648 >> 3 gene3 1.333333 2.0 3.0 3.333333e-01 3.166667 7.430245 >> 4 gene4 2.000000 0.5 1.5 9.064933e-16 2.666667 4.799441 >> P.Value adj.P.Val >> 1 9.993042e-91 3.997217e-90 >> 2 4.683758e-07 9.367515e-07 >> 3 5.578117e-06 7.437489e-06 >> 4 7.186523e-04 7.186523e-04 >> ************************************************************* > You didn't show us the rest of the code. In particular, I suspect > that you are using something like: > > topTable(X) > > You might try: > > topTable(X,coef=4) > > Sean > >> As you can see, all the genes have adjusted p values less than 0.05. >> >> But if I just compute the differnces between the pairs >> >> T1 T2 T3 >> gene1 12 14 14 >> gene2 2 -1 -4 >> gene3 1 1 -1 >> gene4 2 -1 -1 >> >> >> ....and used the design matrix (1,1,1), >> gene 1 comes up as significantly different within the pairs. >> >> ID logFC t P.Value adj.P.Val B >> 1 gene1 13.3333333 10.6660452 5.234557e-06 2.093823e-05 46.016217 >> 2 gene2 -1.0000000 -0.7999534 4.468388e-01 8.936777e-01 -5.513313 >> 3 gene3 0.3333333 0.2666511 7.964821e-01 1.000000e+00 -5.772418 >> 4 gene4 0.0000000 0.0000000 1.000000e+00 1.000000e+00 -5.804806 >> >> >> So, my questions are: >> Why am I not getting the same (expected) result from my first approach? Or am I misinterpreting the output? >> >> What does the p value in the 1st approach signify? >> >> And most importantly, am I applying the right model? >> >> Any help on this would be most appreciated. >> >> Thanks in advance, >> >> Khadeeja >> [[alternative HTML version deleted]] >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues [[alternative HTML version deleted]]
On Fri, Oct 7, 2011 at 8:15 AM, James W. MacDonald <jmacdon@med.umich.edu>wrote: > > First, by increasing the number of genes, you can more accurately > estimate an overall variance, which is then used in the eBayes() step to > 'shrink' your observed variance towards this overall variance. This is > one of the reasons that limma is so popular - by using information from > all genes, you can increase the power to detect differences in > individual genes. > Careful though -- this is methylation data, which tends to be strongly bimodal. It's not clear that the assumption of a common variance across unmethylated, partially-methylated, and methylated sites is appropriate. I seem to recall Gordon Smyth commenting upon this at one point -- perhaps he'll chime in. > Second, when you increase the number of pairs you are using, your power > to detect differences increases as well. This has nothing to do with > limma per se; it is just that the power of a t-test increases as you > increase the number of observations. > This is of course appropriate at any time that you can get more high- quality samples rather than fewer :-) -- If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is.John von Neumann<http: www-groups.dcs.st-="" and.ac.uk="" ~history="" biographies="" von_neumann.html=""> [[alternative HTML version deleted]]
To follow up on Tim's point: I have yet to see any evidence that methylation data is anything other than ternary. The typical interpretation for the beta values is less than 0.25 = fully unmethylated greater than 0.75 = fully methylated between 0.25 and 0.75 = partially methylated Part of the evidence for this assertion is that we ahve several sets of data from samples treated with drugs that should fully methylate (or fully demethylate, respectively) everything. On those samples, 99% of the observed values are above 0.75 (or below 0.25, respectively). So I'm not convinced that t-tests have role to play in analyzing genome-wide methylation data.... Kevin On 10/7/2011 10:39 AM, Tim Triche, Jr. wrote: > On Fri, Oct 7, 2011 at 8:15 AM, James W. MacDonald<jmacdon at="" med.umich.edu="">wrote: > >> First, by increasing the number of genes, you can more accurately >> estimate an overall variance, which is then used in the eBayes() step to >> 'shrink' your observed variance towards this overall variance. This is >> one of the reasons that limma is so popular - by using information from >> all genes, you can increase the power to detect differences in >> individual genes. >> > Careful though -- this is methylation data, which tends to be strongly > bimodal. It's not clear that the assumption of a common variance across > unmethylated, partially-methylated, and methylated sites is appropriate. I > seem to recall Gordon Smyth commenting upon this at one point -- perhaps > he'll chime in. > > >> Second, when you increase the number of pairs you are using, your power >> to detect differences increases as well. This has nothing to do with >> limma per se; it is just that the power of a t-test increases as you >> increase the number of observations. >> > This is of course appropriate at any time that you can get more high-quality > samples rather than fewer :-) >
Dear Kevin, How you treat the DNA methylation data will very much depend on the specific biological question you are trying to address. It is pretty clear that if you have a mixture of different cell types, each with a different methylation pattern, that small yet biologically and clinically relevant changes in the cell type composition will lead to small yet statistically significant changes in methylation. The evidence for biologically relevant small changes in methylation (differences in means of ~0.1) is overwhelming, for instance in DNA methylation studies conducted on whole blood DNA where blood cell subtype composition changes in response to the presence of cancer. kind regards A. ********************************************************************** ********************************************************************** *** Andrew E Teschendorff PhD Heller Research Fellow Statistical Cancer Genomics Paul O'Gorman Building UCL Cancer Institute University College London 72 Huntley Street London WC1E 6BT, UK. Mob: +44 07876 561263 Email: a.teschendorff at ucl.ac.uk http://www.ucl.ac.uk/cancer/research- groups/statistical_cancer_genomics/index.htm ********************************************************************** ********************************************************************** ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] On Behalf Of Kevin R. Coombes [kevin.r.coombes@gmail.com] Sent: 07 October 2011 18:26 To: ttriche at usc.edu Cc: James W. MacDonald; Tim Triche, Jr.; bioconductor at r-project.org; Sean Davis; khadeeja ismail Subject: Re: [BioC] Regression analysis problem To follow up on Tim's point: I have yet to see any evidence that methylation data is anything other than ternary. The typical interpretation for the beta values is less than 0.25 = fully unmethylated greater than 0.75 = fully methylated between 0.25 and 0.75 = partially methylated Part of the evidence for this assertion is that we ahve several sets of data from samples treated with drugs that should fully methylate (or fully demethylate, respectively) everything. On those samples, 99% of the observed values are above 0.75 (or below 0.25, respectively). So I'm not convinced that t-tests have role to play in analyzing genome-wide methylation data.... Kevin On 10/7/2011 10:39 AM, Tim Triche, Jr. wrote: > On Fri, Oct 7, 2011 at 8:15 AM, James W. MacDonald<jmacdon at="" med.umich.edu="">wrote: > >> First, by increasing the number of genes, you can more accurately >> estimate an overall variance, which is then used in the eBayes() step to >> 'shrink' your observed variance towards this overall variance. This is >> one of the reasons that limma is so popular - by using information from >> all genes, you can increase the power to detect differences in >> individual genes. >> > Careful though -- this is methylation data, which tends to be strongly > bimodal. It's not clear that the assumption of a common variance across > unmethylated, partially-methylated, and methylated sites is appropriate. I > seem to recall Gordon Smyth commenting upon this at one point -- perhaps > he'll chime in. > > >> Second, when you increase the number of pairs you are using, your power >> to detect differences increases as well. This has nothing to do with >> limma per se; it is just that the power of a t-test increases as you >> increase the number of observations. >> > This is of course appropriate at any time that you can get more high-quality > samples rather than fewer :-) > _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor