confusing P-value of one gene

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 8 days ago

EMBL European Molecular Biology Laborat…

Dear Xinwei as far as I can see, this is a case where transformation of the count data (e.g. regularized log, or variance-stabilizing) and ANOVA with a normal linear model should give useful results. Since these transformations are like the logarithm for higher counts but avoid the singularity around zero. One place to start with that is chapter 2 of the DESeq2 vignette, functions rlogTransformation and varianceStabilizingTransformation. Doing the testing in this way is also not entirely without pitfalls (e.g. it tends to have less statistical power esp. in the small sample and low-counts regime, and may be susceptible to library-size related biases). I'd be interested to hear how it turns out in your case. Best wishes Wolfgang On Aug 29, 2013, at 4:15 am, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Xinwei, > > This is a correct result. The reason that the interaction is not statistically significant is inherent in the log-linear model, and hence in the definition of interaction for this sort of model. > > You are probably thinking that the cpm values are much higher for the joint condition CX&RGF than for the other conditions, hence there should be a positive interaction, and this should be statistically significant. > > Indeed, had you tested the joint condition vs the other three conditions it would certainly be significantly higher. > > However the interaction is different. The problem is that there are zero counts for the controls. Hence the fold change from control to CX is infinity, and the fold change from control to RGF is infinity. Hence the counts in the joint condition can be indefinitely large even the absence of any positive interaction. Hence there is no evidence for any positive interaction. In fact, you could make the counts for the CX&RGF libraries as large as you like, and the interaction would never become significant. To make this clear, the counts could have been: > > 0 0 0 0 0 1 0 0 1 1e10 1e10 1e10 > > and this would not give a significant interaction. So long as there are zero counts for the controls, and least one count for the single treatments CX and RGF, the interaction will never become significant. > > You should ignore the logFC in this case, because the interaction logFC is not defined in any meaningful way for this data. > > On the other hand, if you had any positive counts for the controls, then the interaction would suddenly become significant, because the fold changes from control to CX and control to RGF would now be finite. > > I suspect that you might find it more meaningful to test for > > CX&RGF - (control+CX+RGF)/3 > > This will certainly be significant. Or else test for CX&RGF vs each of the other three individually. > > As I've said before, I am not a fan of factorial interaction models for genomic data, and this is yet another example of why this is so. > > Best wishes > Gordon > > > On Wed, 28 Aug 2013, Xinwei Han wrote: > >> Hi, >> >> I manually checked p-values from edgeR and found the p-value of this particular gene, AT1G04500, difficult to understand. The CPM of this gene is like this: >> >> control replicate1: 0 >> control replicate2: 0 >> control replicate3: 0 >> CX replicate1: 0 >> CX replicate2: 0.24 >> CX replicate3: 0 >> RGF replicate1: 0 >> RGF replicate2: 0.14 >> RGF replicate3: 0.19 >> CX&RGF replicate1: 25.14 >> CX&RGF replicate2: 44.36 >> CX&RGF replicate3: 34.62 >> >> I fitted GLM with model.matrix(~RGF + CX + RGF:CX). To find out genes under significant interaction effect, lrt <- glmLRT(fit, coef=4) gives the following results to this gene: >> >> logFC: 5.43 >> logCPM: 3.19 >> LR: 0.012 >> PValue: 0.91 >> >> I do not understand why such dramatic change and such large logFC have p-value of 0.91. I attached the data and R script I used. Could you take a look to see whether I did something wrong in the script? Or there are some other reasons for that? >> >> I used the latest version of R and edgeR. "ms" in the data and script is >> the control. >> >> Thanks >> Xinwei >> > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:4}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

edgeR DESeq2 edgeR DESeq2 • 1.0k views

ADD COMMENT • link updated 10.6 years ago by Gordon Smyth 50k • written 10.6 years ago by Wolfgang Huber ★ 13k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

Well, if you were going to go move to a normal analysis, then using voom() will achieve the same end without (I think) incuring so much loss of power or being so susceptible to library size biases. But really, the problem is not with a singularity, but with what one means by an interaction. As far as I can see, using varianceStabilizingTransformation etc shrinks fold changes and changes the meaning of interactions means in an underpredictable way. Interactions are not simply shrunk: a negative interaction can be turned into a positive interaction and vice versa, and the degree to which this will happen is determined by the mean-variance relationship. voom() or cpm() change interactions also, but at least with those the effect is well understood, not determined by something that should be unrelated, namely the mean-variance relationship. Personally, I'd rather think about what it is one wants to estimate and what hypothesis is tested, rather than allowing these things to be determined as a side-effect by variance stabilization. Gordon > Date: Sun, 1 Sep 2013 16:05:45 +0200 > From: Wolfgang Huber <whuber at="" embl.de=""> > Cc: Bioconductor mailing list <bioconductor at="" r-project.org="">, Xinwei Han > <xinwei.han at="" duke.edu=""> > Subject: Re: [BioC] confusing P-value of one gene > > Dear Xinwei > > as far as I can see, this is a case where transformation of the count > data (e.g. regularized log, or variance-stabilizing) and ANOVA with a > normal linear model should give useful results. Since these > transformations are like the logarithm for higher counts but avoid the > singularity around zero. > > One place to start with that is chapter 2 of the DESeq2 vignette, > functions rlogTransformation and varianceStabilizingTransformation. > Doing the testing in this way is also not entirely without pitfalls > (e.g. it tends to have less statistical power esp. in the small sample > and low-counts regime, and may be susceptible to library-size related > biases). I'd be interested to hear how it turns out in your case. > > Best wishes > Wolfgang > > > > > > On Aug 29, 2013, at 4:15 am, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Xinwei, >> >> This is a correct result. The reason that the interaction is not >> statistically significant is inherent in the log-linear model, and >> hence in the definition of interaction for this sort of model. >> >> You are probably thinking that the cpm values are much higher for the >> joint condition CX&RGF than for the other conditions, hence there >> should be a positive interaction, and this should be statistically >> significant. >> >> Indeed, had you tested the joint condition vs the other three >> conditions it would certainly be significantly higher. >> >> However the interaction is different. The problem is that there are >> zero counts for the controls. Hence the fold change from control to CX >> is infinity, and the fold change from control to RGF is infinity. >> Hence the counts in the joint condition can be indefinitely large even >> the absence of any positive interaction. Hence there is no evidence >> for any positive interaction. In fact, you could make the counts for >> the CX&RGF libraries as large as you like, and the interaction would >> never become significant. To make this clear, the counts could have >> been: >> >> 0 0 0 0 0 1 0 0 1 1e10 1e10 1e10 >> >> and this would not give a significant interaction. So long as there are >> zero counts for the controls, and least one count for the single >> treatments CX and RGF, the interaction will never become significant. >> >> You should ignore the logFC in this case, because the interaction logFC >> is not defined in any meaningful way for this data. >> >> On the other hand, if you had any positive counts for the controls, >> then the interaction would suddenly become significant, because the >> fold changes from control to CX and control to RGF would now be finite. >> >> I suspect that you might find it more meaningful to test for >> >> CX&RGF - (control+CX+RGF)/3 >> >> This will certainly be significant. Or else test for CX&RGF vs each of >> the other three individually. >> >> As I've said before, I am not a fan of factorial interaction models for >> genomic data, and this is yet another example of why this is so. >> >> Best wishes >> Gordon >> >> >> On Wed, 28 Aug 2013, Xinwei Han wrote: >> >>> Hi, >>> >>> I manually checked p-values from edgeR and found the p-value of this >>> particular gene, AT1G04500, difficult to understand. The CPM of this >>> gene is like this: >>> >>> control replicate1: 0 >>> control replicate2: 0 >>> control replicate3: 0 >>> CX replicate1: 0 >>> CX replicate2: 0.24 >>> CX replicate3: 0 >>> RGF replicate1: 0 >>> RGF replicate2: 0.14 >>> RGF replicate3: 0.19 >>> CX&RGF replicate1: 25.14 >>> CX&RGF replicate2: 44.36 >>> CX&RGF replicate3: 34.62 >>> >>> I fitted GLM with model.matrix(~RGF + CX + RGF:CX). To find out genes >>> under significant interaction effect, lrt <- glmLRT(fit, coef=4) gives >>> the following results to this gene: >>> >>> logFC: 5.43 >>> logCPM: 3.19 >>> LR: 0.012 >>> PValue: 0.91 >>> >>> I do not understand why such dramatic change and such large logFC have >>> p-value of 0.91. I attached the data and R script I used. Could you >>> take a look to see whether I did something wrong in the script? Or >>> there are some other reasons for that? >>> >>> I used the latest version of R and edgeR. "ms" in the data and script >>> is the control. >>> >>> Thanks >>> Xinwei ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 10.6 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Gordon you're refering to the fact that what is an interaction depends on the scale of the measurement and is generally not invariant under monotonous transformations. This is inconvenient, but one needs to deal with it. The concept is clearly useful, just think about drug- drug or drug-gene interactions. Deciding which scale is appropriate depends on the scientific question, and it can be subtle; there is no end of papers about this, incl. some of mine. But throwing up our hands and telling people that they should not use "factorial interaction models for genomic data" (as you did) is perhaps among the less helpful approaches. In Xinwei's case, with both main effects essentially zero and a signal only seen for the combined treatment, on most scales (incl. VST, voom, identity, whatever) this will result in an interaction, in accordance with Xinwei's intuition, and it is only the pathology of the logarithm around zero that makes the interaction disappear here. With statements on power or robustness, it will be good to move the level of discussion from proclamation to evidence. Best wishes Wolfgang On 3 Sep 2013, at 11:16, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Well, if you were going to go move to a normal analysis, then using voom() will achieve the same end without (I think) incuring so much loss of power or being so susceptible to library size biases. > > But really, the problem is not with a singularity, but with what one means by an interaction. As far as I can see, using varianceStabilizingTransformation etc shrinks fold changes and changes the meaning of interactions means in an underpredictable way. Interactions are not simply shrunk: a negative interaction can be turned into a positive interaction and vice versa, and the degree to which this will happen is determined by the mean-variance relationship. voom() or cpm() change interactions also, but at least with those the effect is well understood, not determined by something that should be unrelated, namely the mean-variance relationship. > > Personally, I'd rather think about what it is one wants to estimate and what hypothesis is tested, rather than allowing these things to be determined as a side-effect by variance stabilization. > > Gordon > > >> Date: Sun, 1 Sep 2013 16:05:45 +0200 >> From: Wolfgang Huber <whuber at="" embl.de=""> >> Cc: Bioconductor mailing list <bioconductor at="" r-project.org="">, Xinwei Han >> <xinwei.han at="" duke.edu=""> >> Subject: Re: [BioC] confusing P-value of one gene >> >> Dear Xinwei >> >> as far as I can see, this is a case where transformation of the count data (e.g. regularized log, or variance-stabilizing) and ANOVA with a normal linear model should give useful results. Since these transformations are like the logarithm for higher counts but avoid the singularity around zero. >> >> One place to start with that is chapter 2 of the DESeq2 vignette, functions rlogTransformation and varianceStabilizingTransformation. Doing the testing in this way is also not entirely without pitfalls (e.g. it tends to have less statistical power esp. in the small sample and low-counts regime, and may be susceptible to library-size related biases). I'd be interested to hear how it turns out in your case. >> >> Best wishes >> Wolfgang >> >> >> >> >> >> On Aug 29, 2013, at 4:15 am, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: >> >>> Dear Xinwei, >>> >>> This is a correct result. The reason that the interaction is not statistically significant is inherent in the log-linear model, and hence in the definition of interaction for this sort of model. >>> >>> You are probably thinking that the cpm values are much higher for the joint condition CX&RGF than for the other conditions, hence there should be a positive interaction, and this should be statistically significant. >>> >>> Indeed, had you tested the joint condition vs the other three conditions it would certainly be significantly higher. >>> >>> However the interaction is different. The problem is that there are zero counts for the controls. Hence the fold change from control to CX is infinity, and the fold change from control to RGF is infinity. Hence the counts in the joint condition can be indefinitely large even the absence of any positive interaction. Hence there is no evidence for any positive interaction. In fact, you could make the counts for the CX&RGF libraries as large as you like, and the interaction would never become significant. To make this clear, the counts could have been: >>> >>> 0 0 0 0 0 1 0 0 1 1e10 1e10 1e10 >>> >>> and this would not give a significant interaction. So long as there are zero counts for the controls, and least one count for the single treatments CX and RGF, the interaction will never become significant. >>> >>> You should ignore the logFC in this case, because the interaction logFC is not defined in any meaningful way for this data. >>> >>> On the other hand, if you had any positive counts for the controls, then the interaction would suddenly become significant, because the fold changes from control to CX and control to RGF would now be finite. >>> >>> I suspect that you might find it more meaningful to test for >>> >>> CX&RGF - (control+CX+RGF)/3 >>> >>> This will certainly be significant. Or else test for CX&RGF vs each of the other three individually. >>> >>> As I've said before, I am not a fan of factorial interaction models for genomic data, and this is yet another example of why this is so. >>> >>> Best wishes >>> Gordon >>> >>> >>> On Wed, 28 Aug 2013, Xinwei Han wrote: >>> >>>> Hi, >>>> >>>> I manually checked p-values from edgeR and found the p-value of this particular gene, AT1G04500, difficult to understand. The CPM of this gene is like this: >>>> >>>> control replicate1: 0 >>>> control replicate2: 0 >>>> control replicate3: 0 >>>> CX replicate1: 0 >>>> CX replicate2: 0.24 >>>> CX replicate3: 0 >>>> RGF replicate1: 0 >>>> RGF replicate2: 0.14 >>>> RGF replicate3: 0.19 >>>> CX&RGF replicate1: 25.14 >>>> CX&RGF replicate2: 44.36 >>>> CX&RGF replicate3: 34.62 >>>> >>>> I fitted GLM with model.matrix(~RGF + CX + RGF:CX). To find out genes under significant interaction effect, lrt <- glmLRT(fit, coef=4) gives the following results to this gene: >>>> >>>> logFC: 5.43 >>>> logCPM: 3.19 >>>> LR: 0.012 >>>> PValue: 0.91 >>>> >>>> I do not understand why such dramatic change and such large logFC have p-value of 0.91. I attached the data and R script I used. Could you take a look to see whether I did something wrong in the script? Or there are some other reasons for that? >>>> >>>> I used the latest version of R and edgeR. "ms" in the data and script is the control. >>>> >>>> Thanks >>>> Xinwei > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}

ADD REPLY • link 10.6 years ago Wolfgang Huber ★ 13k

Login before adding your answer.