Question

Possible Bugs in Contrasts Design of edgeR

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 10.0 years ago

Hi, I am a PhD student from Dept. Statistics of Penn State University. I have been using edgeR for my RNA-seq data recently. As I moved forward with edgeR, I found there are probably some bugs in the latest documentation. My experimental design is exactly the same as the design (Comparisons Both Between and Within Subject) on page 32 in the latest documentation (last revised on 31 March 2013). From page 32 to 34, everything looks fine to me except for the contrasts defined at the end. If I understand it properly, with the procedure showed on those pages, I can estimate each effect in the design, which can be showed with colnames(design). Then, all contrasts should be based on those estimated effects. For the first contrast on page 34, it is trying to find genes that respond differently to the hormone in disease1 vs healthy patients. The contrast looks as follows: lrt <- glmLRT(fit, contrast = c(0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 1, 0)) However, it looks not right to me, and it only put weights on the 10th and 11th effects ("-1" and "1"), which are corresponding to "DiseaseHealthy:TreatmentHormone" and "DiseaseDisease1:TreatmentHormone". As I see, the intercept (baseline) represents the combined effect from Healthy, Patient 1, and None Treatment. In order to find genes that respond differently to the hormone in disease1 vs healthy patients (regardless which patient it is), you could not leave the weights for all 4th to 9th effects as zeros. Also, as the contrast compares disease1 to healthy, the weight in the contrast for the 2nd effect should not be zero as well. I think the right contrast should be as follows: lrt <- glmLRT(fit, contrast = c(0, 1, 0, -1/3, 1/3, 0, -1/3, 1/3, 0, -1, 1, 0)) Please feel free to correct me if I am wrong. Thanks, Yang Liu -- output of sessionInfo(): R version 2.15.2 (2012-10-26) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.0.8 limma_3.14.4 -- Sent via the guest posting facility at bioconductor.org.

edgeR edgeR • 873 views

ADD COMMENT • link updated 11.4 years ago by Gordon Smyth 51k • written 11.4 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2013-04-29

0

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 17 hours ago

WEHI, Melbourne, Australia

Dear Yang Liu, I believe that the documentation is correct as it stands. The coefficient "DiseaseHealthy:TreatmentHormone" represents the average log-fold-change after hormone treatment for healthy patients. The coefficient "DiseaseDisease1:TreatmentHormone" represents the average log-fold-change after hormone treatment for patients in disease group 1. So taking the difference of these two coefficients represents the contrast as described in the documentation. The linear model here is equivalent to three paired t-tests, one for each disease group. In effect we are fitting an additive model (patient+treatment) within each disease group. You are correct in your interpretation of the intercept term, but this is not relevant because it cancels out of any contrast. Best wishes Gordon ---------------- original message ---------------- [BioC] Possible Bugs in Contrasts Design of edgeR Yang Liu [guest] guest at bioconductor.org Thu Apr 25 17:59:21 CEST 2013 Hi, I am a PhD student from Dept. Statistics of Penn State University. I have been using edgeR for my RNA-seq data recently. As I moved forward with edgeR, I found there are probably some bugs in the latest documentation. My experimental design is exactly the same as the design (Comparisons Both Between and Within Subject) on page 32 in the latest documentation (last revised on 31 March 2013). From page 32 to 34, everything looks fine to me except for the contrasts defined at the end. If I understand it properly, with the procedure showed on those pages, I can estimate each effect in the design, which can be showed with colnames(design). Then, all contrasts should be based on those estimated effects. For the first contrast on page 34, it is trying to find genes that respond differently to the hormone in disease1 vs healthy patients. The contrast looks as follows: lrt <- glmLRT(fit, contrast = c(0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 1, 0)) However, it looks not right to me, and it only put weights on the 10th and 11th effects ("-1" and "1"), which are corresponding to "DiseaseHealthy:TreatmentHormone" and "DiseaseDisease1:TreatmentHormone". As I see, the intercept (baseline) represents the combined effect from Healthy, Patient 1, and None Treatment. In order to find genes that respond differently to the hormone in disease1 vs healthy patients (regardless which patient it is), you could not leave the weights for all 4th to 9th effects as zeros. Also, as the contrast compares disease1 to healthy, the weight in the contrast for the 2nd effect should not be zero as well. I think the right contrast should be as follows: lrt <- glmLRT(fit, contrast = c(0, 1, 0, -1/3, 1/3, 0, -1/3, 1/3, 0, -1, 1, 0)) Please feel free to correct me if I am wrong. Thanks, Yang Liu -- output of sessionInfo(): R version 2.15.2 (2012-10-26) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.0.8 limma_3.14.4 -- ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 11.4 years ago Gordon Smyth 51k

0

Entering edit mode

Dear Prof. Smyth, Thanks for your reply. Great to hear back from you. I see your points, but it still confuses me. First, thank you for confirming that my interpretation of the intercept is correct. To me, it also cancels out in any contrast, as you may find the 1st coefficient of my contrast is also zero. Also, I agree that we are fitting an additive model (patient + treatment) within each disease group, that is why there is no interaction between patient and treatment. However, the case is that we are estimating the patient's effect nested in each disease as a fixed effect (not a random effect). Then, the patients in Healthy group are different from the patients in Disease1 group, hence patient's nested effects could not cancel out in the contrast. Generally, based on the design (see the attached figure), if we are trying to find genes that respond differently to the hormone in disease1 vs healthy patients, we should average observation 8, 10 and 12 (the average log-fold-change after hormone treatment for patients in disease group 1), then subtract the average of observation 2, 4 and 6 (the average log-fold-change after hormone treatment for healthy patients). And it ends up with my contrast showed in my previous thread (i.e. *lrt <- glmLRT(fit, contrast = c(0, 1, 0, -1/3, 1/3, 0, -1/3, 1/3, 0, -1, 1, 0))*) Also, we can find in the design, the observation 1 (first row in the design, healthy patient 1 without treatment) has only coefficient on Intercept, which is consistent with the concept that the intercept (baseline) represents the combined effect from Healthy, Patient 1, and None Treatment. Meanwhile, besides intercept, the observation 2 (second row in the design, healthy patient 1 with treatment) has another coefficient on "DiseaseHealthy:TreatmentHormone". Hence, we could say that the coefficient "DiseaseHealthy:TreatmentHormone" only represents the difference of log-fold-change between "with and without hormone treatment" for healthy patient 1. We can also check other observations. Especially, for the observation 4 (healthy patient 2 with treatment), it has coefficients on intercept, "DiseaseHealthy:Patient2", and "DiseaseHealthy:TreatmentHormone". Hence, I still insist that my contrast would be more appropriate. Hope my points make sense to you. If not, please feel free to correct me. Thanks for your time and helps. Sincerely, Yang On Mon, Apr 29, 2013 at 4:17 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Yang Liu, > > I believe that the documentation is correct as it stands. > > The coefficient "DiseaseHealthy:**TreatmentHormone" represents the > average log-fold-change after hormone treatment for healthy patients. The > coefficient "DiseaseDisease1:**TreatmentHormone" represents the average > log-fold-change after hormone treatment for patients in disease group 1. So > taking the difference of these two coefficients represents the contrast as > described in the documentation. > > The linear model here is equivalent to three paired t-tests, one for each > disease group. In effect we are fitting an additive model > (patient+treatment) within each disease group. > > You are correct in your interpretation of the intercept term, but this is > not relevant because it cancels out of any contrast. > > Best wishes > Gordon > > ---------------- original message ---------------- > > [BioC] Possible Bugs in Contrasts Design of edgeR > Yang Liu [guest] guest at bioconductor.org > Thu Apr 25 17:59:21 CEST 2013 > > Hi, > > I am a PhD student from Dept. Statistics of Penn State University. I have > been using edgeR for my RNA-seq data recently. As I moved forward with > edgeR, I found there are probably some bugs in the latest documentation. > > My experimental design is exactly the same as the design (Comparisons Both > Between and Within Subject) on page 32 in the latest documentation (last > revised on 31 March 2013). From page 32 to 34, everything looks fine to me > except for the contrasts defined at the end. > > If I understand it properly, with the procedure showed on those pages, I > can estimate each effect in the design, which can be showed with > colnames(design). Then, all contrasts should be based on those estimated > effects. For the first contrast on page 34, it is trying to find genes that > respond differently to the hormone in disease1 vs healthy patients. The > contrast looks as follows: > > lrt <- glmLRT(fit, contrast = c(0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 1, 0)) > > However, it looks not right to me, and it only put weights on the 10th and > 11th effects ("-1" and "1"), which are corresponding to "DiseaseHealthy:**TreatmentHormone" > and "DiseaseDisease1:**TreatmentHormone". As I see, the intercept > (baseline) represents the combined effect from Healthy, Patient 1, and None > Treatment. In order to find genes that respond differently to the hormone > in disease1 vs healthy patients (regardless which patient it is), you could > not leave the weights for all 4th to 9th effects as zeros. Also, as the > contrast compares disease1 to healthy, the weight in the contrast for the > 2nd effect should not be zero as well. I think the right contrast should be > as follows: > > lrt <- glmLRT(fit, contrast = c(0, 1, 0, -1/3, 1/3, 0, -1/3, 1/3, 0, -1, > 1, 0)) > > Please feel free to correct me if I am wrong. > > Thanks, > > Yang Liu > > -- output of sessionInfo(): > > R version 2.15.2 (2012-10-26) > Platform: x86_64-apple-darwin9.8.0/x86_**64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.0.8 limma_3.14.4 > > -- > > ______________________________**______________________________**____ ______ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________**______________________________**____ ______ > > -------------- next part -------------- A non-text attachment was scrubbed... Name: design.png Type: image/png Size: 36262 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20130430="" 4bfc9f02="" attachment.png="">

ADD REPLY • link 11.4 years ago Yang Liu ▴ 10

0

Entering edit mode

Dear Yang, On Tue, 30 Apr 2013, Yang Liu wrote: > Dear Prof. Smyth, > > Thanks for your reply. Great to hear back from you. I see your points, but > it still confuses me. > > First, thank you for confirming that my interpretation of the intercept is > correct. To me, it also cancels out in any contrast, as you may find the > 1st coefficient of my contrast is also zero. > > Also, I agree that we are fitting an additive model (patient + treatment) > within each disease group, that is why there is no interaction between > patient and treatment. > > However, the case is that we are estimating the patient's effect nested in > each disease as a fixed effect (not a random effect). Then, the patients in > Healthy group are different from the patients in Disease1 group, hence > patient's nested effects could not cancel out in the contrast. You are not fully understanding how a paired analysis works. Treatments are compared only within patients. The between-patient error strata is not used at all in the analysis. > Generally, based on the design (see the attached figure), if we are > trying to find genes that respond differently to the hormone in disease1 > vs healthy patients, we should average observation 8, 10 and 12 (the > average log-fold-change after hormone treatment for patients in disease > group 1), then subtract the average of observation 2, 4 and 6 (the > average log-fold-change after hormone treatment for healthy patients). No, this is in incorrect. You have failed to adjust for the baseline expression level for each individual patient when treatment="none". If this is still unclear, your PhD advisor or professors might be able to help. Best wishes Gordon > And it ends up with my contrast showed in my previous thread (i.e. *lrt > <- glmLRT(fit, contrast = c(0, 1, 0, -1/3, 1/3, 0, -1/3, 1/3, 0, -1, 1, > 0))*) > > Also, we can find in the design, the observation 1 (first row in the > design, healthy patient 1 without treatment) has only coefficient on > Intercept, which is consistent with the concept that the intercept > (baseline) represents the combined effect from Healthy, Patient 1, and None > Treatment. Meanwhile, besides intercept, the observation 2 (second row in > the design, healthy patient 1 with treatment) has another coefficient on > "DiseaseHealthy:TreatmentHormone". Hence, we could say that the coefficient > "DiseaseHealthy:TreatmentHormone" only represents the difference of > log-fold-change between "with and without hormone treatment" for healthy > patient 1. We can also check other observations. Especially, for the > observation 4 (healthy patient 2 with treatment), it has coefficients on > intercept, "DiseaseHealthy:Patient2", and > "DiseaseHealthy:TreatmentHormone". > > Hence, I still insist that my contrast would be more appropriate. Hope my > points make sense to you. If not, please feel free to correct me. Thanks > for your time and helps. > > Sincerely, > > Yang > > > > On Mon, Apr 29, 2013 at 4:17 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Yang Liu, >> >> I believe that the documentation is correct as it stands. >> >> The coefficient "DiseaseHealthy:**TreatmentHormone" represents the >> average log-fold-change after hormone treatment for healthy patients. The >> coefficient "DiseaseDisease1:**TreatmentHormone" represents the average >> log-fold-change after hormone treatment for patients in disease group 1. So >> taking the difference of these two coefficients represents the contrast as >> described in the documentation. >> >> The linear model here is equivalent to three paired t-tests, one for each >> disease group. In effect we are fitting an additive model >> (patient+treatment) within each disease group. >> >> You are correct in your interpretation of the intercept term, but this is >> not relevant because it cancels out of any contrast. >> >> Best wishes >> Gordon >> >> ---------------- original message ---------------- >> >> [BioC] Possible Bugs in Contrasts Design of edgeR >> Yang Liu [guest] guest at bioconductor.org >> Thu Apr 25 17:59:21 CEST 2013 >> >> Hi, >> >> I am a PhD student from Dept. Statistics of Penn State University. I have >> been using edgeR for my RNA-seq data recently. As I moved forward with >> edgeR, I found there are probably some bugs in the latest documentation. >> >> My experimental design is exactly the same as the design (Comparisons Both >> Between and Within Subject) on page 32 in the latest documentation (last >> revised on 31 March 2013). From page 32 to 34, everything looks fine to me >> except for the contrasts defined at the end. >> >> If I understand it properly, with the procedure showed on those pages, I >> can estimate each effect in the design, which can be showed with >> colnames(design). Then, all contrasts should be based on those estimated >> effects. For the first contrast on page 34, it is trying to find genes that >> respond differently to the hormone in disease1 vs healthy patients. The >> contrast looks as follows: >> >> lrt <- glmLRT(fit, contrast = c(0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 1, 0)) >> >> However, it looks not right to me, and it only put weights on the 10th and >> 11th effects ("-1" and "1"), which are corresponding to "DiseaseHealthy:**TreatmentHormone" >> and "DiseaseDisease1:**TreatmentHormone". As I see, the intercept >> (baseline) represents the combined effect from Healthy, Patient 1, and None >> Treatment. In order to find genes that respond differently to the hormone >> in disease1 vs healthy patients (regardless which patient it is), you could >> not leave the weights for all 4th to 9th effects as zeros. Also, as the >> contrast compares disease1 to healthy, the weight in the contrast for the >> 2nd effect should not be zero as well. I think the right contrast should be >> as follows: >> >> lrt <- glmLRT(fit, contrast = c(0, 1, 0, -1/3, 1/3, 0, -1/3, 1/3, 0, -1, >> 1, 0)) >> >> Please feel free to correct me if I am wrong. >> >> Thanks, >> >> Yang Liu >> >> -- output of sessionInfo(): >> >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-apple-darwin9.8.0/x86_**64 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] edgeR_3.0.8 limma_3.14.4 >> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD REPLY • link 11.4 years ago Gordon Smyth 51k