Question

Contrasts for 3x2 factorial experiment in R/edgeR

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 12 hours ago

WEHI, Melbourne, Australia

Dear Ben, glmLRT() allows you to specify the contrast to be tested in two ways. If the contrast happens to correspond to a column of your design matrix, then it is one the original coefficients and you can just specify which one using the coef= argument. Otherwise you can use the contrast= argument to specify any contrast, using a numerical vector of the same length as the number of coefficients. Suppose you used the default treatment parametrization: contrasts(Strain) <- contr.treat(3) contrasts(Treatment) <- contr.treat(2) and fitted a model with Treatment nested within Strain: design <- model.matrix(~Strain+Strain:Treatment) Then the coefficients 4:6 would be the logFC after stimulation in strains A to C respectively. Hence you could test for differential expression after stimulation in strain B by: glmLRT(dge,fit,coef=5) and so on. You could equally well specify the same test by glmLRT(dge,fit,contrast=c(0,0,0,0,1,0,0)) To test for genes responding differently to simulation in strain B compared to strain A, you would use contrast=c(0,0,0,-1,1,0) And so on. Figuring out the appropriate contrast vectors using the sum-to-zero parametrization specified by contr.sum() is more tedious, and I will leave that to you. Best wishes Gordon > Date: Tue, 20 Dec 2011 16:44:42 -0500 > From: Benjamin King <bking at="" mdibl.org=""> > To: bioconductor at r-project.org > Subject: [BioC] Contrasts for 3x2 factorial experiment in R/edgeR > > Hello, > > I'm using R/edgeR to analyze a 3x2 factorial RNA-Seq experiment and I would very much appreciate your help in specifying some contrasts. > > The design of my experiment has two factors, strain and treatment. There are three strains (A, B and C) and two treatments (Unstimulated and Stimulated). I have four biological replicates (except for one sample group where there are only three). > > # Group Strain Treatment > # A.Un A Un > # B.Un B Un > # C.Un C Un > # A.Stim A Stim > # B.Stim B Stim > # C.Stim C Stim > > Using "model.matrix(~Strain*Treatment)" I believe the model coefficients using sum to zero parameterization are as follows. > > Intercept: (A.Un + B.Un + C.Un + A.Stim + B.Stim + C.Stim)/6 > Strain1: (-A.Un + B.Un - A.Stim + B.Stim)/4 > Strain2: (-A.Un + C.Un - A.Stim + C.Stim)/4 > Treatment1: (-A.Un - B.Un - C.Un + A.Stim + B.Stim + C.Stim)/6 > Strain1:Treatment1: (A.Un - B.Un - A.Stim + B.Stim)/4 > Strain2:Treatment2: (A.Un - C.Un - A.Stim + C.Stim)/4 > > I'm interested in these questions: > > 1. Which genes are differentially expressed in Strain B compared to Strain A? > 2. Which genes are differentially expressed in Strain C compared to Strain A? > 3. Which genes respond to Stimulated treatment overall? > 4. Which genes respond to Stimulated treatment in Strain B? > 5. Which genes respond to Stimulated treatment in Strain C? > 6. Which genes respond to Stimulated treatment differently in Strain B compared to Strain A? > 7. Which genes respond to Stimulated treatment differently in Strain B compared to Strain A? > > I believe questions 1, 2, 3, 6 and 7 are model coefficients. If so, how can these contrasts be calculated using the "coef" parameter in the glmLRT function? > > For questions 4 and 5, what is the correct syntax to define the contrast matrix using sum to zero parameterization that I would then pass to the glmLRT function? > > Below is my current script and session information. > > Thank you in advance for your help, > > - Ben > > > > library(edgeR) > > # 3x2 Factorial Design > # Strain Treatment > # A Un > # B Un > # C Un > # A Stim > # B Stim > # C Stim > > ## Specify factors > Strain <- factor(c("A","A","A","B","B","B","B","C","C","C","C","A"," A","A","A","B","B","B","B","C","C","C","C")) > Treatment <- factor(c("Un","Un","Un","Un","Un","Un","Un","Un","Un"," Un","Un","Stim","Stim","Stim","Stim","Stim","Stim","Stim","Stim","Stim ","Stim","Stim","Stim")) > > ## Read in count data > raw.data <- read.table("group_counts.txt",sep="\t",header=T) > d <- raw.data[, 2:24] > rownames(d) <- raw.data[, 1] > > # make DGE object > dge <- DGEList(counts = d, group = c("A.Un","A.Un","A.Un","B.Un","B. Un","B.Un","B.Un","C.Un","C.Un","C.Un","C.Un","A.Stim","A.Stim","A.Sti m","A.Stim","B.Stim","B.Stim","B.Stim","B.Stim","C.Stim","C.Stim","C.S tim","C.Stim")) > dge <- calcNormFactors(dge) > > # filter uninformative genes > m <- 1e6 * t(t(dge$counts) / dge$samples$lib.size) > ridx <- rowSums(m > 1) >= 2 > dge <- dge[ridx,] > > ## Design matrix > # treatment-contrast parameterization > contrasts(Strain) <- contr.sum(3) > contrasts(Treatment) <- contr.sum(2) > design <- model.matrix(~Strain*Treatment) > > ## Estimate Dispersion > dge <- estimateGLMCommonDisp(dge, design) > > ## Fit model with Common Dispersion > fit <- glmFit(dge, design, dispersion=dge$common.dispersion) > > > > >> sessionInfo() > R version 2.13.1 (2011-07-08) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] splines stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_2.2.5 > > loaded via a namespace (and not attached): > [1] limma_3.8.3 tools_2.13.1 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

• 2.5k views

ADD COMMENT • link updated 14.1 years ago by Benjamin King ▴ 30 • written 14.1 years ago by Gordon Smyth 53k

score 0 · Answer 1 · 2012-01-03

Dear Gordon, I very much appreciate your reply and I will follow your suggestion of not using the sum-to-zero parameterization. I do have one additional question about interpreting the model coefficients. If my design matrix is as shown below and the model is "~Strain+Strain:Treatment" is the correct interpretation of the coefficients? Coefficient Interpretation Intercept Strain1 "Strain B compared to Strain A (i.e., Which genes are differentially expressed in Strain B compared to Strain A?)" Strain2 "Strain C compared to Strain A (i.e., Which genes are differentially expressed in Strain C compared to Strain A?)" Treatment1 "Overall treatment effect (i.e., Which genes respond to Stimulated treatment overall?)" Strain1:Treatment1 "Interaction between StrainB and Treatment (i.e., Which genes respond to Stimulated treatment differently in Strain B compared to Strain A?)" Strain2:Treatment2 "Interaction between StrainC and Treatment (i.e., Which genes respond to Stimulated treatment differently in Strain C compared to Strain A?)" (Intercept) Strain1 Strain2 Treatment1 Strain1:Treatment1 Strain2:Treatment1 A.Un 1 -1 -1 -1 1 1 A.Un 1 -1 -1 -1 1 1 A.Un 1 -1 -1 -1 1 1 B.Un 1 1 0 -1 -1 0 B.Un 1 1 0 -1 -1 0 B.Un 1 1 0 -1 -1 0 B.Un 1 1 0 -1 -1 0 C.Un 1 0 0 -1 0 -1 C.Un 1 0 0 -1 0 -1 C.Un 1 0 0 -1 0 -1 C.Un 1 0 0 -1 0 -1 A.Stim 1 -1 -1 1 -1 -1 A.Stim 1 -1 -1 1 -1 -1 A.Stim 1 -1 -1 1 -1 -1 A.Stim 1 -1 -1 1 -1 -1 B.Stim 1 1 0 1 1 0 B.Stim 1 1 0 1 1 0 B.Stim 1 1 0 1 1 0 B.Stim 1 1 0 1 1 0 C.Stim 1 0 1 1 0 1 C.Stim 1 0 1 1 0 1 C.Stim 1 0 1 1 0 1 C.Stim 1 0 1 1 0 1 If this is correct, then is the interaction between strain B and treatment the fifth coefficient or c(0,0,0,0,1,0)? Likewise, is the interaction between strain C and treatment the sixth coefficient or c(0,0,0,0,0,1)? Thank you, - Ben On Jan 2, 2012, at 6:16 PM, Gordon K Smyth wrote: > Dear Ben, > > glmLRT() allows you to specify the contrast to be tested in two ways. If the contrast happens to correspond to a column of your design matrix, then it is one the original coefficients and you can just specify which one using the coef= argument. Otherwise you can use the contrast= argument to specify any contrast, using a numerical vector of the same length as the number of coefficients. > > Suppose you used the default treatment parametrization: > > contrasts(Strain) <- contr.treat(3) > contrasts(Treatment) <- contr.treat(2) > > and fitted a model with Treatment nested within Strain: > > design <- model.matrix(~Strain+Strain:Treatment) > > Then the coefficients 4:6 would be the logFC after stimulation in strains A to C respectively. Hence you could test for differential expression after stimulation in strain B by: > > glmLRT(dge,fit,coef=5) > > and so on. You could equally well specify the same test by > > glmLRT(dge,fit,contrast=c(0,0,0,0,1,0,0)) > > To test for genes responding differently to simulation in strain B compared to strain A, you would use > > contrast=c(0,0,0,-1,1,0) > > And so on. > > Figuring out the appropriate contrast vectors using the sum-to-zero parametrization specified by contr.sum() is more tedious, and I will leave that to you. > > Best wishes > Gordon > >> Date: Tue, 20 Dec 2011 16:44:42 -0500 >> From: Benjamin King <bking@mdibl.org> >> To: bioconductor@r-project.org >> Subject: [BioC] Contrasts for 3x2 factorial experiment in R/edgeR >> >> Hello, >> >> I'm using R/edgeR to analyze a 3x2 factorial RNA-Seq experiment and I would very much appreciate your help in specifying some contrasts. >> >> The design of my experiment has two factors, strain and treatment. There are three strains (A, B and C) and two treatments (Unstimulated and Stimulated). I have four biological replicates (except for one sample group where there are only three). >> >> # Group Strain Treatment >> # A.Un A Un >> # B.Un B Un >> # C.Un C Un >> # A.Stim A Stim >> # B.Stim B Stim >> # C.Stim C Stim >> >> Using "model.matrix(~Strain*Treatment)" I believe the model coefficients using sum to zero parameterization are as follows. >> >> Intercept: (A.Un + B.Un + C.Un + A.Stim + B.Stim + C.Stim)/6 >> Strain1: (-A.Un + B.Un - A.Stim + B.Stim)/4 >> Strain2: (-A.Un + C.Un - A.Stim + C.Stim)/4 >> Treatment1: (-A.Un - B.Un - C.Un + A.Stim + B.Stim + C.Stim)/6 >> Strain1:Treatment1: (A.Un - B.Un - A.Stim + B.Stim)/4 >> Strain2:Treatment2: (A.Un - C.Un - A.Stim + C.Stim)/4 >> >> I'm interested in these questions: >> >> 1. Which genes are differentially expressed in Strain B compared to Strain A? >> 2. Which genes are differentially expressed in Strain C compared to Strain A? >> 3. Which genes respond to Stimulated treatment overall? >> 4. Which genes respond to Stimulated treatment in Strain B? >> 5. Which genes respond to Stimulated treatment in Strain C? >> 6. Which genes respond to Stimulated treatment differently in Strain B compared to Strain A? >> 7. Which genes respond to Stimulated treatment differently in Strain B compared to Strain A? >> >> I believe questions 1, 2, 3, 6 and 7 are model coefficients. If so, how can these contrasts be calculated using the "coef" parameter in the glmLRT function? >> >> For questions 4 and 5, what is the correct syntax to define the contrast matrix using sum to zero parameterization that I would then pass to the glmLRT function? >> >> Below is my current script and session information. >> >> Thank you in advance for your help, >> >> - Ben >> >> >> >> library(edgeR) >> >> # 3x2 Factorial Design >> # Strain Treatment >> # A Un >> # B Un >> # C Un >> # A Stim >> # B Stim >> # C Stim >> >> ## Specify factors >> Strain <- factor(c("A","A","A","B","B","B","B","C","C","C","C","A", "A","A","A","B","B","B","B","C","C","C","C")) >> Treatment <- factor(c("Un","Un","Un","Un","Un","Un","Un","Un","Un", "Un","Un","Stim","Stim","Stim","Stim","Stim","Stim","Stim","Stim","Sti m","Stim","Stim","Stim")) >> >> ## Read in count data >> raw.data <- read.table("group_counts.txt",sep="\t",header=T) >> d <- raw.data[, 2:24] >> rownames(d) <- raw.data[, 1] >> >> # make DGE object >> dge <- DGEList(counts = d, group = c("A.Un","A.Un","A.Un","B.Un","B .Un","B.Un","B.Un","C.Un","C.Un","C.Un","C.Un","A.Stim","A.Stim","A.St im","A.Stim","B.Stim","B.Stim","B.Stim","B.Stim","C.Stim","C.Stim","C. Stim","C.Stim")) >> dge <- calcNormFactors(dge) >> >> # filter uninformative genes >> m <- 1e6 * t(t(dge$counts) / dge$samples$lib.size) >> ridx <- rowSums(m > 1) >= 2 >> dge <- dge[ridx,] >> >> ## Design matrix >> # treatment-contrast parameterization >> contrasts(Strain) <- contr.sum(3) >> contrasts(Treatment) <- contr.sum(2) >> design <- model.matrix(~Strain*Treatment) >> >> ## Estimate Dispersion >> dge <- estimateGLMCommonDisp(dge, design) >> >> ## Fit model with Common Dispersion >> fit <- glmFit(dge, design, dispersion=dge$common.dispersion) >> >> >> >> >>> sessionInfo() >> R version 2.13.1 (2011-07-08) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] splines stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] edgeR_2.2.5 >> >> loaded via a namespace (and not attached): >> [1] limma_3.8.3 tools_2.13.1 > [[alternative HTML version deleted]]