Search
Question: EdgeR GLM
0
6.5 years ago by
Shona Wood70
Shona Wood70 wrote:
Hi, I am trying to look at gene expression changes with increasing age. I have three time points young, middle age and old. I have been using the following, ordered factors and coefficient in order to get a fold change and FDR for a linear change in gene expression with increasing age. Can you look over my code and see if I have done this right? Is age.L the fold change in gene expression with increasing age? > library(edgeR) > library(limma) > targets<-read.delim (file="coding_targets.txt") > targets$age<-factor(ordered (targets$age)) > targets$sample<-factor(targets$sample) > targets X files age sample 1 p1 p1_coding_counts.txt a-young 1 2 p2 p2_coding_counts.txt a-young 2 3 p3 p3_coding_counts.txt a-young 3 4 p7 p7_coding_counts.txt b-middle 7 5 p8 p8_coding_counts.txt b-middle 8 6 p9 p9_coding_counts.txt b-middle 9 7 p4 p4_coding_counts.txt c-old 4 8 p5 p5_coding_counts.txt c-old 5 9 p6 p6_coding_counts.txt c-old 6 > d<-readDGE(targets, skip=1, comment.char='#') > colnames(d)<-row.names(targets) > d<- calcNormFactors(d) > d An object of class "DGEList" $samples X files age sample lib.size norm.factors 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077$counts 1 2 3 4 5 6 7 8 9 ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 22932 more rows ... > d<-d[rowSums(d$counts)>9,] > design<- model.matrix(~ age, data = targets) > design (Intercept) age.L age.Q 1 1 -7.071068e-01 0.4082483 2 1 -7.071068e-01 0.4082483 3 1 -7.071068e-01 0.4082483 4 1 -7.850462e-17 -0.8164966 5 1 -7.850462e-17 -0.8164966 6 1 -7.850462e-17 -0.8164966 7 1 7.071068e-01 0.4082483 8 1 7.071068e-01 0.4082483 9 1 7.071068e-01 0.4082483 attr(,"assign") [1] 0 1 1 attr(,"contrasts") attr(,"contrasts")$age [1] "contr.poly" > d<-estimateGLMCommonDisp(d, design) > names(d) [1] "samples" "counts" "common.dispersion" > glmfit<- glmFit(d, design, dispersion=d$common.dispersion) > results<- glmLRT(d, glmfit, coef=c(2,3)) > topTags(results) Coefficient: age.L age.Q logConc age.L age.Q LR P.Value ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24 Many Thanks Shona [[alternative HTML version deleted]] ADD COMMENTlink modified 6.4 years ago by Gordon Smyth33k • written 6.5 years ago by Shona Wood70 0 6.5 years ago by Gordon Smyth33k Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia Gordon Smyth33k wrote: Dear Shona, Yes, age.L is the log2-fold-change for increasing age. However your topTags table is testing for the linear and quadratic terms together, so it is testing the significance of any differences between the three age groups. I'd suggest you test specifically whether the quadratic term is required using glmLRT(d, glmfit, coef=3). I'd also generally recommend you go on to estimate the trended and tagwise dispersions, not just the common dispersion. Best wishes Gordon ADD COMMENTlink modified 3.6 years ago • written 6.5 years ago by Gordon Smyth33k Dear Gordon, Thank you for getting back to me. I am not a statistician so forgive me if I am asking basic questions. I was just reading about quadratic equations. Does a significant hit in the quadratic term mean that the rate of change first increases and then decreases over time (or visa versa)? Therefore hits which require the quadratic term are not a linear up or down with age they are only "linear" after middle age? If this is the case what does the logFC for age.Q represent, is it the fold change between middle age and old? And how is it calculated? Does the gene expression at young age affect the quadratic calculation? For example if the expression at young age is 2 and then 4 at middle and then 1.99 at old would the fact that old age is basically back to the young expression level stop it being a significant quadratic result or would it be significant because of the change from middle age. I have done the quadratic term separately as suggested but I also did the linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in either of the separate quadratic or linear tests - why would this be? Finally if i want to report changes with increasing age would it be best to present the results from the quadratic and linear tests separately or would it be better to present the combined test (glmLRT(d, glmfit, coef=2, 3)) and point out which ones rely on the quadratic term? Many thanks Shona ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 10 December 2011 07:10 To: Wood, Shona Cc: Bioconductor mailing list Subject: EdgeR GLM Dear Shona, Yes, age.L is the log2-fold-change for increasing age. However your topTags table is testing for the linear and quadratic terms together, so it is testing the significance of any differences between the three age groups. I'd suggest you test whether the quadratic term is required using glmLRT(d, glmfit, coef=3) I'd also generally recommend you go on to estimate the trended and tagwise dispersions, not just the common dispersion. Best wishes Gordon > Date: Thu, 8 Dec 2011 17:41:07 +0000 > From: "Wood, Shona" <shona.wood at="" liverpool.ac.uk=""> > To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Subject: [BioC] EdgeR GLM > > Hi, > > I am trying to look at gene expression changes with increasing age. I > have three time points young, middle age and old. I have been using the > following, ordered factors and coefficient in order to get a fold change > and FDR for a linear change in gene expression with increasing age. Can > you look over my code and see if I have done this right? Is age.L the > fold change in gene expression with increasing age? > >> library(edgeR) >> library(limma) >> targets<-read.delim (file="coding_targets.txt") >> targets$age<-factor(ordered (targets$age)) >> targets$sample<-factor(targets$sample) >> targets > X files age sample > 1 p1 p1_coding_counts.txt a-young 1 > 2 p2 p2_coding_counts.txt a-young 2 > 3 p3 p3_coding_counts.txt a-young 3 > 4 p7 p7_coding_counts.txt b-middle 7 > 5 p8 p8_coding_counts.txt b-middle 8 > 6 p9 p9_coding_counts.txt b-middle 9 > 7 p4 p4_coding_counts.txt c-old 4 > 8 p5 p5_coding_counts.txt c-old 5 > 9 p6 p6_coding_counts.txt c-old 6 > >> d<-readDGE(targets, skip=1, comment.char='#') >> colnames(d)<-row.names(targets) >> d<- calcNormFactors(d) >> d > An object of class "DGEList" >$samples > X files age sample lib.size norm.factors > 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 > 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 > 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 > 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 > 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 > 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 > 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 > 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 > 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077 > > $counts > 1 2 3 4 5 6 7 8 9 > ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 > ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 > ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 > ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 > ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 > 22932 more rows ... > >> d<-d[rowSums(d$counts)>9,] >> design<- model.matrix(~ age, data = targets) >> design > (Intercept) age.L age.Q > 1 1 -7.071068e-01 0.4082483 > 2 1 -7.071068e-01 0.4082483 > 3 1 -7.071068e-01 0.4082483 > 4 1 -7.850462e-17 -0.8164966 > 5 1 -7.850462e-17 -0.8164966 > 6 1 -7.850462e-17 -0.8164966 > 7 1 7.071068e-01 0.4082483 > 8 1 7.071068e-01 0.4082483 > 9 1 7.071068e-01 0.4082483 > attr(,"assign") > [1] 0 1 1 > attr(,"contrasts") > attr(,"contrasts")$age > [1] "contr.poly" > >> d<-estimateGLMCommonDisp(d, design) >> names(d) > [1] "samples" "counts" "common.dispersion" >> glmfit<- glmFit(d, design, dispersion=d$common.dispersion) >> results<- glmLRT(d, glmfit, coef=c(2,3)) >> topTags(results) > Coefficient: age.L age.Q > logConc age.L age.Q LR P.Value > ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24 > > Many Thanks > > Shona ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
Dear Gordon, Thank you for getting back to me. I am not a statistician so forgive me if I am asking basic questions. I was just reading about quadratic equations. Does a significant hit in the quadratic term mean that the rate of change first increases and then decreases over time (or visa versa)? Therefore hits which require the quadratic term are not a linear up or down with age they are only "linear" after middle age? If this is the case what does the logFC for age.Q represent, is it the fold change between middle age and old? And how is it calculated? Does the gene expression at young age affect the quadratic calculation? For example if the expression at young age is 2 and then 4 at middle and then 1.99 at old would the fact that old age is basically back to the young expression level stop it being a significant quadratic result or would it be significant because of the change from middle age. I have done the quadratic term separately as suggested but I also did the linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in either of the separate quadratic or linear tests - why would this be? Finally if i want to report changes with increasing age would it be best to present the results from the quadratic and linear tests separately or would it be better to present the combined test (glmLRT(d, glmfit, coef=2, 3)) and point out which ones rely on the quadratic term? Many thanks Shona p.s. apologies if this comes through twice there seems to be a problem with my membership to the mailing list ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 10 December 2011 07:10 To: Wood, Shona Cc: Bioconductor mailing list Subject: EdgeR GLM Dear Shona, Yes, age.L is the log2-fold-change for increasing age. However your topTags table is testing for the linear and quadratic terms together, so it is testing the significance of any differences between the three age groups. I'd suggest you test whether the quadratic term is required using glmLRT(d, glmfit, coef=3) I'd also generally recommend you go on to estimate the trended and tagwise dispersions, not just the common dispersion. Best wishes Gordon > Date: Thu, 8 Dec 2011 17:41:07 +0000 > From: "Wood, Shona" <shona.wood at="" liverpool.ac.uk=""> > To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Subject: [BioC] EdgeR GLM > > Hi, > > I am trying to look at gene expression changes with increasing age. I > have three time points young, middle age and old. I have been using the > following, ordered factors and coefficient in order to get a fold change > and FDR for a linear change in gene expression with increasing age. Can > you look over my code and see if I have done this right? Is age.L the > fold change in gene expression with increasing age? > >> library(edgeR) >> library(limma) >> targets<-read.delim (file="coding_targets.txt") >> targets$age<-factor(ordered (targets$age)) >> targets$sample<-factor(targets$sample) >> targets > X files age sample > 1 p1 p1_coding_counts.txt a-young 1 > 2 p2 p2_coding_counts.txt a-young 2 > 3 p3 p3_coding_counts.txt a-young 3 > 4 p7 p7_coding_counts.txt b-middle 7 > 5 p8 p8_coding_counts.txt b-middle 8 > 6 p9 p9_coding_counts.txt b-middle 9 > 7 p4 p4_coding_counts.txt c-old 4 > 8 p5 p5_coding_counts.txt c-old 5 > 9 p6 p6_coding_counts.txt c-old 6 > >> d<-readDGE(targets, skip=1, comment.char='#') >> colnames(d)<-row.names(targets) >> d<- calcNormFactors(d) >> d > An object of class "DGEList" > $samples > X files age sample lib.size norm.factors > 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 > 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 > 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 > 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 > 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 > 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 > 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 > 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 > 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077 > >$counts > 1 2 3 4 5 6 7 8 9 > ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 > ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 > ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 > ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 > ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 > 22932 more rows ... > >> d<-d[rowSums(d$counts)>9,] >> design<- model.matrix(~ age, data = targets) >> design > (Intercept) age.L age.Q > 1 1 -7.071068e-01 0.4082483 > 2 1 -7.071068e-01 0.4082483 > 3 1 -7.071068e-01 0.4082483 > 4 1 -7.850462e-17 -0.8164966 > 5 1 -7.850462e-17 -0.8164966 > 6 1 -7.850462e-17 -0.8164966 > 7 1 7.071068e-01 0.4082483 > 8 1 7.071068e-01 0.4082483 > 9 1 7.071068e-01 0.4082483 > attr(,"assign") > [1] 0 1 1 > attr(,"contrasts") > attr(,"contrasts")$age > [1] "contr.poly" > >> d<-estimateGLMCommonDisp(d, design) >> names(d) > [1] "samples" "counts" "common.dispersion" >> glmfit<- glmFit(d, design, dispersion=d$common.dispersion) >> results<- glmLRT(d, glmfit, coef=c(2,3)) >> topTags(results) > Coefficient: age.L age.Q > logConc age.L age.Q LR P.Value > ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24 > > Many Thanks > > Shona ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}} ADD REPLYlink written 6.4 years ago by Shona Wood70 Dear Mark, Gordon, and fellow BioConductoRs: First, thank you for numerous contributions to the BioConductor project, and more broadly, to the statistics community. I was curious if I should expect the same result from the two calls below. >From looking at the contents of estimateGLMCommonDisp.default, I think so, but I may be missing something with the default values that are passed into dispCoxReid from estimateGLMCommonDisp. dge.ctl.filt <- estimateGLMCommonDisp(y=dge.ctl.filt, design=ctl.design,method="CoxReid") dge.ctl.filt$common.dispersion [1] 0.02058129 names(dge.ctl.filt) [1] "samples" "counts" "all.zeros" "common.dispersion" dispCoxReid(y=dge.ctl.filt, design=ctl.design) [1] 0.2124361 My design matrix is nearly identical to the vignette example on Tuch's data (paired), but with one more subject. ctl.design int id.2172 id.2186 id.2234 id.2244 Trt [1,] 1 1 0 0 0 0 [2,] 1 1 0 0 0 1 [3,] 1 0 0 0 1 0 [4,] 1 0 0 0 1 1 [5,] 1 0 0 1 0 0 [6,] 1 0 0 1 0 1 [7,] 1 0 1 0 0 0 [8,] 1 0 1 0 0 1 [9,] 1 0 0 0 0 0 [10,] 1 0 0 0 0 1 attr(,"assign") [1] 0 1 1 1 1 2 attr(,"contrasts") attr(,"contrasts")$factor(cowID.ctl) [1] "contr.treatment" attr(,"contrasts")$dge.ctl.filt$samples$group [1] "contr.treatment" Thanks for your attention. My session info is below. Wade > sessionInfo() R version 2.14.0 (2011-10-31) Platform: x86_64-pc-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.10.0 edgeR_2.4.1
Dear Mark and Gordon: I have a few questions and suggestions for the plotMeanVar function (which I think is very handy) in edgeR. I receive the following message when trying to use dispersion.method="coxreid" in plotMeanVar: plotMeanVar(dge.ctl.filt, meanvar=ctl.filt.meanvar, show.tagwise.vars=TRUE, NBline=TRUE, dispersion.method="coxreid") Error in plotMeanVar(dge.ctl.filt, meanvar = ctl.filt.meanvar, show.tagwise.vars = TRUE, : Could not extract Cox-Reid common dispersion. Try running CRDisp on the DGEList object before plotMeanVar. I had run estimateGLMCommonDisp/estimateGLMTrendedDisp/estimateGLMTagwiseDisp prior to this call, knowing that they have superseded CRDisp / estimateCRDisp / etc. names(dge.ctl.filt) [1] "samples" "counts" "all.zeros" "common.dispersion" "trended.dispersion" "abundance" [7] "bin.dispersion" "bin.abundance" "tagwise.dispersion" NOTE: Everything runs without complaint with: plotMeanVar(dge.ctl.filt, meanvar=ctl.filt.meanvar, show.tagwise.vars=TRUE, NBline=TRUE, dispersion.method="qcml") Looking into the plotMeanVar source, it seems that the function is looking for elements object$CR.common.dispersion or object$CR.tagwise.dispersion. Snippet from plotMeanVar: ------------------------------------------- if (NBline | show.tagwise.vars) { if (dispersion.method == "coxreid") { common.dispersion <- object$CR.common.dispersion tagwise.dispersion <- object$CR.tagwise.dispersion } else { common.dispersion <- object$common.dispersion tagwise.dispersion <- object$tagwise.dispersion } if (is.null(common.dispersion)) { if (dispersion.method == "coxreid") stop("Could not extract Cox-Reid common dispersion. Try running CRDisp on the DGEList object before plotMeanVar.\n") else stop("Could not extract qCML common dispersion. Try running estimateCommonDisp on the DGEList object before plotMeanVar.\n") } } ------------------------------------------- It is my understanding that the names CR.* are not being used anymore for DGEList objects. I personally like the CR.* convention because as it stands now, I can't examine an DGEList object and tell by what dispersion method object$common.dispersion or estimateTagwiseDisp was obtained (i.e., estimateCommonDisp or estimateGLMCommonDisp). So would you politely consider modifying the DGEList-class to address this? Or I have I misdiagnosed the problem entirely? Thanks for your insight on the matter and your time. My session info is below. Wade > sessionInfo() R version 2.14.0 (2011-10-31) Platform: x86_64-pc-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.10.0 edgeR_2.4.1 ADD REPLYlink written 6.4 years ago by Davis, Wade340 Dear Wade, You are right on both fronts. plotMeanVar() needs to be updated to reflect the changes in the dispersion estimation functions. Thanks for the heads-up on this. Actually, I think plotMeanVar() will correctly use whatever dispersion estimates you have previously estimated, so its dispersion.method argument is actually redundant -- you can just leave it at the default value. I will confer with Davis McCarthy, the author of the function, to come to a complete solution. I agree with you that it would be desirable for the data object to 'remember' which dispersion estimation method was used. We have been thinking along the same lines, and plan to introduce a component called "dispersion.method" or similar into the data object. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Tue, 13 Dec 2011, Davis, Wade wrote: > Dear Mark and Gordon: > > I have a few questions and suggestions for the plotMeanVar function > (which I think is very handy) in edgeR. > > I receive the following message when trying to use > dispersion.method="coxreid" in plotMeanVar: > > plotMeanVar(dge.ctl.filt, meanvar=ctl.filt.meanvar, show.tagwise.vars=TRUE, NBline=TRUE, dispersion.method="coxreid") > Error in plotMeanVar(dge.ctl.filt, meanvar = ctl.filt.meanvar, show.tagwise.vars = TRUE, : > Could not extract Cox-Reid common dispersion. Try running CRDisp on the DGEList object before plotMeanVar. > > I had run > estimateGLMCommonDisp/estimateGLMTrendedDisp/estimateGLMTagwiseDisp > prior to this call, knowing that they have superseded CRDisp / > estimateCRDisp / etc. > > names(dge.ctl.filt) > [1] "samples" "counts" "all.zeros" "common.dispersion" "trended.dispersion" "abundance" > [7] "bin.dispersion" "bin.abundance" "tagwise.dispersion" > > > NOTE: Everything runs without complaint with: > plotMeanVar(dge.ctl.filt, meanvar=ctl.filt.meanvar, show.tagwise.vars=TRUE, NBline=TRUE, dispersion.method="qcml") > > > Looking into the plotMeanVar source, it seems that the function is looking for elements object$CR.common.dispersion or object$CR.tagwise.dispersion. > > > Snippet from plotMeanVar: > ------------------------------------------- > if (NBline | show.tagwise.vars) { > if (dispersion.method == "coxreid") { > common.dispersion <- object$CR.common.dispersion > tagwise.dispersion <- object$CR.tagwise.dispersion > } > else { > common.dispersion <- object$common.dispersion > tagwise.dispersion <- object$tagwise.dispersion > } > if (is.null(common.dispersion)) { > if (dispersion.method == "coxreid") > stop("Could not extract Cox-Reid common dispersion. Try running CRDisp on the DGEList object before plotMeanVar.\n") > else stop("Could not extract qCML common dispersion. Try running estimateCommonDisp on the DGEList object before plotMeanVar.\n") > } > } > ------------------------------------------- > > It is my understanding that the names CR.* are not being used anymore for DGEList objects. I personally like the CR.* convention because as it stands now, I can't examine an DGEList object and tell by what dispersion method object$common.dispersion or estimateTagwiseDisp was obtained (i.e., estimateCommonDisp or estimateGLMCommonDisp). > > So would you politely consider modifying the DGEList-class to address this? Or I have I misdiagnosed the problem entirely? > > Thanks for your insight on the matter and your time. > > My session info is below. > Wade > > >> sessionInfo() > R version 2.14.0 (2011-10-31) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] limma_3.10.0 edgeR_2.4.1 > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
Hi Gordon and others I've just committed changes to Bioc-devel fixing the problem that Wade described. Essentially I just removed the "dispersion.method" (redundant now, I agree) and altered the rest of the function to adapt to that. I haven't used plotMeanVar() for a long time, which explains why we missed the obvious problems of using it with the new dispersion estimation methods. But glad to hear that it's useful for people and it should work properly now (I did a small amount of testing). I also updated the documentation to reflect the changes and bumped the version number. Let me know if there are any more issues and I'll get on it. I also agree that it would be good for the data object to 'remember' which dispersion method was used. Best wishes Davis On 13 December 2011 22:44, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Wade, > > You are right on both fronts. ?plotMeanVar() needs to be updated to reflect > the changes in the dispersion estimation functions. ?Thanks for the heads-up > on this. ?Actually, I think plotMeanVar() will correctly use whatever > dispersion estimates you have previously estimated, so its > dispersion.method argument is actually redundant -- you can just leave it at > the default value. ?I will confer with Davis McCarthy, the author of the > function, to come to a complete solution. > > I agree with you that it would be desirable for the data object to > 'remember' which dispersion estimation method was used. ?We have been > thinking along the same lines, and plan to introduce a component called > "dispersion.method" or similar into the data object. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Tel: (03) 9345 2326, Fax (03) 9347 0852, > smyth at wehi.edu.au > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Tue, 13 Dec 2011, Davis, Wade wrote: > >> Dear Mark and Gordon: >> >> I have a few questions and suggestions for the plotMeanVar function (which >> I think is very handy) in edgeR. >> >> I receive the following message when trying to use >> dispersion.method="coxreid" in plotMeanVar: >> >> plotMeanVar(dge.ctl.filt, meanvar=ctl.filt.meanvar, >> show.tagwise.vars=TRUE, NBline=TRUE, dispersion.method="coxreid") >> Error in plotMeanVar(dge.ctl.filt, meanvar = ctl.filt.meanvar, >> show.tagwise.vars = TRUE, ?: >> ?Could not extract Cox-Reid common dispersion. Try running CRDisp on the >> DGEList object before plotMeanVar. >> >> I had run >> estimateGLMCommonDisp/estimateGLMTrendedDisp/estimateGLMTagwiseDisp prior to >> this call, knowing that they have superseded CRDisp / estimateCRDisp / etc. >> >> names(dge.ctl.filt) >> [1] "samples" ? ? ? ? ? ?"counts" ? ? ? ? ? ? "all.zeros" >> ?"common.dispersion" ?"trended.dispersion" "abundance" >> [7] "bin.dispersion" ? ? "bin.abundance" ? ? ?"tagwise.dispersion" >> >> >> NOTE: Everything runs without complaint with: >> plotMeanVar(dge.ctl.filt, meanvar=ctl.filt.meanvar, >> show.tagwise.vars=TRUE, NBline=TRUE, dispersion.method="qcml") >> >> >> Looking into the plotMeanVar source, it seems that the function is looking >> for elements object$CR.common.dispersion or object$CR.tagwise.dispersion. >> >> >> Snippet from plotMeanVar: >> ------------------------------------------- >> ?if (NBline | show.tagwise.vars) { >> ? ? ? if (dispersion.method == "coxreid") { >> ? ? ? ? ? common.dispersion <- object$CR.common.dispersion >> ? ? ? ? ? tagwise.dispersion <- object$CR.tagwise.dispersion >> ? ? ? } >> ? ? ? else { >> ? ? ? ? ? common.dispersion <- object$common.dispersion >> ? ? ? ? ? tagwise.dispersion <- object$tagwise.dispersion >> ? ? ? } >> ? ? ? if (is.null(common.dispersion)) { >> ? ? ? ? ? if (dispersion.method == "coxreid") >> ? ? ? ? ? ? ? stop("Could not extract Cox-Reid common dispersion. Try >> running CRDisp on the DGEList object before plotMeanVar.\n") >> ? ? ? ? ? else stop("Could not extract qCML common dispersion. Try running >> estimateCommonDisp on the DGEList object before plotMeanVar.\n") >> ? ? ? } >> ? } >> ------------------------------------------- >> >> It is my understanding that the names CR.* are not being used anymore for >> DGEList objects. I personally like the CR.* convention because as it stands >> now, I can't examine an DGEList object and tell by what dispersion method >> object$common.dispersion or estimateTagwiseDisp was obtained (i.e., >> estimateCommonDisp or estimateGLMCommonDisp). >> >> So would you politely consider modifying the DGEList-class to address >> this? Or I have I misdiagnosed the problem entirely? >> >> Thanks for your insight on the matter and your time. >> >> My session info is below. >> Wade >> >> >>> sessionInfo() >> >> R version 2.14.0 (2011-10-31) >> Platform: x86_64-pc-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 ?LC_CTYPE=English_United >> States.1252 ? ?LC_MONETARY=English_United States.1252 >> [4] LC_NUMERIC=C ? ? ? ? ? ? ? ? ? ? ? ? ? LC_TIME=English_United >> States.1252 >> >> attached base packages: >> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >> >> other attached packages: >> [1] limma_3.10.0 edgeR_2.4.1 >> > > ______________________________________________________________________ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________________________________________________ -- =================================================== Davis McCarthy ?| ?PRS in Statistics Dept of Statistics & Wellcome Trust Centre for Human Genetics University of Oxford E: davis.mccarthy at balliol.ox.ac.uk W: sites.google.com/site/davismcc P: +44 7847 470810 =================================================== ADD REPLYlink written 6.4 years ago by Davis McCarthy10 Dear Davis, The dispCoxReid() function is a low-level function in edgeR that works directly on matrices rather than DGEList objects, and it's not intended to be called directly by users. You get different results from dispCoxReid() because you have not passed to it the offsets based on library sizes and normalization. estimateGLMCommonDisp() extracts this information automatically from the object, but the low-level functions don't do this. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Tue, 13 Dec 2011, Davis, Wade wrote: > Dear Mark, Gordon, and fellow BioConductoRs: > > First, thank you for numerous contributions to the BioConductor project, > and more broadly, to the statistics community. > > I was curious if I should expect the same result from the two calls > below. From looking at the contents of estimateGLMCommonDisp.default, I > think so, but I may be missing something with the default values that > are passed into dispCoxReid from estimateGLMCommonDisp. > > dge.ctl.filt <- estimateGLMCommonDisp(y=dge.ctl.filt, design=ctl.design,method="CoxReid") dge.ctl.filt$common.dispersion > [1] 0.02058129 > > names(dge.ctl.filt) > [1] "samples" "counts" "all.zeros" "common.dispersion" > > dispCoxReid(y=dge.ctl.filt, design=ctl.design) > [1] 0.2124361 > > My design matrix is nearly identical to the vignette example on Tuch's > data (paired), but with one more subject. > > > ctl.design > int id.2172 id.2186 id.2234 id.2244 Trt > [1,] 1 1 0 0 0 0 > [2,] 1 1 0 0 0 1 > [3,] 1 0 0 0 1 0 > [4,] 1 0 0 0 1 1 > [5,] 1 0 0 1 0 0 > [6,] 1 0 0 1 0 1 > [7,] 1 0 1 0 0 0 > [8,] 1 0 1 0 0 1 > [9,] 1 0 0 0 0 0 > [10,] 1 0 0 0 0 1 > attr(,"assign") > [1] 0 1 1 1 1 2 > attr(,"contrasts") > attr(,"contrasts")$factor(cowID.ctl) > [1] "contr.treatment" > > attr(,"contrasts")$dge.ctl.filt$samples$group > [1] "contr.treatment" > > Thanks for your attention. My session info is below. > Wade > > >> sessionInfo() > R version 2.14.0 (2011-10-31) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] limma_3.10.0 edgeR_2.4.1 > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
Sorry, that should be "Dear Wade". One of the authors of the estimateGLMCommonDisp() is Davis McCarthy, so I've got used to "Davis" being a first name. Best wishes Gordon On Wed, 14 Dec 2011, Gordon K Smyth wrote: > Dear Davis, > > The dispCoxReid() function is a low-level function in edgeR that works > directly on matrices rather than DGEList objects, and it's not intended to be > called directly by users. > > You get different results from dispCoxReid() because you have not passed to > it the offsets based on library sizes and normalization. > estimateGLMCommonDisp() extracts this information automatically from the > object, but the low-level functions don't do this. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Tel: (03) 9345 2326, Fax (03) 9347 0852, > smyth at wehi.edu.au > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Tue, 13 Dec 2011, Davis, Wade wrote: > >> Dear Mark, Gordon, and fellow BioConductoRs: >> >> First, thank you for numerous contributions to the BioConductor project, >> and more broadly, to the statistics community. >> >> I was curious if I should expect the same result from the two calls below. >> From looking at the contents of estimateGLMCommonDisp.default, I think so, >> but I may be missing something with the default values that are passed into >> dispCoxReid from estimateGLMCommonDisp. >> >> dge.ctl.filt <- estimateGLMCommonDisp(y=dge.ctl.filt, >> design=ctl.design,method="CoxReid") dge.ctl.filt$common.dispersion >> [1] 0.02058129 >> >> names(dge.ctl.filt) >> [1] "samples" "counts" "all.zeros" >> "common.dispersion" >> >> dispCoxReid(y=dge.ctl.filt, design=ctl.design) >> [1] 0.2124361 >> >> My design matrix is nearly identical to the vignette example on Tuch's data >> (paired), but with one more subject. >> >> >> ctl.design >> int id.2172 id.2186 id.2234 id.2244 Trt >> [1,] 1 1 0 0 0 0 >> [2,] 1 1 0 0 0 1 >> [3,] 1 0 0 0 1 0 >> [4,] 1 0 0 0 1 1 >> [5,] 1 0 0 1 0 0 >> [6,] 1 0 0 1 0 1 >> [7,] 1 0 1 0 0 0 >> [8,] 1 0 1 0 0 1 >> [9,] 1 0 0 0 0 0 >> [10,] 1 0 0 0 0 1 >> attr(,"assign") >> [1] 0 1 1 1 1 2 >> attr(,"contrasts") >> attr(,"contrasts")$factor(cowID.ctl) >> [1] "contr.treatment" >> >> attr(,"contrasts")$dge.ctl.filt$samples$group >> [1] "contr.treatment" >> >> Thanks for your attention. My session info is below. >> Wade >> >> >>> sessionInfo() >> R version 2.14.0 (2011-10-31) >> Platform: x86_64-pc-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >> States.1252 LC_MONETARY=English_United States.1252 >> [4] LC_NUMERIC=C LC_TIME=English_United >> States.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] limma_3.10.0 edgeR_2.4.1 >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} ADD REPLYlink written 6.4 years ago by Gordon Smyth33k 0 6.4 years ago by Gordon Smyth33k Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia Gordon Smyth33k wrote: Dear Shona, I can only give you brief pointers. There are only three time points in your data. The linear coefficient in the model simply compares old with young, without regard to middle. The quadratic term compares middle with the average of young and old. So if expression goes 2-4-2 for a particular gene over the three times then you get a positive quadratic coefficient and zero linear coefficient. It is perfectly possible for the likelihood ratio test of both linear and quadratic together: glmLRT(d, glmfit, coef=2:3) to detect some genes not in the separate linear or quadratic lists, because the linear or quadratic terms might be not quite significant separately, but might be so when considered together. For example, the expression might go 2-4-3. Is this linear or quadratic? Not quite sure. But we might be confident that the three times are not all equal. If you only want to report genes that have increasing expression with age, then I would have thought you want to fit the linear term only, since the quadratic effect can never be guaranteed to be increasing. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Tue, 13 Dec 2011, Wood, Shona wrote: Dear Gordon, Thank you for getting back to me. I am not a statistician so forgive me if I am asking basic questions. I was just reading about quadratic equations. Does a significant hit in the quadratic term mean that the rate of change first increases and then decreases over time (or visa versa)? Therefore hits which require the quadratic term are not a linear up or down with age they are only "linear" after middle age? If this is the case what does the logFC for age.Q represent, is it the fold change between middle age and old? And how is it calculated? Does the gene expression at young age affect the quadratic calculation? For example if the expression at young age is 2 and then 4 at middle and then 1.99 at old would the fact that old age is basically back to the young expression level stop it being a significant quadratic result or would it be significant because of the change from middle age. I have done the quadratic term separately as suggested but I also did the linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in either of the separate quadratic or linear tests - why would this be? Finally if i want to report changes with increasing age would it be best to present the results from the quadratic and linear tests separately or would it be better to present the combined test (glmLRT(d, glmfit, coef=2, 3)) and point out which ones rely on the quadratic term? Many thanks Shona ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 10 December 2011 07:10 To: Wood, Shona Cc: Bioconductor mailing list Subject: EdgeR GLM Dear Shona, Yes, age.L is the log2-fold-change for increasing age. However your topTags table is testing for the linear and quadratic terms together, so it is testing the significance of any differences between the three age groups. I'd suggest you test whether the quadratic term is required using glmLRT(d, glmfit, coef=3) I'd also generally recommend you go on to estimate the trended and tagwise dispersions, not just the common dispersion. Best wishes Gordon > Date: Thu, 8 Dec 2011 17:41:07 +0000 > From: "Wood, Shona" <shona.wood at="" liverpool.ac.uk=""> > To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Subject: [BioC] EdgeR GLM > > Hi, > > I am trying to look at gene expression changes with increasing age. I > have three time points young, middle age and old. I have been using the > following, ordered factors and coefficient in order to get a fold change > and FDR for a linear change in gene expression with increasing age. Can > you look over my code and see if I have done this right? Is age.L the > fold change in gene expression with increasing age? > >> library(edgeR) >> library(limma) >> targets<-read.delim (file="coding_targets.txt") >> targets$age<-factor(ordered (targets$age)) >> targets$sample<-factor(targets$sample) >> targets > X files age sample > 1 p1 p1_coding_counts.txt a-young 1 > 2 p2 p2_coding_counts.txt a-young 2 > 3 p3 p3_coding_counts.txt a-young 3 > 4 p7 p7_coding_counts.txt b-middle 7 > 5 p8 p8_coding_counts.txt b-middle 8 > 6 p9 p9_coding_counts.txt b-middle 9 > 7 p4 p4_coding_counts.txt c-old 4 > 8 p5 p5_coding_counts.txt c-old 5 > 9 p6 p6_coding_counts.txt c-old 6 > >> d<-readDGE(targets, skip=1, comment.char='#') >> colnames(d)<-row.names(targets) >> d<- calcNormFactors(d) >> d > An object of class "DGEList" >$samples > X files age sample lib.size norm.factors > 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 > 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 > 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 > 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 > 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 > 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 > 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 > 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 > 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077 > > $counts > 1 2 3 4 5 6 7 8 9 > ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 > ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 > ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 > ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 > ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 > 22932 more rows ... > >> d<-d[rowSums(d$counts)>9,] >> design<- model.matrix(~ age, data = targets) >> design > (Intercept) age.L age.Q > 1 1 -7.071068e-01 0.4082483 > 2 1 -7.071068e-01 0.4082483 > 3 1 -7.071068e-01 0.4082483 > 4 1 -7.850462e-17 -0.8164966 > 5 1 -7.850462e-17 -0.8164966 > 6 1 -7.850462e-17 -0.8164966 > 7 1 7.071068e-01 0.4082483 > 8 1 7.071068e-01 0.4082483 > 9 1 7.071068e-01 0.4082483 > attr(,"assign") > [1] 0 1 1 > attr(,"contrasts") > attr(,"contrasts")$age > [1] "contr.poly" > >> d<-estimateGLMCommonDisp(d, design) >> names(d) > [1] "samples" "counts" "common.dispersion" >> glmfit<- glmFit(d, design, dispersion=d$common.dispersion) >> results<- glmLRT(d, glmfit, coef=c(2,3)) >> topTags(results) > Coefficient: age.L age.Q > logConc age.L age.Q LR P.Value > ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24 > > Many Thanks > > Shona ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
Dear Gordon, Thank you for getting back to me and giving me those explanations. The only thing that is still confusing me is the linear coefficient. If it compares young and old without regard to middle isn't it basically the same as a pairwise comparison? Also it seems strange that it only uses the two extremities - I thought linear regression should use all the data points. For example if we had 50 different ages would 48 data points be excluded to compare the first and last only? Thanks Shona ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 13 December 2011 23:18 To: Wood, Shona Cc: Bioconductor mailing list Subject: RE: EdgeR GLM Dear Shona, I can only give you brief pointers. There are only three time points in your data. The linear coefficient in the model simply compares old with young, without regard to middle. The quadratic term compares middle with the average of young and old. So if expression goes 2-4-2 for a particular gene over the three times then you get a positive quadratic coefficient and zero linear coefficient. It is perfectly possible for the likelihood ratio test of both linear and quadratic together: glmLRT(d, glmfit, coef=2:3) to detect some genes not in the separate linear or quadratic lists, because the linear or quadratic terms might be not quite significant separately, but might be so when considered together. For example, the expression might go 2-4-3. Is this linear or quadratic? Not quite sure. But we might be confident that the three times are not all equal. If you only want to report genes that have increasing expression with age, then I would have thought you want to fit the linear term only, since the quadratic effect can never be guaranteed to be increasing. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Tue, 13 Dec 2011, Wood, Shona wrote: Dear Gordon, Thank you for getting back to me. I am not a statistician so forgive me if I am asking basic questions. I was just reading about quadratic equations. Does a significant hit in the quadratic term mean that the rate of change first increases and then decreases over time (or visa versa)? Therefore hits which require the quadratic term are not a linear up or down with age they are only "linear" after middle age? If this is the case what does the logFC for age.Q represent, is it the fold change between middle age and old? And how is it calculated? Does the gene expression at young age affect the quadratic calculation? For example if the expression at young age is 2 and then 4 at middle and then 1.99 at old would the fact that old age is basically back to the young expression level stop it being a significant quadratic result or would it be significant because of the change from middle age. I have done the quadratic term separately as suggested but I also did the linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in either of the separate quadratic or linear tests - why would this be? Finally if i want to report changes with increasing age would it be best to present the results from the quadratic and linear tests separately or would it be better to present the combined test (glmLRT(d, glmfit, coef=2, 3)) and point out which ones rely on the quadratic term? Many thanks Shona ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 10 December 2011 07:10 To: Wood, Shona Cc: Bioconductor mailing list Subject: EdgeR GLM Dear Shona, Yes, age.L is the log2-fold-change for increasing age. However your topTags table is testing for the linear and quadratic terms together, so it is testing the significance of any differences between the three age groups. I'd suggest you test whether the quadratic term is required using glmLRT(d, glmfit, coef=3) I'd also generally recommend you go on to estimate the trended and tagwise dispersions, not just the common dispersion. Best wishes Gordon > Date: Thu, 8 Dec 2011 17:41:07 +0000 > From: "Wood, Shona" <shona.wood at="" liverpool.ac.uk=""> > To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Subject: [BioC] EdgeR GLM > > Hi, > > I am trying to look at gene expression changes with increasing age. I > have three time points young, middle age and old. I have been using the > following, ordered factors and coefficient in order to get a fold change > and FDR for a linear change in gene expression with increasing age. Can > you look over my code and see if I have done this right? Is age.L the > fold change in gene expression with increasing age? > >> library(edgeR) >> library(limma) >> targets<-read.delim (file="coding_targets.txt") >> targets$age<-factor(ordered (targets$age)) >> targets$sample<-factor(targets$sample) >> targets > X files age sample > 1 p1 p1_coding_counts.txt a-young 1 > 2 p2 p2_coding_counts.txt a-young 2 > 3 p3 p3_coding_counts.txt a-young 3 > 4 p7 p7_coding_counts.txt b-middle 7 > 5 p8 p8_coding_counts.txt b-middle 8 > 6 p9 p9_coding_counts.txt b-middle 9 > 7 p4 p4_coding_counts.txt c-old 4 > 8 p5 p5_coding_counts.txt c-old 5 > 9 p6 p6_coding_counts.txt c-old 6 > >> d<-readDGE(targets, skip=1, comment.char='#') >> colnames(d)<-row.names(targets) >> d<- calcNormFactors(d) >> d > An object of class "DGEList" > $samples > X files age sample lib.size norm.factors > 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 > 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 > 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 > 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 > 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 > 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 > 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 > 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 > 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077 > >$counts > 1 2 3 4 5 6 7 8 9 > ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 > ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 > ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 > ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 > ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 > 22932 more rows ... > >> d<-d[rowSums(d$counts)>9,] >> design<- model.matrix(~ age, data = targets) >> design > (Intercept) age.L age.Q > 1 1 -7.071068e-01 0.4082483 > 2 1 -7.071068e-01 0.4082483 > 3 1 -7.071068e-01 0.4082483 > 4 1 -7.850462e-17 -0.8164966 > 5 1 -7.850462e-17 -0.8164966 > 6 1 -7.850462e-17 -0.8164966 > 7 1 7.071068e-01 0.4082483 > 8 1 7.071068e-01 0.4082483 > 9 1 7.071068e-01 0.4082483 > attr(,"assign") > [1] 0 1 1 > attr(,"contrasts") > attr(,"contrasts")$age > [1] "contr.poly" > >> d<-estimateGLMCommonDisp(d, design) >> names(d) > [1] "samples" "counts" "common.dispersion" >> glmfit<- glmFit(d, design, dispersion=d$common.dispersion) >> results<- glmLRT(d, glmfit, coef=c(2,3)) >> topTags(results) > Coefficient: age.L age.Q > logConc age.L age.Q LR P.Value > ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24 > > Many Thanks > > Shona ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}} ADD REPLYlink written 6.4 years ago by Shona Wood70 Dear Shona, Of course linear regression would use all the data points if you had 50 ages, but you have in fact only three. If there are an odd number of equally spaced ages, the middle one will not contribute to the coefficient. That's just how the math works. Polynomial regression isn't an invention of the edgeR package, and how it works in R is determined by the base functions in R, not those in edgeR. If you have further questions about linear or quadratic regression itself, it might be best to read separately, or else to ask for help on a more general list, probably the main R help list. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. http://www.wehi.edu.au http://www.statsci.org/smyth On Thu, 15 Dec 2011, Wood, Shona wrote: Dear Gordon, Thank you for getting back to me and giving me those explanations. The only thing that is still confusing me is the linear coefficient. If it compares young and old without regard to middle isn't it basically the same as a pairwise comparison? Also it seems strange that it only uses the two extremities - I thought linear regression should use all the data points. For example if we had 50 different ages would 48 data points be excluded to compare the first and last only? Thanks Shona ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 13 December 2011 23:18 To: Wood, Shona Cc: Bioconductor mailing list Subject: RE: EdgeR GLM Dear Shona, I can only give you brief pointers. There are only three time points in your data. The linear coefficient in the model simply compares old with young, without regard to middle. The quadratic term compares middle with the average of young and old. So if expression goes 2-4-2 for a particular gene over the three times then you get a positive quadratic coefficient and zero linear coefficient. It is perfectly possible for the likelihood ratio test of both linear and quadratic together: glmLRT(d, glmfit, coef=2:3) to detect some genes not in the separate linear or quadratic lists, because the linear or quadratic terms might be not quite significant separately, but might be so when considered together. For example, the expression might go 2-4-3. Is this linear or quadratic? Not quite sure. But we might be confident that the three times are not all equal. If you only want to report genes that have increasing expression with age, then I would have thought you want to fit the linear term only, since the quadratic effect can never be guaranteed to be increasing. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Tue, 13 Dec 2011, Wood, Shona wrote: Dear Gordon, Thank you for getting back to me. I am not a statistician so forgive me if I am asking basic questions. I was just reading about quadratic equations. Does a significant hit in the quadratic term mean that the rate of change first increases and then decreases over time (or visa versa)? Therefore hits which require the quadratic term are not a linear up or down with age they are only "linear" after middle age? If this is the case what does the logFC for age.Q represent, is it the fold change between middle age and old? And how is it calculated? Does the gene expression at young age affect the quadratic calculation? For example if the expression at young age is 2 and then 4 at middle and then 1.99 at old would the fact that old age is basically back to the young expression level stop it being a significant quadratic result or would it be significant because of the change from middle age. I have done the quadratic term separately as suggested but I also did the linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in either of the separate quadratic or linear tests - why would this be? Finally if i want to report changes with increasing age would it be best to present the results from the quadratic and linear tests separately or would it be better to present the combined test (glmLRT(d, glmfit, coef=2, 3)) and point out which ones rely on the quadratic term? Many thanks Shona ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 10 December 2011 07:10 To: Wood, Shona Cc: Bioconductor mailing list Subject: EdgeR GLM Dear Shona, Yes, age.L is the log2-fold-change for increasing age. However your topTags table is testing for the linear and quadratic terms together, so it is testing the significance of any differences between the three age groups. I'd suggest you test whether the quadratic term is required using glmLRT(d, glmfit, coef=3) I'd also generally recommend you go on to estimate the trended and tagwise dispersions, not just the common dispersion. Best wishes Gordon > Date: Thu, 8 Dec 2011 17:41:07 +0000 > From: "Wood, Shona" <shona.wood at="" liverpool.ac.uk=""> > To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Subject: [BioC] EdgeR GLM > > Hi, > > I am trying to look at gene expression changes with increasing age. I > have three time points young, middle age and old. I have been using the > following, ordered factors and coefficient in order to get a fold change > and FDR for a linear change in gene expression with increasing age. Can > you look over my code and see if I have done this right? Is age.L the > fold change in gene expression with increasing age? > >> library(edgeR) >> library(limma) >> targets<-read.delim (file="coding_targets.txt") >> targets$age<-factor(ordered (targets$age)) >> targets$sample<-factor(targets$sample) >> targets > X files age sample > 1 p1 p1_coding_counts.txt a-young 1 > 2 p2 p2_coding_counts.txt a-young 2 > 3 p3 p3_coding_counts.txt a-young 3 > 4 p7 p7_coding_counts.txt b-middle 7 > 5 p8 p8_coding_counts.txt b-middle 8 > 6 p9 p9_coding_counts.txt b-middle 9 > 7 p4 p4_coding_counts.txt c-old 4 > 8 p5 p5_coding_counts.txt c-old 5 > 9 p6 p6_coding_counts.txt c-old 6 > >> d<-readDGE(targets, skip=1, comment.char='#') >> colnames(d)<-row.names(targets) >> d<- calcNormFactors(d) >> d > An object of class "DGEList" >$samples > X files age sample lib.size norm.factors > 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 > 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 > 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 > 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 > 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 > 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 > 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 > 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 > 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077 > > $counts > 1 2 3 4 5 6 7 8 9 > ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 > ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 > ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 > ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 > ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 > 22932 more rows ... > >> d<-d[rowSums(d$counts)>9,] >> design<- model.matrix(~ age, data = targets) >> design > (Intercept) age.L age.Q > 1 1 -7.071068e-01 0.4082483 > 2 1 -7.071068e-01 0.4082483 > 3 1 -7.071068e-01 0.4082483 > 4 1 -7.850462e-17 -0.8164966 > 5 1 -7.850462e-17 -0.8164966 > 6 1 -7.850462e-17 -0.8164966 > 7 1 7.071068e-01 0.4082483 > 8 1 7.071068e-01 0.4082483 > 9 1 7.071068e-01 0.4082483 > attr(,"assign") > [1] 0 1 1 > attr(,"contrasts") > attr(,"contrasts")$age > [1] "contr.poly" > >> d<-estimateGLMCommonDisp(d, design) >> names(d) > [1] "samples" "counts" "common.dispersion" >> glmfit<- glmFit(d, design, dispersion=d$common.dispersion) >> results<- glmLRT(d, glmfit, coef=c(2,3)) >> topTags(results) > Coefficient: age.L age.Q > logConc age.L age.Q LR P.Value > ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24 > > Many Thanks > > Shona ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
Dear Shona, One last suggestion. Try typing contrasts(age) This tells you how the coefficients are determined. I would actually suggest you simply put age in as a numericl column (1=young,2=middle,3=old) into your design matrix, instead of defining an ordinal factor. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Fri, 16 Dec 2011, Gordon K Smyth wrote: > Dear Shona, > > Of course linear regression would use all the data points if you had 50 ages, > but you have in fact only three. > > If there are an odd number of equally spaced ages, the middle one will not > contribute to the coefficient. That's just how the math works. > > Polynomial regression isn't an invention of the edgeR package, and how it > works in R is determined by the base functions in R, not those in edgeR. If > you have further questions about linear or quadratic regression itself, it > might be best to read separately, or else to ask for help on a more general > list, probably the main R help list. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Thu, 15 Dec 2011, Wood, Shona wrote: > > Dear Gordon, > > Thank you for getting back to me and giving me those explanations. > > The only thing that is still confusing me is the linear coefficient. If it > compares young and old without regard to middle isn't it basically the same > as a pairwise comparison? Also it seems strange that it only uses the two > extremities - I thought linear regression should use all the data points. For > example if we had 50 different ages would 48 data points be excluded to > compare the first and last only? > > Thanks > > Shona > ________________________________________ > From: Gordon K Smyth [smyth at wehi.EDU.AU] > Sent: 13 December 2011 23:18 > To: Wood, Shona > Cc: Bioconductor mailing list > Subject: RE: EdgeR GLM > > Dear Shona, > > I can only give you brief pointers. > > There are only three time points in your data. The linear coefficient in > the model simply compares old with young, without regard to middle. The > quadratic term compares middle with the average of young and old. So if > expression goes 2-4-2 for a particular gene over the three times then you > get a positive quadratic coefficient and zero linear coefficient. > > It is perfectly possible for the likelihood ratio test of both linear > and quadratic together: > > glmLRT(d, glmfit, coef=2:3) > > to detect some genes not in the separate linear or quadratic lists, > because the linear or quadratic terms might be not quite significant > separately, but might be so when considered together. For example, the > expression might go 2-4-3. Is this linear or quadratic? Not quite sure. > But we might be confident that the three times are not all equal. > > If you only want to report genes that have increasing expression with age, > then I would have thought you want to fit the linear term only, since the > quadratic effect can never be guaranteed to be increasing. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Tel: (03) 9345 2326, Fax (03) 9347 0852, > smyth at wehi.edu.au > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Tue, 13 Dec 2011, Wood, Shona wrote: > > Dear Gordon, > > Thank you for getting back to me. I am not a statistician so forgive me if > I am asking basic questions. I was just reading about quadratic equations. > Does a significant hit in the quadratic term mean that the rate of change > first increases and then decreases over time (or visa versa)? Therefore > hits which require the quadratic term are not a linear up or down with age > they are only "linear" after middle age? > > If this is the case what does the logFC for age.Q represent, is it the > fold change between middle age and old? And how is it calculated? Does the > gene expression at young age affect the quadratic calculation? For example > if the expression at young age is 2 and then 4 at middle and then 1.99 at > old would the fact that old age is basically back to the young expression > level stop it being a significant quadratic result or would it be > significant because of the change from middle age. > > I have done the quadratic term separately as suggested but I also did the > linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the > results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some > significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in > either of the separate quadratic or linear tests - why would this be? > > Finally if i want to report changes with increasing age would it be best > to present the results from the quadratic and linear tests separately or > would it be better to present the combined test (glmLRT(d, glmfit, coef=2, > 3)) and point out which ones rely on the quadratic term? > > Many thanks > > Shona > ________________________________________ > From: Gordon K Smyth [smyth at wehi.EDU.AU] > Sent: 10 December 2011 07:10 > To: Wood, Shona > Cc: Bioconductor mailing list > Subject: EdgeR GLM > > Dear Shona, > > Yes, age.L is the log2-fold-change for increasing age. > > However your topTags table is testing for the linear and quadratic terms > together, so it is testing the significance of any differences between the > three age groups. I'd suggest you test whether the quadratic term is > required using > > glmLRT(d, glmfit, coef=3) > > I'd also generally recommend you go on to estimate the trended and tagwise > dispersions, not just the common dispersion. > > Best wishes > Gordon > >> Date: Thu, 8 Dec 2011 17:41:07 +0000 >> From: "Wood, Shona" <shona.wood at="" liverpool.ac.uk=""> >> To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> >> Subject: [BioC] EdgeR GLM >> >> Hi, >> >> I am trying to look at gene expression changes with increasing age. I >> have three time points young, middle age and old. I have been using the >> following, ordered factors and coefficient in order to get a fold change >> and FDR for a linear change in gene expression with increasing age. Can >> you look over my code and see if I have done this right? Is age.L the >> fold change in gene expression with increasing age? >> >>> library(edgeR) >>> library(limma) >>> targets<-read.delim (file="coding_targets.txt") >>> targets$age<-factor(ordered (targets$age)) >>> targets$sample<-factor(targets$sample) >>> targets >> X files age sample >> 1 p1 p1_coding_counts.txt a-young 1 >> 2 p2 p2_coding_counts.txt a-young 2 >> 3 p3 p3_coding_counts.txt a-young 3 >> 4 p7 p7_coding_counts.txt b-middle 7 >> 5 p8 p8_coding_counts.txt b-middle 8 >> 6 p9 p9_coding_counts.txt b-middle 9 >> 7 p4 p4_coding_counts.txt c-old 4 >> 8 p5 p5_coding_counts.txt c-old 5 >> 9 p6 p6_coding_counts.txt c-old 6 >> >>> d<-readDGE(targets, skip=1, comment.char='#') >>> colnames(d)<-row.names(targets) >>> d<- calcNormFactors(d) >>> d >> An object of class "DGEList" >> $samples >> X files age sample lib.size norm.factors >> 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 >> 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 >> 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 >> 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 >> 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 >> 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 >> 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 >> 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 >> 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077 >> >>$counts >> 1 2 3 4 5 6 7 8 9 >> ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 >> ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 >> ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 >> ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 >> ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 >> 22932 more rows ... >> >>> d<-d[rowSums(d$counts)>9,] >>> design<- model.matrix(~ age, data = targets) >>> design >> (Intercept) age.L age.Q >> 1 1 -7.071068e-01 0.4082483 >> 2 1 -7.071068e-01 0.4082483 >> 3 1 -7.071068e-01 0.4082483 >> 4 1 -7.850462e-17 -0.8164966 >> 5 1 -7.850462e-17 -0.8164966 >> 6 1 -7.850462e-17 -0.8164966 >> 7 1 7.071068e-01 0.4082483 >> 8 1 7.071068e-01 0.4082483 >> 9 1 7.071068e-01 0.4082483 >> attr(,"assign") >> [1] 0 1 1 >> attr(,"contrasts") >> attr(,"contrasts")$age >> [1] "contr.poly" >> >>> d<-estimateGLMCommonDisp(d, design) >>> names(d) >> [1] "samples" "counts" "common.dispersion" >>> glmfit<- glmFit(d, design, dispersion=d$common.dispersion) >>> results<- glmLRT(d, glmfit, coef=c(2,3)) >>> topTags(results) >> Coefficient: age.L age.Q >> logConc age.L age.Q LR P.Value >> ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24 >> >> Many Thanks >> >> Shona > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} ADD REPLYlink written 6.4 years ago by Gordon Smyth33k 0 6.4 years ago by Gordon Smyth33k Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia Gordon Smyth33k wrote: Dear Shona, These all give the same result as my suggestion. It doesn't matter how you express the three ages, as long as they are equally spaced, you will get the same differential expression results. Not just similar, but exactly the same p-values. The coefficient logFC will change depending on how you scale the age. If you multiply the ages by 2, the logFC will divide by 2. Etc. Best wishes Gordon On Fri, 16 Dec 2011, Wood, Shona wrote: Dear Gordon, Thank you for your suggestions I think I may have a solution though I would like to run it by you and anyone else who uses R. I was reading a thread on the mailing list called: "[R] Behavior of ordered factors in glm" and it seems this person has different ages they want to do a linear analysis on (not in edgeR just R). Looking at there solution I have come up with the following: > library(edgeR) > library(limma) > targets<-read.delim (file="coding_targets.txt") > targets$age<-C(targets$age,poly,1) > attr(targets$age,"contrasts") .L a-young -7.071068e-01 b-middle -7.850462e-17 c-old 7.071068e-01 > targets$sample<-factor(targets$sample) > targets X files age sample 1 p1 p1_coding_counts.txt a-young 1 2 p2 p2_coding_counts.txt a-young 2 3 p3 p3_coding_counts.txt a-young 3 4 p7 p7_coding_counts.txt b-middle 7 5 p8 p8_coding_counts.txt b-middle 8 6 p9 p9_coding_counts.txt b-middle 9 7 p4 p4_coding_counts.txt c-old 4 8 p5 p5_coding_counts.txt c-old 5 9 p6 p6_coding_counts.txt c-old 6 > d<-readDGE(targets, skip=1, comment.char='#') > colnames(d)<-row.names(targets) > d<- calcNormFactors(d) > d An object of class "DGEList" $samples X files age sample lib.size norm.factors 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077$counts 1 2 3 4 5 6 7 8 9 ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 22932 more rows ... > d<-d[rowSums(d$counts)>9,] > design<- model.matrix(~ age, data = targets) > design (Intercept) age.L 1 1 -7.071068e-01 2 1 -7.071068e-01 3 1 -7.071068e-01 4 1 -7.850462e-17 5 1 -7.850462e-17 6 1 -7.850462e-17 7 1 7.071068e-01 8 1 7.071068e-01 9 1 7.071068e-01 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$age .L a-young -7.071068e-01 b-middle -7.850462e-17 c-old 7.071068e-01 > d<-estimateGLMCommonDisp(d, design) > glmfit<- glmFit(d, design, dispersion=d$common.dispersion) > results<- glmLRT(d, glmfit, coef=c(2)) > topTags(results) Coefficient: age.L logConc logFC LR P.Value FDR ENSRNOG000000xxxxx -11.271529 -1.9336775 55.45465 9.564007e-14 1.453825e-09 ENSRNOG000000xxxxx -10.359443 1.2661223 49.11802 2.410161e-12 1.831843e-08 ENSRNOG000000xxxxx -11.317494 -1.5480925 47.35359 5.926938e-12 3.003180e-08 Please correct me if I am wrong but I think that this is the change in gene expression with increasing age taking into account expression at middle age. I did also have another way of doing it which involved changing age to a number as suggested by you: targets$agenum<-as.numeric (targets$age) targets$agenum<-targets$agenum-1 > design<- model.matrix(~ agenum, data = targets) > design (Intercept) agenum 1 1 0 2 1 0 3 1 0 4 1 1 5 1 1 6 1 1 7 1 2 8 1 2 9 1 2 attr(,"assign") [1] 0 1 And this gives me similar results. Please let me know if you think this is doing what I hope it is i.e. giving my the linear change in gene expression with increasing age whilst taking into account middle age. Many thanks Shona ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 15 December 2011 22:43 To: Wood, Shona Cc: Bioconductor mailing list Subject: Re: EdgeR GLM Dear Shona, One last suggestion. Try typing contrasts(age) This tells you how the coefficients are determined. I would actually suggest you simply put age in as a numericl column (1=young,2=middle,3=old) into your design matrix, instead of defining an ordinal factor. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Fri, 16 Dec 2011, Gordon K Smyth wrote: > Dear Shona, > > Of course linear regression would use all the data points if you had 50 ages, > but you have in fact only three. > > If there are an odd number of equally spaced ages, the middle one will not > contribute to the coefficient. That's just how the math works. > > Polynomial regression isn't an invention of the edgeR package, and how it > works in R is determined by the base functions in R, not those in edgeR. If > you have further questions about linear or quadratic regression itself, it > might be best to read separately, or else to ask for help on a more general > list, probably the main R help list. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Thu, 15 Dec 2011, Wood, Shona wrote: > > Dear Gordon, > > Thank you for getting back to me and giving me those explanations. > > The only thing that is still confusing me is the linear coefficient. If it > compares young and old without regard to middle isn't it basically the same > as a pairwise comparison? Also it seems strange that it only uses the two > extremities - I thought linear regression should use all the data points. For > example if we had 50 different ages would 48 data points be excluded to > compare the first and last only? > > Thanks > > Shona > ________________________________________ > From: Gordon K Smyth [smyth at wehi.EDU.AU] > Sent: 13 December 2011 23:18 > To: Wood, Shona > Cc: Bioconductor mailing list > Subject: RE: EdgeR GLM > > Dear Shona, > > I can only give you brief pointers. > > There are only three time points in your data. The linear coefficient in > the model simply compares old with young, without regard to middle. The > quadratic term compares middle with the average of young and old. So if > expression goes 2-4-2 for a particular gene over the three times then you > get a positive quadratic coefficient and zero linear coefficient. > > It is perfectly possible for the likelihood ratio test of both linear > and quadratic together: > > glmLRT(d, glmfit, coef=2:3) > > to detect some genes not in the separate linear or quadratic lists, > because the linear or quadratic terms might be not quite significant > separately, but might be so when considered together. For example, the > expression might go 2-4-3. Is this linear or quadratic? Not quite sure. > But we might be confident that the three times are not all equal. > > If you only want to report genes that have increasing expression with age, > then I would have thought you want to fit the linear term only, since the > quadratic effect can never be guaranteed to be increasing. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Tel: (03) 9345 2326, Fax (03) 9347 0852, > smyth at wehi.edu.au > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Tue, 13 Dec 2011, Wood, Shona wrote: > > Dear Gordon, > > Thank you for getting back to me. I am not a statistician so forgive me if > I am asking basic questions. I was just reading about quadratic equations. > Does a significant hit in the quadratic term mean that the rate of change > first increases and then decreases over time (or visa versa)? Therefore > hits which require the quadratic term are not a linear up or down with age > they are only "linear" after middle age? > > If this is the case what does the logFC for age.Q represent, is it the > fold change between middle age and old? And how is it calculated? Does the > gene expression at young age affect the quadratic calculation? For example > if the expression at young age is 2 and then 4 at middle and then 1.99 at > old would the fact that old age is basically back to the young expression > level stop it being a significant quadratic result or would it be > significant because of the change from middle age. > > I have done the quadratic term separately as suggested but I also did the > linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the > results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some > significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in > either of the separate quadratic or linear tests - why would this be? > > Finally if i want to report changes with increasing age would it be best > to present the results from the quadratic and linear tests separately or > would it be better to present the combined test (glmLRT(d, glmfit, coef=2, > 3)) and point out which ones rely on the quadratic term? > > Many thanks > > Shona > ________________________________________ > From: Gordon K Smyth [smyth at wehi.EDU.AU] > Sent: 10 December 2011 07:10 > To: Wood, Shona > Cc: Bioconductor mailing list > Subject: EdgeR GLM > > Dear Shona, > > Yes, age.L is the log2-fold-change for increasing age. > > However your topTags table is testing for the linear and quadratic terms > together, so it is testing the significance of any differences between the > three age groups. I'd suggest you test whether the quadratic term is > required using > > glmLRT(d, glmfit, coef=3) > > I'd also generally recommend you go on to estimate the trended and tagwise > dispersions, not just the common dispersion. > > Best wishes > Gordon > >> Date: Thu, 8 Dec 2011 17:41:07 +0000 >> From: "Wood, Shona" <shona.wood at="" liverpool.ac.uk=""> >> To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> >> Subject: [BioC] EdgeR GLM >> >> Hi, >> >> I am trying to look at gene expression changes with increasing age. I >> have three time points young, middle age and old. I have been using the >> following, ordered factors and coefficient in order to get a fold change >> and FDR for a linear change in gene expression with increasing age. Can >> you look over my code and see if I have done this right? Is age.L the >> fold change in gene expression with increasing age? >> >>> library(edgeR) >>> library(limma) >>> targets<-read.delim (file="coding_targets.txt") >>> targets$age<-factor(ordered (targets$age)) >>> targets$sample<-factor(targets$sample) >>> targets >> X files age sample >> 1 p1 p1_coding_counts.txt a-young 1 >> 2 p2 p2_coding_counts.txt a-young 2 >> 3 p3 p3_coding_counts.txt a-young 3 >> 4 p7 p7_coding_counts.txt b-middle 7 >> 5 p8 p8_coding_counts.txt b-middle 8 >> 6 p9 p9_coding_counts.txt b-middle 9 >> 7 p4 p4_coding_counts.txt c-old 4 >> 8 p5 p5_coding_counts.txt c-old 5 >> 9 p6 p6_coding_counts.txt c-old 6 >> >>> d<-readDGE(targets, skip=1, comment.char='#') >>> colnames(d)<-row.names(targets) >>> d<- calcNormFactors(d) >>> d >> An object of class "DGEList" >>$samples >> X files age sample lib.size norm.factors >> 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724 >> 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573 >> 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044 >> 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660 >> 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893 >> 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573 >> 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553 >> 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109 >> 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077 >> >> $counts >> 1 2 3 4 5 6 7 8 9 >> ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960 >> ENSRNOG00000000008 0 0 0 0 1 0 0 1 0 >> ENSRNOG00000000009 0 0 0 3 0 0 1 0 0 >> ENSRNOG00000000010 38 405 50 18 105 405 372 42 282 >> ENSRNOG00000000012 0 0 0 0 0 0 0 0 0 >> 22932 more rows ... >> >>> d<-d[rowSums(d$counts)>9,] >>> design<- model.matrix(~ age, data = targets) >>> design >> (Intercept) age.L age.Q >> 1 1 -7.071068e-01 0.4082483 >> 2 1 -7.071068e-01 0.4082483 >> 3 1 -7.071068e-01 0.4082483 >> 4 1 -7.850462e-17 -0.8164966 >> 5 1 -7.850462e-17 -0.8164966 >> 6 1 -7.850462e-17 -0.8164966 >> 7 1 7.071068e-01 0.4082483 >> 8 1 7.071068e-01 0.4082483 >> 9 1 7.071068e-01 0.4082483 >> attr(,"assign") >> [1] 0 1 1 >> attr(,"contrasts") >> attr(,"contrasts")$age >> [1] "contr.poly" >> >>> d<-estimateGLMCommonDisp(d, design) >>> names(d) >> [1] "samples" "counts" "common.dispersion" >>> glmfit<- glmFit(d, design, dispersion=d$common.dispersion) >>> results<- glmLRT(d, glmfit, coef=c(2,3)) >>> topTags(results) >> Coefficient: age.L age.Q >> logConc age.L age.Q LR P.Value >> ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24 >> >> Many Thanks >> >> Shona > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:9}}