Question: edgeR - R script - results compared to DESeq
0
7.7 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:
Hi Hilary, > Date: Wed, 30 Nov 2011 08:30:36 -0500 (EST) > From: "Smith, Hilary A" <hilary.smith at="" gatech.edu=""> > To: bioconductor at r-project.org > Subject: [BioC] edgeR - R script - results compared to DESeq > > In case it helps the discussion ... I also tried running GLMs in both > DESeq and edgeR. I likewise found that edgeR yielded more differentially > expressed tags or genes. I know Dr. Gordon Smyth mentioned > calcNormFactors and tagwise dispersion; I did use both of these options. > > If it helps, an abstract description of the model comparison used in > both programs is below (and below that if helpful, full code for edgeR > -- I am using the newest release). I assume the differences are coming > from the ways DESeq and edgeR estimate dispersion, but I'm eager to > learn more about the rationale (especially given I just started using > R/Bioconductor a few weeks ago) and note the results below in case they > are of use in identifying the differences. I am glad to hear this is not > just a factor of my dataset and is a common feature to have edgeR find > more genes. The rationale of the edgeR package is explained in: http://bioinformatics.oxfordjournals.org/content/23/21/2881 The p-values from edgeR are very slightly liberal because it treats the dispersions as known in the testing procedure. I think other packages are also subject this same assumption however. > My models for main effects A and B (with 3 biological reps. each) and their interaction are: > Full: (A + B + A:B) > Reduced1: (A + B) > Reduced 2: (A) > > The comparison of the Full vs. Reduced1 yields the genes impacted by the > interaction term A:B. To obtain genes impacted by the main effect A, I > perform a comparison of Full vs. Reduced1 and a comparison of Full vs. > Reduced 2 -- those genes found at P.adj<0.05 in the Full vs. Reduced 2 > comparison but not found in Full vs. Reduced 1 are what I am noting as > genes impacted by main effect A. (To the best of my knowledge I cannot > simply drop term A and leave in the interaction term, so this is my > attempt to isolate term A. If there's a better way to do this, I'd be > glad to know.). You are correct that you can't remove A and keep A:B. However many statisticians (John Nelder for example) have argued that the concept of a main effect for A is not meaningful in the presence of a nonzero interaction A:B. If A:B is nonzero, then A can actually be interpretted as having two different main effects, one for each level of B. To find genes responding to A in either level of B, you would compare the Full model with just B. This would test for an A effect in either or both levels of B. Alternatively, and this is what I usually recommend, you can perform a separate test for A within each level of B. To do this, you use glmFit to fit the model (B + A:B), which will have four parameters if A and B each have two levels. This is actually the same model as your Full model but parametrized differently. Then glmLRT(d,fit,coef=3) tests for an A effect when B is at its first level, and glmLRT(d,fit,coef=4) tests for A when B is at its second level. BTW, you can perform these same tests after fitting A+B+A:B using the contrast argument of glmLRT, but reparametrizing to B+A:B makes it simpler. Best wishes Gordon > My results for the interaction term are: > edgeR: 173 genes > DESeq: 38 genes > > > For the main effect A: > edgeR: 261 genes > DESeq: 61 genes > **NOTE: For this comparison of term A, of the 61 genes found by DESeq, > about 44 (or ~72%) were also found by edgeR. > > I did have warnings in running DESeq that the Full model GLM didn't > converge which is disconcerting... edgeR didn't give these warnings but > still found more components. > > Best, > Hilary > > > ~~In the code below, main effect "A" above is "Season," and main effect "B" above is "Hydroperiod." > >> library(edgeR) >> library(limma) >> raw.data = read.csv("2011.11.14counts.csv", header=TRUE, stringsAsFactors=FALSE) >> d = raw.data[,2:13] >> rownames(d) = raw.data[,1] >> head(d) > X1E_R X2E_R X3E_R X1P_R X2P_R X3P_R X1E_F X2E_F X3E_F X1P_F X2P_F > comp0 1159 1572 1817 605 1113 1732 1065 1207 1477 1841 1915 > comp1 534 675 739 236 451 799 544 341 333 690 502 > comp10 37677 54466 58271 34712 40312 51243 30423 28044 23961 53852 59300 > > comp100 1065 1332 1620 658 861 1370 1060 999 918 1697 1117 > comp1000 157 266 247 135 188 244 130 229 141 263 182 > comp10000 14 37 47 17 21 64 35 15 10 28 22 > X3P_F > comp0 1645 > comp1 571 > comp10 44575 > comp100 1336 > comp1000 168 > comp10000 12 > >> Hydroperiod = factor(c("E", "E", "E", "P", "P", "P", "E", "E", "E", "P", "P", "P")) >> Season = factor(c("R", "R", "R", "R", "R", "R", "F", "F", "F", "F", "F", "F")) >> design = model.matrix(~Hydroperiod + Season + Hydroperiod:Season) >> design > (Intercept) HydroperiodP SeasonR HydroperiodP:SeasonR > 1 1 0 1 0 > 2 1 0 1 0 > 3 1 0 1 0 > 4 1 1 1 1 > 5 1 1 1 1 > 6 1 1 1 1 > 7 1 0 0 0 > 8 1 0 0 0 > 9 1 0 0 0 > 10 1 1 0 0 > 11 1 1 0 0 > 12 1 1 0 0 > attr(,"assign") > [1] 0 1 2 3 > attr(,"contrasts") > attr(,"contrasts")$Hydroperiod > [1] "contr.treatment" > > attr(,"contrasts")$Season > [1] "contr.treatment" > >> d.GLM = DGEList(d, group = c("ER", "ER", "ER", "PR", "PR", "PR", "EF", "EF", "EF", "PF", "PF", "PF")) > Calculating library sizes from column totals. >> d.GLM = calcNormFactors(d.GLM) >> d.GLM > An object of class "DGEList" > $samples > group lib.size norm.factors > X1E_R ER 23295633 0.9559226 > X2E_R ER 25882545 1.1040337 > X3E_R ER 29401480 1.0236513 > X1P_R PR 20877015 0.8199915 > X2P_R PR 26649613 0.8869479 > 7 more rows ... > >$counts > X1E_R X2E_R X3E_R X1P_R X2P_R X3P_R X1E_F X2E_F X3E_F X1P_F X2P_F > comp0 1159 1572 1817 605 1113 1732 1065 1207 1477 1841 1915 > comp1 534 675 739 236 451 799 544 341 333 690 502 > comp10 37677 54466 58271 34712 40312 51243 30423 28044 23961 53852 59300 > comp100 1065 1332 1620 658 861 1370 1060 999 918 1697 1117 > comp1000 157 266 247 135 188 244 130 229 141 263 182 > X3P_F > comp0 1645 > comp1 571 > comp10 44575 > > comp100 1336 > comp1000 168 > 25055 more rows ... > > $all.zeros > comp0 comp1 comp10 comp100 comp1000 > FALSE FALSE FALSE FALSE FALSE > 25055 more elements ... > >> nrow(d.GLM) > [1] 25060 >> dim(d.GLM) > [1] 25060 12 > > >> design > (Intercept) HydroperiodP SeasonR HydroperiodP:SeasonR > 1 1 0 1 0 > 2 1 0 1 0 > 3 1 0 1 0 > 4 1 1 1 1 > 5 1 1 1 1 > 6 1 1 1 1 > 7 1 0 0 0 > 8 1 0 0 0 > 9 1 0 0 0 > 10 1 1 0 0 > 11 1 1 0 0 > 12 1 1 0 0 > attr(,"assign") > [1] 0 1 2 3 > attr(,"contrasts") > attr(,"contrasts")$Hydroperiod > [1] "contr.treatment" > > attr(,"contrasts")$Season > [1] "contr.treatment" > >> d.GLM = estimateGLMCommonDisp(d.GLM, design) >> names(d.GLM) > [1] "samples" "counts" "all.zeros" > [4] "common.dispersion" >> d.GLM$common.dispersion > [1] 0.1488192 >> sqrt(d.GLM$common.dispersion) > [1] 0.3857709 >> # 0.3857709 is the Coefficient of Biological Variation >> d.GLM = estimateGLMTrendedDisp(d.GLM, design) > Loading required package: splines >> summary(d.GLM$trended.dispersion) > Min. 1st Qu. Median Mean 3rd Qu. Max. > 0.07541 0.10240 0.19030 0.25500 0.37530 1.26900 >> d.GLM = estimateGLMTagwiseDisp(d.GLM, design) >> d.GLM$prior.n > NULL >> d$prior.n > NULL >> ls() > [1] "Hydroperiod" "Season" "d" "d.GLM" "design" > [6] "raw.data" >> d.GLM > An object of class "DGEList" > $samples > group lib.size norm.factors > X1E_R ER 23295633 0.9559226 > X2E_R ER 25882545 1.1040337 > X3E_R ER 29401480 1.0236513 > > X1P_R PR 20877015 0.8199915 > X2P_R PR 26649613 0.8869479 > 7 more rows ... > >$counts > X1E_R X2E_R X3E_R X1P_R X2P_R X3P_R X1E_F X2E_F X3E_F X1P_F X2P_F > comp0 1159 1572 1817 605 1113 1732 1065 1207 1477 1841 1915 > comp1 534 675 739 236 451 799 544 341 333 690 502 > comp10 37677 54466 58271 34712 40312 51243 30423 28044 23961 53852 59300 > comp100 1065 1332 1620 658 861 1370 1060 999 918 1697 1117 > comp1000 157 266 247 135 188 244 130 229 141 263 182 > X3P_F > comp0 1645 > comp1 571 > comp10 44575 > comp100 1336 > comp1000 168 > 25055 more rows ... > > $all.zeros > comp0 comp1 comp10 comp100 comp1000 > FALSE FALSE FALSE FALSE FALSE > 25055 more elements ... > >$common.dispersion > [1] 0.1488192 > > $trended.dispersion > [1] 0.07619372 0.08679353 0.10444478 0.07739597 0.10629516 > 25055 more elements ... > >$abundance > comp0 comp1 comp10 comp100 comp1000 > -9.737514 -10.720757 -6.331670 -9.937984 -11.724981 > 25055 more elements ... > > $bin.dispersion > [1] 0.7340070 0.6589801 0.5901124 0.5500868 0.4865594 > 22 more elements ... > >$bin.abundance > [1] -17.08166 -16.46619 -16.13033 -15.84028 -15.59074 > 22 more elements ... > > $tagwise.dispersion > [1] 0.06348981 0.06769437 0.07385569 0.05410498 0.08450404 > 25055 more elements ... > >> ?getPriorN >> getPriorN(d.GLM, design=design) > [1] 2.5 >> head(d.GLM$tagwise.dispersion) > [1] 0.06348981 0.06769437 0.07385569 0.05410498 0.08450404 0.19604508 >> summary(d.GLM$tagwise.dispersion) > Min. 1st Qu. Median Mean 3rd Qu. Max. > 0.05139 0.09363 0.18700 0.25910 0.35480 1.71800 >> glmfit.tgw = glmFit(d.GLM, design, dispersion=d.GLM$tagwise.dispersion) >> lrt.tgw = glmLRT(d.GLM, glmfit.tgw) >> topTags(lrt.tgw) > Coefficient: HydroperiodP:SeasonR > logConc logFC LR P.Value FDR > comp13665 -10.932729 1.068620e+01 59.37370 1.304000e-14 3.267824e-10 > comp15478 -12.077538 7.742379e+00 47.67249 5.037112e-12 5.552272e-08 > comp370 -13.588446 1.442695e+08 46.86820 7.592458e-12 5.552272e-08 > > > comp10848 -11.836655 8.970575e+00 46.56512 8.862366e-12 5.552272e-08 > comp13403 -9.315242 3.773345e+00 44.92234 2.050057e-11 1.027488e-07 > comp7180 -11.518762 7.869625e+00 43.18096 4.990399e-11 2.084323e-07 > comp2502 -10.479763 5.723705e+00 41.56745 1.138735e-10 4.076673e-07 > comp2740 -10.814075 4.666231e+00 38.00241 7.065735e-10 2.213342e-06 > comp4314 -13.104853 4.837400e+00 35.44570 2.622602e-09 7.302491e-06 > comp13675 -12.930253 4.442508e+00 34.46100 4.348772e-09 1.089802e-05 >> summary(decideTestsDGE(lrt.tgw)) > [,1] > -1 57 > 0 24887 > 1 116 > > >> design > (Intercept) HydroperiodP SeasonR HydroperiodP:SeasonR > 1 1 0 1 0 > 2 1 0 1 0 > 3 1 0 1 0 > 4 1 1 1 1 > 5 1 1 1 1 > 6 1 1 1 1 > 7 1 0 0 0 > 8 1 0 0 0 > 9 1 0 0 0 > 10 1 1 0 0 > 11 1 1 0 0 > 12 1 1 0 0 > attr(,"assign") > [1] 0 1 2 3 > attr(,"contrasts") > attr(,"contrasts")$Hydroperiod > [1] "contr.treatment" > > attr(,"contrasts")$Season > [1] "contr.treatment" > >> lrt.coef4 = glmLRT(d.GLM, glmfit.tgw, coef=4) >> topTags(lrt.coef4) > Coefficient: HydroperiodP:SeasonR > logConc logFC LR P.Value FDR > comp13665 -10.932729 1.068620e+01 59.37370 1.304000e-14 3.267824e-10 > comp15478 -12.077538 7.742379e+00 47.67249 5.037112e-12 5.552272e-08 > comp370 -13.588446 1.442695e+08 46.86820 7.592458e-12 5.552272e-08 > comp10848 -11.836655 8.970575e+00 46.56512 8.862366e-12 5.552272e-08 > comp13403 -9.315242 3.773345e+00 44.92234 2.050057e-11 1.027488e-07 > comp7180 -11.518762 7.869625e+00 43.18096 4.990399e-11 2.084323e-07 > comp2502 -10.479763 5.723705e+00 41.56745 1.138735e-10 4.076673e-07 > comp2740 -10.814075 4.666231e+00 38.00241 7.065735e-10 2.213342e-06 > comp4314 -13.104853 4.837400e+00 35.44570 2.622602e-09 7.302491e-06 > comp13675 -12.930253 4.442508e+00 34.46100 4.348772e-09 1.089802e-05 >> lrt.coef34 = glmLRT(d.GLM, glmfit.tgw, coef=3:4) >> topTags(lrt.coef34) > Coefficient: SeasonR HydroperiodP:SeasonR > logConc SeasonR HydroperiodP.SeasonR LR P.Value > comp11779 -13.60929 1.442695e+08 -1.442695e+08 88.20296 7.030231e-20 > comp21414 -13.64545 5.616807e+00 1.442695e+08 71.78587 2.581642e-16 > comp6411 -10.57883 1.671168e+00 2.006498e+00 67.75411 1.938124e-15 > comp6417 -10.10518 1.510699e+00 1.224445e+00 65.09872 7.311274e-15 > comp13665 -10.93273 -2.193560e+00 1.068620e+01 62.70305 2.422182e-14 > comp1872 -12.53141 3.893662e+00 -2.007081e+00 62.49057 2.693670e-14 > comp5005 -11.13142 4.565629e-01 3.063432e+00 61.87012 3.673456e-14 > comp15150 -13.28156 5.032869e+00 1.442695e+08 58.96222 1.572234e-13 > comp2502 -10.47976 -4.007870e-01 5.723705e+00 56.89575 4.418199e-13 > comp19402 -11.60722 5.375493e+00 -5.180038e+00 52.82636 3.379876e-12 > FDR > comp11779 1.761776e-15 > comp21414 3.234797e-12 > comp6411 1.618980e-11 > comp6417 4.580513e-11 > comp13665 1.125056e-10 > comp1872 1.125056e-10 > comp5005 1.315097e-10 > comp15150 4.925022e-10 > comp2502 1.230223e-09 > comp19402 8.469969e-09 >> ls() > [1] "Hydroperiod" "Season" "d" "d.GLM" "design" > > > [6] "glmfit.tgw" "lrt.coef34" "lrt.coef4" "lrt.tgw" "oo" > [11] "raw.data" >> head(lrt.coef34) > An object of class "DGELRT" > $samples > group lib.size norm.factors > X1E_R ER 23295633 0.9559226 > X2E_R ER 25882545 1.1040337 > X3E_R ER 29401480 1.0236513 > X1P_R PR 20877015 0.8199915 > X2P_R PR 26649613 0.8869479 > 7 more rows ... > >$all.zeros > comp0 comp1 comp10 comp100 comp1000 comp10000 > FALSE FALSE FALSE FALSE FALSE FALSE > > $common.dispersion > [1] 0.1488192 > >$trended.dispersion > [1] 0.07619372 0.08679353 0.10444478 0.07739597 0.10629516 0.20365547 > > $abundance > comp0 comp1 comp10 comp100 comp1000 comp10000 > -9.737514 -10.720757 -6.331670 -9.937984 -11.724981 -13.712600 > >$bin.dispersion > [1] 0.7340070 0.6589801 0.5901124 0.5500868 0.4865594 > 22 more elements ... > > $bin.abundance > [1] -17.08166 -16.46619 -16.13033 -15.84028 -15.59074 > 22 more elements ... > >$tagwise.dispersion > [1] 0.06348981 0.06769437 0.07385569 0.05410498 0.08450404 0.19604508 > > $coef > [1] 4 > >$table > logConc logFC.SeasonR logFC.HydroperiodP.SeasonR LR.statistic > comp0 -9.736436 -0.24995924 -0.3214735 4.34114504 > comp1 -10.739594 0.20037356 -0.3870300 0.77512680 > comp10 -6.341516 0.37480089 -0.5276106 1.59407277 > comp100 -9.940774 -0.06226595 -0.3360123 2.12228134 > comp1000 -11.726085 -0.07140854 0.1131265 0.05488506 > comp10000 -13.750173 0.21120837 0.5708073 1.99318234 > p.value > comp0 0.1141123 > comp1 0.6787086 > comp10 0.4506626 > comp100 0.3460608 > comp1000 0.9729306 > comp10000 0.3691356 > > $coefficients.full > (Intercept) HydroperiodP SeasonR HydroperiodP:SeasonR > comp0 -9.620221 0.028007573 -0.17325854 -0.22282846 > comp1 -10.774172 0.058157626 0.13888837 -0.26826873 > comp10 -6.555230 0.335378111 0.25979218 -0.36571177 > comp100 -9.871905 0.009775718 -0.04315947 -0.23290600 > comp1000 -11.662643 -0.118031564 -0.04949663 0.07841331 > comp10000 -13.805669 -0.272566725 0.14639848 0.39565348 > >$coefficients.null > (Intercept) HydroperiodP > comp0 -9.703284 -0.06737999 > comp1 -10.701998 -0.07653693 > comp10 -6.416913 0.14550436 > comp100 -9.893311 -0.09719967 > comp1000 -11.687332 -0.07884619 > comp10000 -13.727706 -0.04514128 > > $design.full > (Intercept) HydroperiodP SeasonR HydroperiodP:SeasonR > 1 1 0 1 0 > 2 1 0 1 0 > 3 1 0 1 0 > 4 1 1 1 1 > 5 1 1 1 1 > 7 more rows ... > >$design.null > (Intercept) HydroperiodP > 1 1 0 > 2 1 0 > 3 1 0 > 4 1 1 > 5 1 1 > 7 more rows ... > > $dispersion.used > [1] 0.06348981 0.06769437 0.07385569 0.05410498 0.08450404 0.19604508 > >$comparison > [1] "SeasonR" "HydroperiodP:SeasonR" > >> > RemoveSeasonRmHydBySeason = topTags(lrt.coef34, number =25060) > Error in topTags(lrt.coef34, number = 25060) : > unused argument(s) (number = 25060) >> ?topTags >> RemoveSeasonRmHydBySeason = topTags(lrt.coef34, n =25060) >> ls() > [1] "Hydroperiod" "RemoveSeasonRmHydBySeason" > [3] "Season" "d" > [5] "d.GLM" "design" > [7] "glmfit.tgw" "lrt.coef34" > [9] "lrt.coef4" "lrt.tgw" > [11] "oo" "raw.data" >> RemoveHydBySeason = topTags(lrt.coef4, n=25060) >> write.csv(RemoveHydBySeason, "RemoveHydBySeason.csv") >> write.csv(RemoveSeasonRmHydBySeason, "RemoveSeasonRmHydBySeason.csv") >> summary(decideTestsDGE(lrt.coef4) > + >> summary(decideTestsDGE(lrt.coef4)) > [,1] > -1 57 > 0 24887 > 1 116 >> summary(decideTestsDGE(lrt.coef34)) > Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : > attempt to set an attribute on NULL >> design > (Intercept) HydroperiodP SeasonR HydroperiodP:SeasonR > 1 1 0 1 0 > 2 1 0 1 0 > 3 1 0 1 0 > 4 1 1 1 1 > 5 1 1 1 1 > 6 1 1 1 1 > 7 1 0 0 0 > 8 1 0 0 0 > 9 1 0 0 0 > 10 1 1 0 0 > 11 1 1 0 0 > 12 1 1 0 0 > attr(,"assign") > [1] 0 1 2 3 > attr(,"contrasts") > attr(,"contrasts")$Hydroperiod > [1] "contr.treatment" > > attr(,"contrasts")$Season > [1] "contr.treatment" ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
glad edger deseq • 617 views