edgeR: paired samples DGE and GLM

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hi, I am trying to use edgeR to compute differential gene expression. I have a quite simple experimental design: labExpId Patient sex Treatment sample.4 1 F A sample.3 2 M A sample.5 2 M B sample.7 3 F A sample.0 4 M A sample.8 3 F B sample.1 4 M B sample.2 1 F B I am comparing the effect of treatment across all patients, and it is fine when I apply exactTest. Then I wanted to include also the Patient in my model and I try using GLM, but I found several problems. So, I tried to apply GLM only including the Treatment just to troubleshoot. I follow the instructions on page 22 from the user's guide revised in December 2012. My counts are stored in a matrix called m. >colnames(m) [1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" sample.8" [7] "sample.0" "sample.1" >design=model.matrix(~0+Treatment) > design TreatmentA TreatmentB sample.4 1 0 sample.3 1 0 sample.5 0 1 sample.7 1 0 sample.0 1 0 sample.8 0 1 sample.1 0 1 sample.2 0 1 >M = DGEList(counts=na.omit(m)) >cpm.m <- cpm(M) >M <- M[rowSums(cpm.m >1) >=1,] >M <- calcNormFactors(M) >M <- estimateGLMCommonDisp(M, design, verbose=T) > names(M) [1] "counts" "samples" "abundance" [4] "logCPM" "common.dispersion" First thing I notice, I don't see pseudo.counts and other attributes I saw when calling estimateCommonDisp(). > M <- estimateGLMTrendedDisp(M, design) > M <- estimateGLMTagwiseDisp(M, design) > fit <- glmFit(M, design) > lrt <- glmLRT(fit, contrast=c(-1,1)) > res = topTags(lrt, n=nrow(lrt))$table > head(res) logFC logCPM LR PValue FDR ENSG00000244115 15.512665 2.266789 35.30878 2.813602e-09 3.482958e-05 ENSG00000256329 -17.760869 4.514903 32.89653 9.719681e-09 6.015996e-05 ENSG00000223638 11.777617 -1.467638 18.46673 1.728967e-05 7.134295e-02 ENSG00000134321 -5.328058 5.959204 16.76318 4.234703e-05 1.310535e-01 ENSG00000196861 7.776972 -1.722111 15.83143 6.924259e-05 1.494412e-01 ENSG00000229292 11.368590 -1.876601 15.74621 7.243290e-05 1.494412e-01 > m[rownames(head(res)),] sample.2 sample.3 sample.4 sample.5 sample.7 sample.8 ENSG00000244115 0 0 0 0 0 6412 ENSG00000256329 0 0 0 32070 0 0 ENSG00000223638 0 0 2 0 0 0 ENSG00000134321 1178 590 890 83130 754 1116 ENSG00000196861 0 3 363 0 0 53 ENSG00000229292 0 0 0 0 0 0 sample.0 sample.1 ENSG00000244115 0 4415 ENSG00000256329 0 0 ENSG00000223638 434 550 ENSG00000134321 852 1092 ENSG00000196861 500 57 ENSG00000229292 344 406 I realized that it may be a problem of the ordering of the design matrix rows > colnames(m) [1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" "sample.8" [7] "sample.0" "sample.1" > rownames(design) [1] "sample.4" "sample.3" "sample.5" "sample.7" "sample.0" "sample.8" [7] "sample.1" "sample.2" so I redo it forcing the row names of my design matrix to be the same order as the column names in the matrix count. Now my design matrix looks like: > design TreatmentA TreatmentB sample.2 0 1 sample.3 1 0 sample.4 1 0 sample.5 0 1 sample.7 1 0 sample.8 0 1 sample.0 1 0 sample.1 0 1 I execute exactly the same commands as before. Again I cannot see the pseudocounts. > names(M) [1] "counts" "samples" "abundance" [4] "logCPM" "common.dispersion" > head(res) logFC logCPM LR PValue FDR ENSG00000074855 -32.17749 0.2759965 2121.104 0 0 ENSG00000076351 -32.17749 -0.4306713 1549.345 0 0 ENSG00000105552 -32.17749 -0.4864164 1502.658 0 0 ENSG00000106991 -32.17749 0.4622641 2144.947 0 0 ENSG00000123009 -32.17749 -0.2422835 1719.625 0 0 ENSG00000133216 -32.17749 0.2206127 2127.575 0 0 I get very different results still from the exactTest with no GLM. I attribute this to the missing pseudocounts in the DGEList object? > m[rownames(head(res)),] sample.2 sample.3 sample.4 sample.5 sample.7 sample.8 ENSG00000074855 0 948 902 0 676 0 ENSG00000076351 0 494 522 0 538 0 ENSG00000105552 0 490 454 0 486 0 ENSG00000106991 0 1094 798 0 698 0 ENSG00000123009 0 556 520 0 566 0 ENSG00000133216 0 808 762 0 730 0 sample.0 sample.1 ENSG00000074855 1166 0 ENSG00000076351 698 0 ENSG00000105552 764 0 ENSG00000106991 1733 0 ENSG00000123009 980 0 ENSG00000133216 1304 0 I see that these genes have all zero counts for TreatmentB and this explains the low logFC, but I don't get this when I don't use GLM at all (exactTest). So in summary, my questions are: 1) Does the ordering of rows in the design matrix have to be the same as the column order in the count matrix, or the sample name is taken into account so the order should not matter? 2) Do you have any idea why I don't get pseudocounts after estimating the dispersion, and is it really this that causes such low logFC? I am looking forward to hearing from you. Many thanks in advance, Ale -- output of sessionInfo(): R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] splines stats graphics grDevices utils datasets methods [8] base other attached packages: [1] plyr_1.7.1 optparse_1.0.1 ggplot2_0.9.3.1 reshape2_1.2.1 [5] edgeR_3.0.8 limma_3.14.4 loaded via a namespace (and not attached): [1] MASS_7.3-17 RColorBrewer_1.0-5 colorspace_1.1-1 dichromat_1.2-4 [5] digest_0.5.2 getopt_1.19.1 grid_2.15.0 gtable_0.1.2 [9] labeling_0.1 munsell_0.3 proto_0.3-9.2 scales_0.2.3 [13] stringr_0.6 tools_2.15.0 -- Sent via the guest posting facility at bioconductor.org.

edgeR edgeR • 3.1k views

ADD COMMENT • link updated 11.1 years ago by Ryan C. Thompson ★ 7.9k • written 11.1 years ago by Guest User ★ 13k

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

Hi, See this previous discussion for a discussion of the pseudocounts: https://stat.ethz.ch/pipermail/bioconductor/2013-March/051182.html Only the non-GLM method makes use of pseudocounts, so it's not important that they are missing from the GLM analysis. I believe the design rows and count columns *do* need to match up in the same order. To do a paired test, you can just use model.matrix(~0+Treatment+Patient) and proceed as you already have. The GLM method does give non-identical results, and in general it seems that people are assuming that the GLM method is better overall, though as always, proof is scarce when sequencing is so expensive. Hope this answers your questions. -Ryan Thompson On 03/12/2013 04:20 AM, Alessandra [guest] wrote: > Hi, > > I am trying to use edgeR to compute differential gene expression. > > I have a quite simple experimental design: > labExpId Patient sex Treatment > sample.4 1 F A > sample.3 2 M A > sample.5 2 M B > sample.7 3 F A > sample.0 4 M A > sample.8 3 F B > sample.1 4 M B > sample.2 1 F B > > > I am comparing the effect of treatment across all patients, and it is fine when I apply exactTest. Then I wanted to include also the Patient in my model and I try using GLM, but I found several problems. So, I tried to apply GLM only including the Treatment just to troubleshoot. I follow the instructions on page 22 from the user's guide revised in December 2012. > > My counts are stored in a matrix called m. > >> colnames(m) > [1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" sample.8" > [7] "sample.0" "sample.1" > >> design=model.matrix(~0+Treatment) >> design > TreatmentA TreatmentB > sample.4 1 0 > sample.3 1 0 > sample.5 0 1 > sample.7 1 0 > sample.0 1 0 > sample.8 0 1 > sample.1 0 1 > sample.2 0 1 > >> M = DGEList(counts=na.omit(m)) >> cpm.m <- cpm(M) >> M <- M[rowSums(cpm.m >1) >=1,] >> M <- calcNormFactors(M) >> M <- estimateGLMCommonDisp(M, design, verbose=T) >> names(M) > [1] "counts" "samples" "abundance" > [4] "logCPM" "common.dispersion" > > First thing I notice, I don't see pseudo.counts and other attributes I saw when calling estimateCommonDisp(). > >> M <- estimateGLMTrendedDisp(M, design) >> M <- estimateGLMTagwiseDisp(M, design) >> fit <- glmFit(M, design) >> lrt <- glmLRT(fit, contrast=c(-1,1)) >> res = topTags(lrt, n=nrow(lrt))$table >> head(res) > logFC logCPM LR PValue FDR > ENSG00000244115 15.512665 2.266789 35.30878 2.813602e-09 3.482958e-05 > ENSG00000256329 -17.760869 4.514903 32.89653 9.719681e-09 6.015996e-05 > ENSG00000223638 11.777617 -1.467638 18.46673 1.728967e-05 7.134295e-02 > ENSG00000134321 -5.328058 5.959204 16.76318 4.234703e-05 1.310535e-01 > ENSG00000196861 7.776972 -1.722111 15.83143 6.924259e-05 1.494412e-01 > ENSG00000229292 11.368590 -1.876601 15.74621 7.243290e-05 1.494412e-01 > >> m[rownames(head(res)),] > sample.2 sample.3 sample.4 sample.5 sample.7 sample.8 > ENSG00000244115 0 0 0 0 0 6412 > ENSG00000256329 0 0 0 32070 0 0 > ENSG00000223638 0 0 2 0 0 0 > ENSG00000134321 1178 590 890 83130 754 1116 > ENSG00000196861 0 3 363 0 0 53 > ENSG00000229292 0 0 0 0 0 0 > > sample.0 sample.1 > ENSG00000244115 0 4415 > ENSG00000256329 0 0 > ENSG00000223638 434 550 > ENSG00000134321 852 1092 > ENSG00000196861 500 57 > ENSG00000229292 344 406 > > I realized that it may be a problem of the ordering of the design matrix rows > >> colnames(m) > [1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" "sample.8" > [7] "sample.0" "sample.1" >> rownames(design) > [1] "sample.4" "sample.3" "sample.5" "sample.7" "sample.0" "sample.8" > [7] "sample.1" "sample.2" > > so I redo it forcing the row names of my design matrix to be the same order as the column names in the matrix count. > Now my design matrix looks like: > >> design > TreatmentA TreatmentB > sample.2 0 1 > sample.3 1 0 > sample.4 1 0 > sample.5 0 1 > sample.7 1 0 > sample.8 0 1 > sample.0 1 0 > sample.1 0 1 > > I execute exactly the same commands as before. > Again I cannot see the pseudocounts. >> names(M) > [1] "counts" "samples" "abundance" > [4] "logCPM" "common.dispersion" > >> head(res) > logFC logCPM LR PValue FDR > ENSG00000074855 -32.17749 0.2759965 2121.104 0 0 > ENSG00000076351 -32.17749 -0.4306713 1549.345 0 0 > ENSG00000105552 -32.17749 -0.4864164 1502.658 0 0 > ENSG00000106991 -32.17749 0.4622641 2144.947 0 0 > ENSG00000123009 -32.17749 -0.2422835 1719.625 0 0 > ENSG00000133216 -32.17749 0.2206127 2127.575 0 0 > > I get very different results still from the exactTest with no GLM. > I attribute this to the missing pseudocounts in the DGEList object? > >> m[rownames(head(res)),] > sample.2 sample.3 sample.4 sample.5 sample.7 sample.8 > ENSG00000074855 0 948 902 0 676 0 > ENSG00000076351 0 494 522 0 538 0 > ENSG00000105552 0 490 454 0 486 0 > ENSG00000106991 0 1094 798 0 698 0 > ENSG00000123009 0 556 520 0 566 0 > ENSG00000133216 0 808 762 0 730 0 > sample.0 sample.1 > ENSG00000074855 1166 0 > ENSG00000076351 698 0 > ENSG00000105552 764 0 > ENSG00000106991 1733 0 > ENSG00000123009 980 0 > ENSG00000133216 1304 0 > > I see that these genes have all zero counts for TreatmentB and this explains the low logFC, but I don't get this when I don't use GLM at all (exactTest). > > So in summary, my questions are: > > 1) Does the ordering of rows in the design matrix have to be the same as the column order in the count matrix, > or the sample name is taken into account so the order should not matter? > > 2) Do you have any idea why I don't get pseudocounts after estimating the dispersion, and is it really this that causes such low logFC? > > I am looking forward to hearing from you. > > Many thanks in advance, > Ale > > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] splines stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] plyr_1.7.1 optparse_1.0.1 ggplot2_0.9.3.1 reshape2_1.2.1 > [5] edgeR_3.0.8 limma_3.14.4 > > loaded via a namespace (and not attached): > [1] MASS_7.3-17 RColorBrewer_1.0-5 colorspace_1.1-1 dichromat_1.2-4 > [5] digest_0.5.2 getopt_1.19.1 grid_2.15.0 gtable_0.1.2 > [9] labeling_0.1 munsell_0.3 proto_0.3-9.2 scales_0.2.3 > [13] stringr_0.6 tools_2.15.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.1 years ago Ryan C. Thompson ★ 7.9k

Login before adding your answer.