Question

Gene expression analysis with edgeR with a large, nested design matrix

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Dear Uli, edgeR is probably the fastest of the glm negative binomial packages, as we have done a lot of work moving all the fitting code to the C++ level. Nevertheless, the ususal glm pipeline will become very slow when there are over 3000 samples and the design matrix has 1500 columns. Here are two other possibilities. First you could switch to voom instead. I ran voom on your simulated data example in a few minutes on my laptop computer: y <- DGEList(counts = reads) y <- calcNormFactors(y) v <- voom(y,expdesign,plot=TRUE) fit <- lmFit(v,expdesign) fit <- eBayes(fit) topTable(fit, coef=5) and so on. Alternatively, you would stick to edgeR and use the quasi-likelihood pipeline: y <- DGEList(counts = reads) y <- calcNormFactors(y) y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson") ql <- glmQLFTest(y,expdesign) topTags(ql,coef=5) etc. The glmQLFTest() function will take more time than voom but less time than estimateGLMCcommonDispersion on your data. The common dispersion step is optional in this pipeline -- I include it because you've already done it. Either of these approaches have good statistical motivations. See the help pages for voom and glmQLFTest for references to published papers describing them. Best wishes Gordon > Date: Fri, 23 May 2014 09:06:21 -0700 (PDT) > From: "Uli Braunschweig [guest]" <guest at="" bioconductor.org=""> > To: bioconductor at r-project.org, u.braunschweig at utoronto.ca > Subject: [BioC] Gene expression analysis with edgeR with a large, > nested design matrix > > Hi All, > > My problem is the following: > I have expression counts for 50 genes in an RNAi screen with 1,536 treatments (which includes positive and negative controls, so really 1,416 unique treatments) in two replicates, done in 96-well format (2x16 plates). I know that plate effects and edge effects (whether a well was located on the edge of a plate) are significant, so the design should include treatment, plate, and location (edge or interior). Locations are identical between replicates. Each plate has two negative controls ("siNT"), as well as other controls. > > I am only interested in the contrasts of each of the treatments vs. the "siNT" control. I thought that the model of edgeR would be useful to score significant hits while at the same time dealing with the mentioned technical biases in a meaningful manner. However, I've had to kill the analysis because estimating the trended and tag-wise dispersions takes excessively long. > > My question is: Is it even feasible to try to adress a problem with a design that has so few genes and so many treatments using edgeR (or DESeq2)? > > library(edgeR) > > ## Data look similar to this: > reads <- matrix(round(2000 * rexp(50 * 3072)), nrow=50) # dense matrix of 50 genes x (1536 treatments in duplicate) > > ## Here is how I create my design factors: > # (I've left out 'replicate' because it is a linear combination of plates 1-16 and 17-32) > rows <- rep(rep(1:8, each=12), 32) > cols <- rep(rep(1:12, 8), 32) > plate <- rep(1:32, each=96) > > type.nt <- rows == 1 & cols == 1 | # the negative control to compare everything to; 2 per plate > rows == 4 & cols == 9 > type.posCtl <- rows == 4 & cols == 3 | # positive control; 2 per plate > rows == 5 & cols == 9 > type.mock <- rows == 3 & cols == 3 | # another control; 2 per plate > rows == 8 & cols == 12 > type.empty <- plate %in% c(16, 32) & ( # another control; a bunch on only 2 plates > rows %in% c(2:3,6:8) & cols == 9 | > cols %in% c(10,11) | > rows %in% 1:7 & cols == 12 > ) > type.edge <- rows %in% c(1,8) | cols %in% c(1,12) # position on the plate > > treat <- rep(paste("T", 1:1536, sep=""), 2) # treatments > treat[type.nt] <- "siNT" > treat[type.mock] <- "mock" > treat[type.empty] <- "empty" > treat[type.posCtl] <- "siPosCtl" > treatfac <- relevel(as.factor(treat), ref="siNT") > > edgefac <- as.factor(ifelse(type.edge, yes="edge", no="interior")) > platefac <- as.factor(paste("P", plate, sep="")) > > expfact <- data.frame(treatment = treatfac, > platepos = edgefac, > plate = platefac > ) > > expdesign <- model.matrix(formula(~ treatment + plate + platepos), data=expfact) > > ## Estimating the dispersions > y <- DGEList(counts = reads) # reads is the 50x3072 matrix > y <- calcNormFactors(y) > y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson", verbose=TRUE) # faster than Cox-Reid and probably ok since there are many treatments > y <- estimateGLMTrendedDisp(y, design=expdesign) > y <- estimateGLMTagwiseDisp(y, design=expdesign) > ## Neither of the last two steps finish running in a day; same for estimateGLMCommonDisp() if method="CoxReid" > > I was then hoping to extract the contrasts of each treatment against the "siNT" control. > Would it make sense to combine the two technical factors, or subset the count and design matrices for each treatment in a way that reduces the number of treatments, and run them separately? Alternatively, I thought of doing the analysis separately for each treatment using the whole count matrix but amalgamating all other non-control treatments in an "other" group". This seems feasible when run in parallel, but it would be overkill... > Any suggestions on how to proceed? > > Kind regards, > Uli > > -- > Ulrich Braunschweig, PhD > > The Donnelly Centre > University of Toronto > 160 College Street, Room 1030 > Toronto, Ontario > Canada M5S 3E1 > > u.braunschweig at utoronto.ca > > -- output of sessionInfo(): > > R version 3.1.0 (2014-04-10) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_IE.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_IE.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=en_IE.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.6.2 limma_3.20.4 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

edgeR edgeR • 1.5k views

ADD COMMENT • link updated 9.9 years ago by Ulrich Braunschweig ▴ 10 • written 9.9 years ago by Gordon Smyth 50k

score 0 · Answer 1 · 2014-05-26

Dear Gordon, I tried both (on a reduced design matrix), with quite similar results. The quasi-likelihood approach seems a little more conservative at least with borderline significant genes, but since limma/voom is still orders of magnitude faster, that's what I will use. [I presume in the quasi-likelihood pipeline, the coefficients have to be specified in glmQLFTest() rather than topTags()] Thanks very much for your help! Uli On 24/05/14 20:57, Gordon K Smyth wrote: > Dear Uli, > > edgeR is probably the fastest of the glm negative binomial packages, > as we have done a lot of work moving all the fitting code to the C++ > level. Nevertheless, the ususal glm pipeline will become very slow > when there are over 3000 samples and the design matrix has 1500 columns. > > Here are two other possibilities. First you could switch to voom > instead. I ran voom on your simulated data example in a few minutes on > my laptop computer: > > y <- DGEList(counts = reads) > y <- calcNormFactors(y) > v <- voom(y,expdesign,plot=TRUE) > fit <- lmFit(v,expdesign) > fit <- eBayes(fit) > topTable(fit, coef=5) > > and so on. Alternatively, you would stick to edgeR and use the > quasi-likelihood pipeline: > > y <- DGEList(counts = reads) > y <- calcNormFactors(y) > y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson") > ql <- glmQLFTest(y,expdesign) > topTags(ql,coef=5) > > etc. The glmQLFTest() function will take more time than voom but less > time than estimateGLMCcommonDispersion on your data. The common > dispersion step is optional in this pipeline -- I include it because > you've already done it. > > Either of these approaches have good statistical motivations. See the > help pages for voom and glmQLFTest for references to published papers > describing them. > > Best wishes > Gordon > >> Date: Fri, 23 May 2014 09:06:21 -0700 (PDT) >> From: "Uli Braunschweig [guest]" <guest at="" bioconductor.org=""> >> To: bioconductor at r-project.org, u.braunschweig at utoronto.ca >> Subject: [BioC] Gene expression analysis with edgeR with a large, >> nested design matrix >> >> Hi All, >> >> My problem is the following: >> I have expression counts for 50 genes in an RNAi screen with 1,536 >> treatments (which includes positive and negative controls, so really >> 1,416 unique treatments) in two replicates, done in 96-well format >> (2x16 plates). I know that plate effects and edge effects (whether a >> well was located on the edge of a plate) are significant, so the >> design should include treatment, plate, and location (edge or >> interior). Locations are identical between replicates. Each plate has >> two negative controls ("siNT"), as well as other controls. >> >> I am only interested in the contrasts of each of the treatments vs. >> the "siNT" control. I thought that the model of edgeR would be useful >> to score significant hits while at the same time dealing with the >> mentioned technical biases in a meaningful manner. However, I've had >> to kill the analysis because estimating the trended and tag-wise >> dispersions takes excessively long. >> >> My question is: Is it even feasible to try to adress a problem with a >> design that has so few genes and so many treatments using edgeR (or >> DESeq2)? >> >> library(edgeR) >> >> ## Data look similar to this: >> reads <- matrix(round(2000 * rexp(50 * 3072)), nrow=50) # dense >> matrix of 50 genes x (1536 treatments in duplicate) >> >> ## Here is how I create my design factors: >> # (I've left out 'replicate' because it is a linear combination of >> plates 1-16 and 17-32) >> rows <- rep(rep(1:8, each=12), 32) >> cols <- rep(rep(1:12, 8), 32) >> plate <- rep(1:32, each=96) >> >> type.nt <- rows == 1 & cols == 1 | # the negative control to >> compare everything to; 2 per plate >> rows == 4 & cols == 9 >> type.posCtl <- rows == 4 & cols == 3 | # positive control; 2 per >> plate >> rows == 5 & cols == 9 >> type.mock <- rows == 3 & cols == 3 | # another control; 2 per plate >> rows == 8 & cols == 12 >> type.empty <- plate %in% c(16, 32) & ( # another control; a bunch >> on only 2 plates >> rows %in% c(2:3,6:8) & cols == 9 | >> cols %in% c(10,11) | >> rows %in% 1:7 & cols == 12 >> ) >> type.edge <- rows %in% c(1,8) | cols %in% c(1,12) # position on the >> plate >> >> treat <- rep(paste("T", 1:1536, sep=""), 2) # treatments >> treat[type.nt] <- "siNT" >> treat[type.mock] <- "mock" >> treat[type.empty] <- "empty" >> treat[type.posCtl] <- "siPosCtl" >> treatfac <- relevel(as.factor(treat), ref="siNT") >> >> edgefac <- as.factor(ifelse(type.edge, yes="edge", no="interior")) >> platefac <- as.factor(paste("P", plate, sep="")) >> >> expfact <- data.frame(treatment = treatfac, >> platepos = edgefac, >> plate = platefac >> ) >> >> expdesign <- model.matrix(formula(~ treatment + plate + platepos), >> data=expfact) >> >> ## Estimating the dispersions >> y <- DGEList(counts = reads) # reads is the 50x3072 matrix >> y <- calcNormFactors(y) >> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson", >> verbose=TRUE) # faster than Cox-Reid and probably ok since there are >> many treatments >> y <- estimateGLMTrendedDisp(y, design=expdesign) >> y <- estimateGLMTagwiseDisp(y, design=expdesign) >> ## Neither of the last two steps finish running in a day; same for >> estimateGLMCommonDisp() if method="CoxReid" >> >> I was then hoping to extract the contrasts of each treatment against >> the "siNT" control. >> Would it make sense to combine the two technical factors, or subset >> the count and design matrices for each treatment in a way that >> reduces the number of treatments, and run them separately? >> Alternatively, I thought of doing the analysis separately for each >> treatment using the whole count matrix but amalgamating all other >> non-control treatments in an "other" group". This seems feasible when >> run in parallel, but it would be overkill... >> Any suggestions on how to proceed? >> >> Kind regards, >> Uli >> >> -- >> Ulrich Braunschweig, PhD >> >> The Donnelly Centre >> University of Toronto >> 160 College Street, Room 1030 >> Toronto, Ontario >> Canada M5S 3E1 >> >> u.braunschweig at utoronto.ca >> >> -- output of sessionInfo(): >> >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_IE.UTF-8 LC_COLLATE=en_CA.UTF-8 >> [5] LC_MONETARY=en_IE.UTF-8 LC_MESSAGES=en_CA.UTF-8 >> [7] LC_PAPER=en_IE.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] edgeR_3.6.2 limma_3.20.4 > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}

score 0 · Answer 2 · 2014-05-27

On Mon, 26 May 2014, Ulrich Braunschweig wrote: > Dear Gordon, > > I tried both (on a reduced design matrix), with quite similar results. The > quasi-likelihood approach seems a little more conservative at least with > borderline significant genes, but since limma/voom is still orders of > magnitude faster, that's what I will use. > > [I presume in the quasi-likelihood pipeline, the coefficients have to be > specified in glmQLFTest() rather than topTags()] Yes, that's right. Gordon > Thanks very much for your help! > Uli > > On 24/05/14 20:57, Gordon K Smyth wrote: >> Dear Uli, >> >> edgeR is probably the fastest of the glm negative binomial packages, as we >> have done a lot of work moving all the fitting code to the C++ level. >> Nevertheless, the ususal glm pipeline will become very slow when there are >> over 3000 samples and the design matrix has 1500 columns. >> >> Here are two other possibilities. First you could switch to voom instead. >> I ran voom on your simulated data example in a few minutes on my laptop >> computer: >> >> y <- DGEList(counts = reads) >> y <- calcNormFactors(y) >> v <- voom(y,expdesign,plot=TRUE) >> fit <- lmFit(v,expdesign) >> fit <- eBayes(fit) >> topTable(fit, coef=5) >> >> and so on. Alternatively, you would stick to edgeR and use the >> quasi-likelihood pipeline: >> >> y <- DGEList(counts = reads) >> y <- calcNormFactors(y) >> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson") >> ql <- glmQLFTest(y,expdesign) >> topTags(ql,coef=5) >> >> etc. The glmQLFTest() function will take more time than voom but less time >> than estimateGLMCcommonDispersion on your data. The common dispersion step >> is optional in this pipeline -- I include it because you've already done >> it. >> >> Either of these approaches have good statistical motivations. See the help >> pages for voom and glmQLFTest for references to published papers describing >> them. >> >> Best wishes >> Gordon >> >>> Date: Fri, 23 May 2014 09:06:21 -0700 (PDT) >>> From: "Uli Braunschweig [guest]" <guest at="" bioconductor.org=""> >>> To: bioconductor at r-project.org, u.braunschweig at utoronto.ca >>> Subject: [BioC] Gene expression analysis with edgeR with a large, >>> nested design matrix >>> >>> Hi All, >>> >>> My problem is the following: >>> I have expression counts for 50 genes in an RNAi screen with 1,536 >>> treatments (which includes positive and negative controls, so really 1,416 >>> unique treatments) in two replicates, done in 96-well format (2x16 >>> plates). I know that plate effects and edge effects (whether a well was >>> located on the edge of a plate) are significant, so the design should >>> include treatment, plate, and location (edge or interior). Locations are >>> identical between replicates. Each plate has two negative controls >>> ("siNT"), as well as other controls. >>> >>> I am only interested in the contrasts of each of the treatments vs. the >>> "siNT" control. I thought that the model of edgeR would be useful to score >>> significant hits while at the same time dealing with the mentioned >>> technical biases in a meaningful manner. However, I've had to kill the >>> analysis because estimating the trended and tag-wise dispersions takes >>> excessively long. >>> >>> My question is: Is it even feasible to try to adress a problem with a >>> design that has so few genes and so many treatments using edgeR (or >>> DESeq2)? >>> >>> library(edgeR) >>> >>> ## Data look similar to this: >>> reads <- matrix(round(2000 * rexp(50 * 3072)), nrow=50) # dense matrix of >>> 50 genes x (1536 treatments in duplicate) >>> >>> ## Here is how I create my design factors: >>> # (I've left out 'replicate' because it is a linear combination of plates >>> 1-16 and 17-32) >>> rows <- rep(rep(1:8, each=12), 32) >>> cols <- rep(rep(1:12, 8), 32) >>> plate <- rep(1:32, each=96) >>> >>> type.nt <- rows == 1 & cols == 1 | # the negative control to >>> compare everything to; 2 per plate >>> rows == 4 & cols == 9 >>> type.posCtl <- rows == 4 & cols == 3 | # positive control; 2 per plate >>> rows == 5 & cols == 9 >>> type.mock <- rows == 3 & cols == 3 | # another control; 2 per plate >>> rows == 8 & cols == 12 >>> type.empty <- plate %in% c(16, 32) & ( # another control; a bunch on >>> only 2 plates >>> rows %in% c(2:3,6:8) & cols == 9 | >>> cols %in% c(10,11) | >>> rows %in% 1:7 & cols == 12 >>> ) >>> type.edge <- rows %in% c(1,8) | cols %in% c(1,12) # position on the plate >>> >>> treat <- rep(paste("T", 1:1536, sep=""), 2) # treatments >>> treat[type.nt] <- "siNT" >>> treat[type.mock] <- "mock" >>> treat[type.empty] <- "empty" >>> treat[type.posCtl] <- "siPosCtl" >>> treatfac <- relevel(as.factor(treat), ref="siNT") >>> >>> edgefac <- as.factor(ifelse(type.edge, yes="edge", no="interior")) >>> platefac <- as.factor(paste("P", plate, sep="")) >>> >>> expfact <- data.frame(treatment = treatfac, >>> platepos = edgefac, >>> plate = platefac >>> ) >>> >>> expdesign <- model.matrix(formula(~ treatment + plate + platepos), >>> data=expfact) >>> >>> ## Estimating the dispersions >>> y <- DGEList(counts = reads) # reads is the 50x3072 matrix >>> y <- calcNormFactors(y) >>> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson", >>> verbose=TRUE) # faster than Cox-Reid and probably ok since there are many >>> treatments >>> y <- estimateGLMTrendedDisp(y, design=expdesign) >>> y <- estimateGLMTagwiseDisp(y, design=expdesign) >>> ## Neither of the last two steps finish running in a day; same for >>> estimateGLMCommonDisp() if method="CoxReid" >>> >>> I was then hoping to extract the contrasts of each treatment against the >>> "siNT" control. >>> Would it make sense to combine the two technical factors, or subset the >>> count and design matrices for each treatment in a way that reduces the >>> number of treatments, and run them separately? Alternatively, I thought of >>> doing the analysis separately for each treatment using the whole count >>> matrix but amalgamating all other non-control treatments in an "other" >>> group". This seems feasible when run in parallel, but it would be >>> overkill... >>> Any suggestions on how to proceed? >>> >>> Kind regards, >>> Uli >>> >>> -- >>> Ulrich Braunschweig, PhD >>> >>> The Donnelly Centre >>> University of Toronto >>> 160 College Street, Room 1030 >>> Toronto, Ontario >>> Canada M5S 3E1 >>> >>> u.braunschweig at utoronto.ca >>> >>> -- output of sessionInfo(): >>> >>> R version 3.1.0 (2014-04-10) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_IE.UTF-8 LC_COLLATE=en_CA.UTF-8 >>> [5] LC_MONETARY=en_IE.UTF-8 LC_MESSAGES=en_CA.UTF-8 >>> [7] LC_PAPER=en_IE.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] edgeR_3.6.2 limma_3.20.4 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}