Question

DESeq2 multiple factors nested design

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Dear all, I have data collected from an experiment looking at expression differences related to diet and species. I have samples from six species, and for each species I have five paired clones, each pair exposed to one of two diets. The clones are genetically identical but were raised independently (see below for data structure ??? first three species only). species diet clone sp1 A c1 sp1 B c1 sp1 A c2 sp1 B c2 sp1 A c3 sp1 B c3 sp1 A c4 sp1 B c4 sp1 A c5 sp1 B c5 sp2 A c6 sp2 B c6 sp2 A c7 sp2 B c7 sp2 A c8 sp2 B c8 sp2 A c9 sp2 B c9 sp2 A c10 sp2 B c10 sp3 A c11 sp3 B c11 sp3 A c12 sp3 B c12 sp3 A c13 sp3 B c13 sp3 A c14 sp3 B c14 sp3 A c15 sp3 B c15 I am interested in genes differentially expressed between diet A and diet B, in genes differentially expressed between species, and in particular I am interested in identifying genes where differential expression between diets varies in relation to species (i.e. the diet- species interaction). My problem is in finding a way of accounting for clone within the model in DESeq. In an ideal world I would like to incorporate ???clone??? as a factor in the model, to account for differences between samples resulting from ???clone???, so in the first instance I???d like to compare the following: ~ clone + species + diet + species:diet ~ clone + species + diet However, I think because ???clone??? is nested within ???species??? so any given level of clone is only present within one biotype, simply putting clone into the design formula of DESeq2 does not work, any design including both clone and species, for example: dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData,design = ~ clone+species+diet) returns the error ???invalid class ???DESeqDataSet??? object: the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others???. Am I interpreting this error correctly? And if so, is there a way I could stipulate that clone is nested within species in DESeq? Many thanks for your help. -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] Biobase_2.22.0 DESeq2_1.2.10 RcppArmadillo_0.4.100.2.1 Rcpp_0.11.1 [5] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.7 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] annotate_1.40.1 AnnotationDbi_1.24.0 DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 [6] lattice_0.20-27 locfit_1.5-9.1 RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.0.2 [11] stats4_3.0.2 survival_2.37-7 XML_3.95-0.2 xtable_1.7-3 -- Sent via the guest posting facility at bioconductor.org.

DESeq DESeq2 DESeq DESeq2 • 5.6k views

ADD COMMENT • link updated 7.3 years ago by blawney • 0 • written 10.1 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2014-03-27

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 10 minutes ago

United States

hi Isobel, On Thu, Mar 27, 2014 at 1:41 PM, Isobel [guest] <guest at="" bioconductor.org=""> wrote: > My problem is in finding a way of accounting for clone within the model in DESeq. In an ideal world I would like to incorporate ?clone? as a factor in the model, to account for differences between samples resulting from ?clone?, so in the first instance I?d like to compare the following: > > ~ clone + species + diet + species:diet > ~ clone + species + diet > > However, I think because ?clone? is nested within ?species? so any given level of clone is only present within one biotype, simply putting clone into the design formula of DESeq2 does not work, Note that, because the clones are entirely within a single species, the species effect is simply a linear combination of the clone effects. I'll show this with an example using normal data and a linear model: R> species <- factor(rep(1:2,each=4)) R> clone <- factor(rep(1:4,each=2)) R> cbind(species, clone) species clone [1,] 1 1 [2,] 1 1 [3,] 1 2 [4,] 1 2 [5,] 2 3 [6,] 2 3 [7,] 2 4 [8,] 2 4 R> set.seed(1) R> y <- rnorm(8) R> coef(lm(y ~ species)) (Intercept) species2 0.07921043 0.10448786 The species effect is 0.10448786, which we get by only including species. If we run the model with only clone: R> res <- coef(lm(y ~ clone)) R> names(res) [1] "(Intercept)" "clone2" "clone3" "clone4" The species effect is the average of clone 3 and clone 4 effect minus the average of clone 2 and clone 1 effect (clone 1 is absorbed by the intercept in this case, so 0): R> unname(.5*res["clone3"] + .5*res["clone4"] - (.5*res["clone2"])) [1] 0.1044879 > I am interested in genes differentially expressed between diet A and diet B > in genes differentially expressed between species, > and in particular I am interested in identifying genes where differential expression between diets varies in relation to species (i.e. the diet-species interaction). So then, getting back to how to accomplish these comparisons in DESeq2. I'd like to give advice using DESeq2 version >= 1.3, as this is simpler in the new version, and it will be released very soon (on April 14). If you can try out this code with the development branch, that would be great, if not, just email me and I can formulate this for version 1.2. For DESeq2 version >= 1.3, I would specify: design(dds) <- ~ clone + diet + clone:diet Then the following comparisons: comparison of diet B over diet A: results(dds, contrast=c("diet","B","A")) comparison of species 2 over species 1, this is the average of the clone effects for each species, so: ctrst <- numeric(resultsNames(dds)) sp2clones <- paste0("clone",dds$clone[dds$species == "2"]) sp1clones <- paste0("clone",dds$clone[dds$species == "1"]) ctrst[sp2clones] <- 1/5 ctrst[sp1clones] <- -1/5 results(dds, contast=ctrst) for the interaction between species 1 and diet, we pull out the interaction effects and average for each clone: ctrst <- numeric(resultsNames(dds)) sp1dietA <- paste0("clone",dds$clone[dds$species == "1"],".dietA") sp1dietB <- paste0("clone",dds$clone[dds$species == "1"],".dietB") ctrst[sp1dietB] <- 1/5 ctrst[sp1dietA] <- -1/5 results(dds, contast=ctrst) Mike

ADD COMMENT • link 10.1 years ago Michael Love 41k

0

Entering edit mode

sorry, the last two comparisons should instead start with: ctrst <- numeric(length(resultsNames(dds))) names(ctrst) <- resultsNames(dds) I will try to work up a simpler interface for these comparisons. Mike On Thu, Mar 27, 2014 at 8:10 PM, Michael Love <michaelisaiahlove at="" gmail.com=""> wrote: > hi Isobel, > > On Thu, Mar 27, 2014 at 1:41 PM, Isobel [guest] <guest at="" bioconductor.org=""> wrote: >> My problem is in finding a way of accounting for clone within the model in DESeq. In an ideal world I would like to incorporate ?clone? as a factor in the model, to account for differences between samples resulting from ?clone?, so in the first instance I?d like to compare the following: >> >> ~ clone + species + diet + species:diet >> ~ clone + species + diet >> >> However, I think because ?clone? is nested within ?species? so any given level of clone is only present within one biotype, simply putting clone into the design formula of DESeq2 does not work, > > > Note that, because the clones are entirely within a single species, > the species effect is simply a linear combination of the clone > effects. I'll show this with an example using normal data and a linear > model: > > R> species <- factor(rep(1:2,each=4)) > R> clone <- factor(rep(1:4,each=2)) > R> cbind(species, clone) > species clone > [1,] 1 1 > [2,] 1 1 > [3,] 1 2 > [4,] 1 2 > [5,] 2 3 > [6,] 2 3 > [7,] 2 4 > [8,] 2 4 > > R> set.seed(1) > R> y <- rnorm(8) > R> coef(lm(y ~ species)) > (Intercept) species2 > 0.07921043 0.10448786 > > The species effect is 0.10448786, which we get by only including species. > > If we run the model with only clone: > > R> res <- coef(lm(y ~ clone)) > R> names(res) > [1] "(Intercept)" "clone2" "clone3" "clone4" > > The species effect is the average of clone 3 and clone 4 effect minus > the average of clone 2 and clone 1 effect (clone 1 is absorbed by the > intercept in this case, so 0): > > R> unname(.5*res["clone3"] + .5*res["clone4"] - (.5*res["clone2"])) > [1] 0.1044879 > > > >> I am interested in genes differentially expressed between diet A and diet B >> in genes differentially expressed between species, >> and in particular I am interested in identifying genes where differential expression between diets varies in relation to species (i.e. the diet-species interaction). > > So then, getting back to how to accomplish these comparisons in > DESeq2. I'd like to give advice using DESeq2 version >= 1.3, as this > is simpler in the new version, and it will be released very soon (on > April 14). If you can try out this code with the development branch, > that would be great, if not, just email me and I can formulate this > for version 1.2. > > For DESeq2 version >= 1.3, I would specify: > > design(dds) <- ~ clone + diet + clone:diet > > Then the following comparisons: > > comparison of diet B over diet A: > > results(dds, contrast=c("diet","B","A")) > > comparison of species 2 over species 1, this is the average of the > clone effects for each species, so: > > ctrst <- numeric(resultsNames(dds)) > sp2clones <- paste0("clone",dds$clone[dds$species == "2"]) > sp1clones <- paste0("clone",dds$clone[dds$species == "1"]) > ctrst[sp2clones] <- 1/5 > ctrst[sp1clones] <- -1/5 > results(dds, contast=ctrst) > > for the interaction between species 1 and diet, we pull out the > interaction effects and average for each clone: > > ctrst <- numeric(resultsNames(dds)) > sp1dietA <- paste0("clone",dds$clone[dds$species == "1"],".dietA") > sp1dietB <- paste0("clone",dds$clone[dds$species == "1"],".dietB") > ctrst[sp1dietB] <- 1/5 > ctrst[sp1dietA] <- -1/5 > results(dds, contast=ctrst) > > > Mike

ADD REPLY • link 10.1 years ago Michael Love 41k

0

Entering edit mode

I was making the design unnecessarily complicated, all you need is: design(dds) <- ~ clone + diet + species:diet and then for the last comparison in your list, the interaction if the diet effect is different in species 1, is extracted with: results(dds, contrast=list("species1.dietB", "species1.dietA")) Mike On Thu, Mar 27, 2014 at 8:55 PM, Michael Love <michaelisaiahlove at="" gmail.com=""> wrote: > sorry, the last two comparisons should instead start with: > > ctrst <- numeric(length(resultsNames(dds))) > names(ctrst) <- resultsNames(dds) > > I will try to work up a simpler interface for these comparisons. > > Mike > > On Thu, Mar 27, 2014 at 8:10 PM, Michael Love > <michaelisaiahlove at="" gmail.com=""> wrote: >> hi Isobel, >> >> On Thu, Mar 27, 2014 at 1:41 PM, Isobel [guest] <guest at="" bioconductor.org=""> wrote: >>> My problem is in finding a way of accounting for clone within the model in DESeq. In an ideal world I would like to incorporate ?clone? as a factor in the model, to account for differences between samples resulting from ?clone?, so in the first instance I?d like to compare the following: >>> >>> ~ clone + species + diet + species:diet >>> ~ clone + species + diet >>> >>> However, I think because ?clone? is nested within ?species? so any given level of clone is only present within one biotype, simply putting clone into the design formula of DESeq2 does not work, >> >> >> Note that, because the clones are entirely within a single species, >> the species effect is simply a linear combination of the clone >> effects. I'll show this with an example using normal data and a linear >> model: >> >> R> species <- factor(rep(1:2,each=4)) >> R> clone <- factor(rep(1:4,each=2)) >> R> cbind(species, clone) >> species clone >> [1,] 1 1 >> [2,] 1 1 >> [3,] 1 2 >> [4,] 1 2 >> [5,] 2 3 >> [6,] 2 3 >> [7,] 2 4 >> [8,] 2 4 >> >> R> set.seed(1) >> R> y <- rnorm(8) >> R> coef(lm(y ~ species)) >> (Intercept) species2 >> 0.07921043 0.10448786 >> >> The species effect is 0.10448786, which we get by only including species. >> >> If we run the model with only clone: >> >> R> res <- coef(lm(y ~ clone)) >> R> names(res) >> [1] "(Intercept)" "clone2" "clone3" "clone4" >> >> The species effect is the average of clone 3 and clone 4 effect minus >> the average of clone 2 and clone 1 effect (clone 1 is absorbed by the >> intercept in this case, so 0): >> >> R> unname(.5*res["clone3"] + .5*res["clone4"] - (.5*res["clone2"])) >> [1] 0.1044879 >> >> >> >>> I am interested in genes differentially expressed between diet A and diet B >>> in genes differentially expressed between species, >>> and in particular I am interested in identifying genes where differential expression between diets varies in relation to species (i.e. the diet-species interaction). >> >> So then, getting back to how to accomplish these comparisons in >> DESeq2. I'd like to give advice using DESeq2 version >= 1.3, as this >> is simpler in the new version, and it will be released very soon (on >> April 14). If you can try out this code with the development branch, >> that would be great, if not, just email me and I can formulate this >> for version 1.2. >> >> For DESeq2 version >= 1.3, I would specify: >> >> design(dds) <- ~ clone + diet + clone:diet >> >> Then the following comparisons: >> >> comparison of diet B over diet A: >> >> results(dds, contrast=c("diet","B","A")) >> >> comparison of species 2 over species 1, this is the average of the >> clone effects for each species, so: >> >> ctrst <- numeric(resultsNames(dds)) >> sp2clones <- paste0("clone",dds$clone[dds$species == "2"]) >> sp1clones <- paste0("clone",dds$clone[dds$species == "1"]) >> ctrst[sp2clones] <- 1/5 >> ctrst[sp1clones] <- -1/5 >> results(dds, contast=ctrst) >> >> for the interaction between species 1 and diet, we pull out the >> interaction effects and average for each clone: >> >> ctrst <- numeric(resultsNames(dds)) >> sp1dietA <- paste0("clone",dds$clone[dds$species == "1"],".dietA") >> sp1dietB <- paste0("clone",dds$clone[dds$species == "1"],".dietB") >> ctrst[sp1dietB] <- 1/5 >> ctrst[sp1dietA] <- -1/5 >> results(dds, contast=ctrst) >> >> >> Mike

ADD REPLY • link 10.1 years ago Michael Love 41k

0

Entering edit mode

Hi Mike

I am also doing a nested experiment where I have 4 groups where 30 populations are nested, each group don't have equal number of population. And I have 2 sites for all 4 groups. My interest is to get populations and site interaction but I want to consider I have a fixed effect of groups.While I'm trying to run this:

dds <- DESeqDataSetFromMatrix(countData = counts, colData =
info,design = ~ POP+SITE)

design(dds)<-~ POP+SITE+GROUP:SITE
dds <- DESeq(dds,parallel = T)

I'm getting the following error:

estimating size factors
estimating dispersions
gene-wise dispersion estimates: 4 workers
Error: BiocParallel errors
  element index: 1, 2, 3, 4
  first error: the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')

I'm using DESeq2 version 1.16.1. I was looking in the help of DESeq2 and I realize I'm getting this error because my populations are nested in groups. So can you suggest me a design where I can get the interaction of population with sites but I can consider groups has a fixed effect?

Best,

Taslima

ADD REPLY • link 5.9 years ago taslima • 0

score 0 · Answer 2 · 2016-12-29

0

Entering edit mode

blawney • 0

@blawney-12090

Last seen 4.0 years ago

Resurrecting this thread a bit--

I'm trying to make sense of a similar design I'm working on and I figured I would use the design above as a guide for my understanding. To that end, I created some mock data for a gene (creating the desired main/interaction effects) and appended it to a count matrix created via DESeq2::makeExampleDESeqDataSet

Using a basic linear model I get my expected results (not surprising given that samples were perturbed with normal distributions). However, when I try to recover this by using DESeq2, I'm not getting reasonable results (or least it seems that way). This could be my own error, like misinterpreting the terms in the fit. However, it doesn't seem like the NB model, etc. should be THAT big of a difference, so maybe it's due to other factors like the estimated mean-variance relationship. I'm curious about how to best diagnose this and understand the discrepancy.

I documented this all in an iPython notebook at https://github.com/blawney/deseq2_model/blob/master/clone_example.ipynb . Sorry for the python, but that section is not really important for the question. It's a bit pedantic since I was trying to think through it exhaustively, but I think the figures are quite evident. Rscripts and data files are also in that same repository.

Thanks for any comments, help, insight you can provide! I'm also hoping this is more generally helpful to the community.

ADD COMMENT • link 7.3 years ago blawney • 0

0

Entering edit mode

I only gave your notebook a quick scan, but if you simulated all of the genes to have the same pattern, it will be removed by size factor normalization. DESeq2 is performing normalization while sm.ols is not. You must correct for sequencing depth differences using size factors when it comes to real data, but if you want to set this part aside for your simulation you can set sizeFactors(dds) <- rep(1, ncol(dds)) before you run DESeq() and it will not perform size factor normalization.

ADD REPLY • link 7.3 years ago Michael Love 41k

0

Entering edit mode

Thanks for responding! I understand it's a bit of a dive-in with the notebook.

I initially left out any size-factor correction since DESeq2::makeExampleDESeqDataSet was creating matrices with size factors all pretty close to one based on the default argument to that method. I tried again with your suggestion and it made virtually no difference in the final regressions using the DESeq coefficients. I did uncover a plotting error when using the DESeq2-estimated coefficients so the regression lines now look indistinguishable from the OLS lines.

If I simulate the effect sizes even larger, I eventually get DESeq2 to give me significant coefficients like the OLS; however, it still seems like the NB glm should be able to "find" the effect size I initially simulated. For example, my absolute "species" effect size is ~10 (based on the OLS coefficient and visual inspection) and baseMean (for my simulated gene, from the DESeq2::results method) is 15.25 giving dispersion ~0.36 (assuming my one gene did not significantly affect the estimated mean-dispersion relationship which was modeled as 4/x+0.1 in DESeq2::makeExampleDESeqDataSet ). Perhaps I'm too optimistic and the noise is still overwhelming the signal?

ADD REPLY • link 7.3 years ago blawney • 0