Hello,
I have a multi-factor design RNA-seq experiment I am analyzing with DESeq2. 2 strains, 2 treatments, 3 tissues. Here is the colData:
treatment | strain | tissue | |
1 | control | A | kidney |
2 | control | A | kidney |
3 | control | A | kidney |
4 | control | A | kidney |
5 | control | A | kidney |
6 | control | A | kidney |
7 | control | A | kidney |
8 | control | B | kidney |
9 | control | B | kidney |
10 | control | B | kidney |
11 | control | B | kidney |
12 | control | B | kidney |
13 | treated | A | kidney |
14 | treated | A | kidney |
15 | treated | A | kidney |
16 | treated | A | kidney |
17 | treated | A | kidney |
18 | treated | A | kidney |
19 | treated | B | kidney |
20 | treated | B | kidney |
21 | treated | B | kidney |
22 | treated | B | kidney |
23 | treated | B | kidney |
24 | treated | B | kidney |
25 | control | A | liver |
26 | control | A | liver |
27 | control | A | liver |
28 | control | A | liver |
29 | control | A | liver |
30 | control | A | liver |
31 | control | B | liver |
32 | control | B | liver |
33 | control | B | liver |
34 | control | B | liver |
35 | control | B | liver |
36 | treated | A | liver |
37 | treated | A | liver |
38 | treated | A | liver |
39 | treated | A | liver |
40 | treated | A | liver |
41 | treated | A | liver |
42 | treated | B | liver |
43 | treated | B | liver |
44 | treated | B | liver |
45 | treated | B | liver |
46 | treated | B | liver |
47 | treated | B | liver |
48 | control | A | lung |
49 | control | A | lung |
50 | control | A | lung |
51 | control | A | lung |
52 | control | A | lung |
53 | control | A | lung |
54 | control | A | lung |
55 | control | B | lung |
56 | control | B | lung |
57 | control | B | lung |
58 | control | B | lung |
59 | control | B | lung |
60 | treated | A | lung |
61 | treated | A | lung |
62 | treated | A | lung |
63 | treated | A | lung |
64 | treated | A | lung |
65 | treated | A | lung |
66 | treated | B | lung |
67 | treated | B | lung |
68 | treated | B | lung |
69 | treated | B | lung |
70 | treated | B | lung |
71 | treated | B | lung |
I would like to do nested comparisons, but I am running into complications because I do not have paired samples.
I would like to ask:
1.) what is the treatment effect for each tissue within each strain?
2.) what is the difference in the treatment effect for each tissue within the same strain?
3.) what is the difference in the treatment effect for the same tissue, but between the two strains?
First, I tried this approach:
colData$strain.nested = factor(paste(colData$strain,colData$tissue, sep="")) dds = DESeqDataSetFromMatrix(countData = countData,colData = colData, design = ~ treatment + strain.nested:treatment + tissue:treatment) dds = DESeq(dds, modelMatrixType="standard") results(dds, name="strain.nestedAkidney.treatmenttreated") results(dds, contrast=list("strain.nestedAkidney.treatmenttreated", "strain.nestedBkidney.treatmenttreated")
But, I get the following error message after the modelMatrixType call:
"Error in estimateDispersionsGeneEst(object, maxit = maxit, quiet = quiet) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed"
Next, I tried to build a custom model matrix to solve this issue, using the bnbinomLRT function, as described in a previous post C: DESeq2 paired and multi factor comparison
dds = DESeqDataSetFromMatrix(countData = countData,colData = colData, design = ~treatment) dds$treatment = relevel(dds$treatment, "control") mm1 <- model.matrix(~ treatment + treatment:strain + treatment:tissue, colData) idx <- which(colSums(mm1 == 0) == nrow(mm1)) mm1 <- mm1[,-idx] mm0 <- model.matrix(~ treatment + treatment:strain, colData) design(dds) <- ~ treatment + treatment:strain + treatment:tissue dds <- estimateSizeFactors(dds) dds <- estimateDispersionsGeneEst(dds, modelMatrix=mm1) dds <- estimateDispersionsFit(dds) dds <- estimateDispersionsMAP(dds, modelMatrix=mm1) dds <- nbinomLRT(dds, full=mm1, reduced=mm0)
However, I appear to have too many columns with missing data after columns with all 0 are removed, and I receive the following error message after the estimateDispersionsGeneEst(dds, modelMatrix=mm1) call:
"Error in solve.default(qr.R(qrx)) : 'a' (1 x 0) must be square"
I have also considered creating a new column in my colData that describes a group according to their treatment/tissue/strain combination and completely simplifying the design to ~group. However, if I do that, I am not sure that contrast statements between each group would be a correct comparison to answer my questions.
Any advice on how to appropriately set up this experiment would be greatly appreciated!
My apologies if there a post out there that discusses this, and if so, please point me in that direction.
> sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq2_1.6.3 RcppArmadillo_0.5.100.1.0 Rcpp_0.11.6 [4] GenomicRanges_1.18.4 GenomeInfoDb_1.2.5 IRanges_2.0.1 [7] S4Vectors_0.4.0 BiocGenerics_0.12.1 loaded via a namespace (and not attached): [1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.2 base64enc_0.1-3 [5] BatchJobs_1.6 BBmisc_1.9 Biobase_2.26.0 BiocParallel_1.0.3 [9] brew_1.0-6 checkmate_1.6.3 cluster_2.0.3 codetools_0.2-14 [13] colorspace_1.2-6 DBI_0.3.1 digest_0.6.9 fail_1.3 [17] foreach_1.4.3 foreign_0.8-66 Formula_1.2-1 genefilter_1.48.1 [21] geneplotter_1.44.0 ggplot2_2.0.0 grid_3.1.1 gridExtra_2.0.0 [25] gtable_0.1.2 Hmisc_3.17-1 iterators_1.0.8 lattice_0.20-33 [29] latticeExtra_0.6-26 locfit_1.5-9.1 magrittr_1.5 munsell_0.4.2 [33] nnet_7.3-11 plyr_1.8.3 RColorBrewer_1.1-2 rpart_4.1-10 [37] RSQLite_1.0.0 scales_0.3.0 sendmailR_1.2-1 splines_3.1.1 [41] stringi_1.0-1 stringr_1.0.0 survival_2.38-3 tools_3.1.1 [45] XML_3.98-1.3 xtable_1.8-0 XVector_0.6.0
Hi Michael, thank you very much for your response and for clarifying my misunderstanding of the usage of nested approaches. I'm glad that the appropriate design is simpler than I anticipated!
I have a few follow-up questions:
Forgive my confusion, as I admit I am continuing to try to better understand the use of interaction terms, but I don't quite understand why
would yield results for the treatment effect in stain A. Would this not simply give me the overall treatment effect in the kidney, regardless of strain? Similarly, I would expect that
would tell me the difference in treatment effect between kidney and liver, regardless of the strain. To find this contrast for only strainB, is it correct to use the following:
?
Also, for my original question 3), "what is the difference in the treatment effect for the same tissue, but between the two strains?" I am not sure I described it appropriately. I want to find, for example, the difference in the treatment effect in the kidney of strainA vs. that of the kidney of strainB. This is where I get a bit hung up, because I want to pull out 3 attributes for each group for the comparison (e.g. the sample is a kidney, from strainB, and is treated). For this, I considered introducing another column into my colData to merge strain and tissue, subsequently including an additional interaction term, and then pull out contrasts, something like
Thanks again for your help. I will be updating my R and Bioconductor.
first I want to point out I made a mistake in my earlier code for question #2 above. I have fixed it. I previously wrote c() but it should be list() to compare two terms.
"I don't quite understand why ... would yield results for treatment effect in strain A"
This is the kidney treatment effect in strain A. Why strain A only? This you'll have to either see by looking at the resultsNames(dds) and figuring out what each coefficient is doing in the model, or by talking with someone who can help explain how interaction models work, or ideally both. To bypass DESeq() and immediately see what coefficients will result from a given design, you can make the model matrix yourself with:
You can then examine which samples (rows) are given which coefficients (columns). If you have further questions, you will have to ask someone locally with a background in statistics or linear models to explain it. It's much easier to explain with a whiteboard and face-to-face.
The difference between treatment in liver and kidney can be found in my answer to #2 above. This difference in the treatment effect across tissue is not different across strain with the given design formula. Similarly for finding the difference in the treatment effect across strain specifically in the kidney.
Unless you add second-order interaction terms to the model, you are assuming these differences are consistent across strain or across tissue respectively. Adding more terms makes the model more complex, and I personally would want to find evidence to justify second-order interactions before adding them. And then I really think you should get some local quantitative help with interpreting the model if you're not familiar with interaction models, as second-order interactions make things even more difficult to interpret. The coefficients will be the same as with a standard linear model in R, nothing is specific to DESeq2 here.
Hi Michael, viewing the model matrix with the coefficients certainly helped me to make sense of things. But yes, I could certainly benefit from discussing interaction models with a statistician. Which may be necessary too if second-order interaction terms are introduced; I do not want to assume that differences are consistent across strains. Quite the opposite, I expect the treatment effect to be different between these two strains. Originally, I had created two completely separate DESeq2 analyses, one for each strain, evaluating the treatment effect across the 3 different tissues. But, under the advice of including all data from a particular biological experiment together into DESeq2, I began to try to build the more complex model, which led to this post.
I will need to continue to ponder if the design formula we have discussed answers my questions, or if I need to expand it in some way.
Thanks again for your help!
Sounds reasonable. One comment:
"I do not want to assume that differences are consistent across strains. Quite the opposite, I expect the treatment effect to be different between these two strains."
This is just where it gets confusing. The model I proposed, the treatment effect is different across the two strains and across the three tissues. For a given strain x tissue pair, the total effect is additive in the log count scale.
The second-order interaction terms are for when you believe that the differences in treatment effect which you see across tissue may themselves be different across strain. Or another way to say this: that the first-order interaction terms for strain:treatment and tissue:treatment are not additive in the log count scale for strain x tissue pairs.