Multifactor nested design in DESeq2 with non-paired data
1
0
Entering edit mode
chap02 • 0
@chap02-9528
Last seen 6.6 years ago

Hello,

I have a multi-factor design RNA-seq experiment I am analyzing with DESeq2.  2 strains, 2 treatments, 3 tissues.  Here is the colData:

  treatment strain tissue
1 control A kidney
2 control A kidney
3 control A kidney
4 control A kidney
5 control A kidney
6 control A kidney
7 control A kidney
8 control B kidney
9 control B kidney
10 control B kidney
11 control B kidney
12 control B kidney
13 treated A kidney
14 treated A kidney
15 treated A kidney
16 treated A kidney
17 treated A kidney
18 treated A kidney
19 treated B kidney
20 treated B kidney
21 treated B kidney
22 treated B kidney
23 treated B kidney
24 treated B kidney
25 control A liver
26 control A liver
27 control A liver
28 control A liver
29 control A liver
30 control A liver
31 control B liver
32 control B liver
33 control B liver
34 control B liver
35 control B liver
36 treated A liver
37 treated A liver
38 treated A liver
39 treated A liver
40 treated A liver
41 treated A liver
42 treated B liver
43 treated B liver
44 treated B liver
45 treated B liver
46 treated B liver
47 treated B liver
48 control A lung
49 control A lung
50 control A lung
51 control A lung
52 control A lung
53 control A lung
54 control A lung
55 control B lung
56 control B lung
57 control B lung
58 control B lung
59 control B lung
60 treated A lung
61 treated A lung
62 treated A lung
63 treated A lung
64 treated A lung
65 treated A lung
66 treated B lung
67 treated B lung
68 treated B lung
69 treated B lung
70 treated B lung
71 treated B lung

I would like to do nested comparisons, but I am running into complications because I do not have paired samples.

I would like to ask:

1.) what is the treatment effect for each tissue within each strain?

2.) what is the difference in the treatment effect for each tissue within the same strain?

3.) what is the difference in the treatment effect for the same tissue, but between the two strains?

First, I tried this approach:

colData$strain.nested = factor(paste(colData$strain,colData$tissue, sep=""))
dds = DESeqDataSetFromMatrix(countData = countData,colData = colData, design = ~ treatment + strain.nested:treatment + tissue:treatment)
dds = DESeq(dds, modelMatrixType="standard")
results(dds, name="strain.nestedAkidney.treatmenttreated")
results(dds, contrast=list("strain.nestedAkidney.treatmenttreated", "strain.nestedBkidney.treatmenttreated")

But, I get the following error message after the modelMatrixType call:

"Error in estimateDispersionsGeneEst(object, maxit = maxit, quiet = quiet) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  one or more variables or interaction terms in the design formula
  are linear combinations of the others and must be removed"

Next, I tried to build a custom model matrix to solve this issue, using the bnbinomLRT function, as described in a previous post C: DESeq2 paired and multi factor comparison

dds = DESeqDataSetFromMatrix(countData = countData,colData = colData, design = ~treatment)
dds$treatment = relevel(dds$treatment, "control")
mm1 <- model.matrix(~ treatment + treatment:strain + treatment:tissue, colData)
idx <- which(colSums(mm1 == 0) == nrow(mm1))
mm1 <- mm1[,-idx]
mm0 <- model.matrix(~ treatment + treatment:strain, colData)

design(dds) <- ~ treatment + treatment:strain + treatment:tissue
dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds, modelMatrix=mm1)

dds <- estimateDispersionsFit(dds)
dds <- estimateDispersionsMAP(dds, modelMatrix=mm1)

dds <- nbinomLRT(dds, full=mm1, reduced=mm0)

However, I appear to have too many columns with missing data after columns with all 0 are removed, and I receive the following error message after the estimateDispersionsGeneEst(dds, modelMatrix=mm1) call:

"Error in solve.default(qr.R(qrx)) : 'a' (1 x 0) must be square"

I have also considered creating a new column in my colData that describes a group according to their treatment/tissue/strain combination and completely simplifying the design to ~group. However, if I do that, I am not sure that contrast statements between each group would be a correct comparison to answer my questions.

Any advice on how to appropriately set up this experiment would be greatly appreciated!

My apologies if there a post out there that discusses this, and if so, please point me in that direction.

 

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] DESeq2_1.6.3              RcppArmadillo_0.5.100.1.0 Rcpp_0.11.6              
[4] GenomicRanges_1.18.4      GenomeInfoDb_1.2.5        IRanges_2.0.1            
[7] S4Vectors_0.4.0           BiocGenerics_0.12.1      

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.2 base64enc_0.1-3     
 [5] BatchJobs_1.6        BBmisc_1.9           Biobase_2.26.0       BiocParallel_1.0.3  
 [9] brew_1.0-6           checkmate_1.6.3      cluster_2.0.3        codetools_0.2-14    
[13] colorspace_1.2-6     DBI_0.3.1            digest_0.6.9         fail_1.3            
[17] foreach_1.4.3        foreign_0.8-66       Formula_1.2-1        genefilter_1.48.1   
[21] geneplotter_1.44.0   ggplot2_2.0.0        grid_3.1.1           gridExtra_2.0.0     
[25] gtable_0.1.2         Hmisc_3.17-1         iterators_1.0.8      lattice_0.20-33     
[29] latticeExtra_0.6-26  locfit_1.5-9.1       magrittr_1.5         munsell_0.4.2       
[33] nnet_7.3-11          plyr_1.8.3           RColorBrewer_1.1-2   rpart_4.1-10        
[37] RSQLite_1.0.0        scales_0.3.0         sendmailR_1.2-1      splines_3.1.1       
[41] stringi_1.0-1        stringr_1.0.0        survival_2.38-3      tools_3.1.1         
[45] XML_3.98-1.3         xtable_1.8-0         XVector_0.6.0      
DESeq2 multiple factor design • 3.0k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 15 hours ago
United States

hi,

I appreciate the time you spent reading the vignette as well as related posts on the support site.

So, the 'nested' section in the vignette may have led you in a bit of a wrong direction. That recommendation is really for designs in which some levels of factors are present only within a level of another factor, for example, patient 1-3 are within control and patient 4-6 and within case, and suppose we want to look at treatment differences across case and control, while accounting for patient differences. 

In your case, you have a block design, where treatment and strain are repeated equally for all tissues. So this is even easier to approach.

First, you should upgrade to the latest version of R and Bioconductor, which will give you DESeq2 version 1.10. You can do this by installing the latest version of R, and then installing Bioconductor (http://bioconductor.org/install). It's best practice to stay with the current version of Bioconductor (which updates every April and October). DESeq2 v1.6 is about 1.5 years old (released October 2014), and so you're missing a number of bug fixes, simplified interface, and some new useful features.

It looks like you want to model a treatment effect for each tissue and a treatment effect for each strain. I'd recommend a design of ~ tissue + strain + tissue:treatment + strain:treatment. This will produce terms controlling tissue and strain overall (what we call main effects), and then specific treatment effect for each strain and for each tissue. Take a look at resultsNames(dds) to see what coefficients this design gives you. Note that there is a treatment effect for each tissue, but then a single interaction for treatment and strain B. You'll see how this works below:

For your questions:

1.) what is the treatment effect for each tissue within each strain?

This would be, for example the treatment effect for kidney in strain A:

results(dds, name="tissuekidney.treatmenttreated")

For the treatment effect in kidney in strain B, you add the interaction term which is the additional effect of treatment in strain B.

results(dds, contrast=list(c("tissuekidney.treatmenttreated","strainB.treatmenttreated")))

2.) what is the difference in the treatment effect for each tissue within the same strain?

This is the same for strain A and strain B. You can make pairwise comparisons between tissues like so:

results(dds, contrast=list("tissueliver.treatmenttreated","tissuekidney.treatmenttreated"))

3.) what is the difference in the treatment effect for the same tissue, but between the two strains?

Likewise, this is the same for the tissues, and is the single interaction term:

results(dds, name="strainB.treatmenttreated")
ADD COMMENT
0
Entering edit mode

Hi Michael, thank you very much for your response and for clarifying my misunderstanding of the usage of nested approaches.  I'm glad that the appropriate design is simpler than I anticipated!

I have a few follow-up questions:

Forgive my confusion, as I admit I am continuing to try to better understand the use of interaction terms, but I don't quite understand why

results(dds, name="tissuekidney.treatmenttreated")

would yield results for the treatment effect in stain A.  Would this not simply give me the overall treatment effect in the kidney, regardless of strain? Similarly, I would expect that

results(dds,contrast=list(c("tissueliver.treatmenttreated","tissuekidney.treatmenttreated")))

would tell me the difference in treatment effect between kidney and liver, regardless of the strain. To find this contrast for only strainB, is it correct to use the following:

results(dds, contrast=list(c("tissueliver.treatmentBD","tissuekidney.treatmentBD", "strainCAST.treatmentBD")))

?

Also, for my original question 3), "what is the difference in the treatment effect for the same tissue, but between the two strains?" I am not sure I described it appropriately. I want to find, for example, the difference in the treatment effect in the kidney of strainA vs. that of the kidney of strainB. This is where I get a bit hung up, because I want to pull out 3 attributes for each group for the comparison (e.g. the sample is a kidney, from strainB, and is treated). For this, I considered introducing another column into my colData to merge strain and tissue, subsequently including an additional interaction term, and then pull out contrasts, something like

results(dds,contrast=list(c("tissueliverstrainA.treatmenttreated","tissueliverstrainB.treatmenttreated")))

 

Thanks again for your help. I will be updating my R and Bioconductor.

ADD REPLY
0
Entering edit mode

first I want to point out I made a mistake in my earlier code for question #2 above. I have fixed it. I previously wrote c() but it should be list() to compare two terms.

"I don't quite understand why ... would yield results for treatment effect in strain A"

This is the kidney treatment effect in strain A. Why strain A only? This you'll have to either see by looking at the resultsNames(dds) and figuring out what each coefficient is doing in the model, or by talking with someone who can help explain how interaction models work, or ideally both. To bypass DESeq() and immediately see what coefficients will result from a given design, you can make the model matrix yourself with:

model.matrix(~ ..., data=colData(dds))

You can then examine which samples (rows) are given which coefficients (columns). If you have further questions, you will have to ask someone locally with a background in statistics or linear models to explain it. It's much easier to explain with a whiteboard and face-to-face.

The difference between treatment in liver and kidney can be found in my answer to #2 above. This difference in the treatment effect across tissue is not different across strain with the given design formula. Similarly for finding the difference in the treatment effect across strain specifically in the kidney. 

Unless you add second-order interaction terms to the model, you are assuming these differences are consistent across strain or across tissue respectively. Adding more terms makes the model more complex, and I personally would want to find evidence to justify second-order interactions before adding them. And then I really think you should get some local quantitative help with interpreting the model if you're not familiar with interaction models, as second-order interactions make things even more difficult to interpret. The coefficients will be the same as with a standard linear model in R, nothing is specific to DESeq2 here.

 

ADD REPLY
0
Entering edit mode

Hi Michael, viewing the model matrix with the coefficients certainly helped me to make sense of things.  But yes, I could certainly benefit from discussing interaction models with a statistician. Which may be necessary too if second-order interaction terms are introduced; I do not want to assume that differences are consistent across strains.  Quite the opposite, I expect the treatment effect to be different between these two strains.  Originally, I had created two completely separate DESeq2 analyses, one for each strain, evaluating the treatment effect across the 3 different tissues.  But, under the advice of including all data from a particular biological experiment together into DESeq2, I began to try to build the more complex model, which led to this post.

I will need to continue to ponder if the design formula we have discussed answers my questions, or if I need to expand it in some way.

Thanks again for your help!

 

ADD REPLY
0
Entering edit mode

Sounds reasonable. One comment:

"I do not want to assume that differences are consistent across strains. Quite the opposite, I expect the treatment effect to be different between these two strains."

This is just where it gets confusing. The model I proposed, the treatment effect is different across the two strains and across the three tissues. For a given strain x tissue pair, the total effect is additive in the log count scale.

The second-order interaction terms are for when you believe that the differences in treatment effect which you see across tissue may themselves be different across strain. Or another way to say this: that the first-order interaction terms for strain:treatment and tissue:treatment are not additive in the log count scale for strain x tissue pairs.

ADD REPLY

Login before adding your answer.

Traffic: 845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6