Question

Multifactor nested design in DESeq2 with non-paired data

0

Entering edit mode

chap02 • 0

@chap02-9528

Last seen 6.6 years ago

Hello,

I have a multi-factor design RNA-seq experiment I am analyzing with DESeq2. 2 strains, 2 treatments, 3 tissues. Here is the colData:

	treatment	strain	tissue
1	control	A	kidney
2	control	A	kidney
3	control	A	kidney
4	control	A	kidney
5	control	A	kidney
6	control	A	kidney
7	control	A	kidney
8	control	B	kidney
9	control	B	kidney
10	control	B	kidney
11	control	B	kidney
12	control	B	kidney
13	treated	A	kidney
14	treated	A	kidney
15	treated	A	kidney
16	treated	A	kidney
17	treated	A	kidney
18	treated	A	kidney
19	treated	B	kidney
20	treated	B	kidney
21	treated	B	kidney
22	treated	B	kidney
23	treated	B	kidney
24	treated	B	kidney
25	control	A	liver
26	control	A	liver
27	control	A	liver
28	control	A	liver
29	control	A	liver
30	control	A	liver
31	control	B	liver
32	control	B	liver
33	control	B	liver
34	control	B	liver
35	control	B	liver
36	treated	A	liver
37	treated	A	liver
38	treated	A	liver
39	treated	A	liver
40	treated	A	liver
41	treated	A	liver
42	treated	B	liver
43	treated	B	liver
44	treated	B	liver
45	treated	B	liver
46	treated	B	liver
47	treated	B	liver
48	control	A	lung
49	control	A	lung
50	control	A	lung
51	control	A	lung
52	control	A	lung
53	control	A	lung
54	control	A	lung
55	control	B	lung
56	control	B	lung
57	control	B	lung
58	control	B	lung
59	control	B	lung
60	treated	A	lung
61	treated	A	lung
62	treated	A	lung
63	treated	A	lung
64	treated	A	lung
65	treated	A	lung
66	treated	B	lung
67	treated	B	lung
68	treated	B	lung
69	treated	B	lung
70	treated	B	lung
71	treated	B	lung

I would like to do nested comparisons, but I am running into complications because I do not have paired samples.

I would like to ask:

1.) what is the treatment effect for each tissue within each strain?

2.) what is the difference in the treatment effect for each tissue within the same strain?

3.) what is the difference in the treatment effect for the same tissue, but between the two strains?

First, I tried this approach:

colData$strain.nested = factor(paste(colData$strain,colData$tissue, sep=""))
dds = DESeqDataSetFromMatrix(countData = countData,colData = colData, design = ~ treatment + strain.nested:treatment + tissue:treatment)
dds = DESeq(dds, modelMatrixType="standard")
results(dds, name="strain.nestedAkidney.treatmenttreated")
results(dds, contrast=list("strain.nestedAkidney.treatmenttreated", "strain.nestedBkidney.treatmenttreated")

But, I get the following error message after the modelMatrixType call:

"Error in estimateDispersionsGeneEst(object, maxit = maxit, quiet = quiet) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed"

Next, I tried to build a custom model matrix to solve this issue, using the bnbinomLRT function, as described in a previous post C: DESeq2 paired and multi factor comparison

dds = DESeqDataSetFromMatrix(countData = countData,colData = colData, design = ~treatment)
dds$treatment = relevel(dds$treatment, "control")
mm1 <- model.matrix(~ treatment + treatment:strain + treatment:tissue, colData)
idx <- which(colSums(mm1 == 0) == nrow(mm1))
mm1 <- mm1[,-idx]
mm0 <- model.matrix(~ treatment + treatment:strain, colData)

design(dds) <- ~ treatment + treatment:strain + treatment:tissue
dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds, modelMatrix=mm1)

dds <- estimateDispersionsFit(dds)
dds <- estimateDispersionsMAP(dds, modelMatrix=mm1)

dds <- nbinomLRT(dds, full=mm1, reduced=mm0)

However, I appear to have too many columns with missing data after columns with all 0 are removed, and I receive the following error message after the estimateDispersionsGeneEst(dds, modelMatrix=mm1) call:

"Error in solve.default(qr.R(qrx)) : 'a' (1 x 0) must be square"

I have also considered creating a new column in my colData that describes a group according to their treatment/tissue/strain combination and completely simplifying the design to ~group. However, if I do that, I am not sure that contrast statements between each group would be a correct comparison to answer my questions.

Any advice on how to appropriately set up this experiment would be greatly appreciated!

My apologies if there a post out there that discusses this, and if so, please point me in that direction.

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] DESeq2_1.6.3              RcppArmadillo_0.5.100.1.0 Rcpp_0.11.6              
[4] GenomicRanges_1.18.4      GenomeInfoDb_1.2.5        IRanges_2.0.1            
[7] S4Vectors_0.4.0           BiocGenerics_0.12.1      

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.2 base64enc_0.1-3     
 [5] BatchJobs_1.6        BBmisc_1.9           Biobase_2.26.0       BiocParallel_1.0.3  
 [9] brew_1.0-6           checkmate_1.6.3      cluster_2.0.3        codetools_0.2-14    
[13] colorspace_1.2-6     DBI_0.3.1            digest_0.6.9         fail_1.3            
[17] foreach_1.4.3        foreign_0.8-66       Formula_1.2-1        genefilter_1.48.1   
[21] geneplotter_1.44.0   ggplot2_2.0.0        grid_3.1.1           gridExtra_2.0.0     
[25] gtable_0.1.2         Hmisc_3.17-1         iterators_1.0.8      lattice_0.20-33     
[29] latticeExtra_0.6-26  locfit_1.5-9.1       magrittr_1.5         munsell_0.4.2       
[33] nnet_7.3-11          plyr_1.8.3           RColorBrewer_1.1-2   rpart_4.1-10        
[37] RSQLite_1.0.0        scales_0.3.0         sendmailR_1.2-1      splines_3.1.1       
[41] stringi_1.0-1        stringr_1.0.0        survival_2.38-3      tools_3.1.1         
[45] XML_3.98-1.3         xtable_1.8-0         XVector_0.6.0

DESeq2 multiple factor design • 3.0k views

ADD COMMENT • link updated 8.3 years ago by Michael Love 41k • written 8.3 years ago by chap02 • 0

score 1 · Accepted Answer · 2016-01-16

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 15 hours ago

United States

hi,

I appreciate the time you spent reading the vignette as well as related posts on the support site.

So, the 'nested' section in the vignette may have led you in a bit of a wrong direction. That recommendation is really for designs in which some levels of factors are present only within a level of another factor, for example, patient 1-3 are within control and patient 4-6 and within case, and suppose we want to look at treatment differences across case and control, while accounting for patient differences.

In your case, you have a block design, where treatment and strain are repeated equally for all tissues. So this is even easier to approach.

First, you should upgrade to the latest version of R and Bioconductor, which will give you DESeq2 version 1.10. You can do this by installing the latest version of R, and then installing Bioconductor (http://bioconductor.org/install). It's best practice to stay with the current version of Bioconductor (which updates every April and October). DESeq2 v1.6 is about 1.5 years old (released October 2014), and so you're missing a number of bug fixes, simplified interface, and some new useful features.

It looks like you want to model a treatment effect for each tissue and a treatment effect for each strain. I'd recommend a design of ~ tissue + strain + tissue:treatment + strain:treatment. This will produce terms controlling tissue and strain overall (what we call main effects), and then specific treatment effect for each strain and for each tissue. Take a look at resultsNames(dds) to see what coefficients this design gives you. Note that there is a treatment effect for each tissue, but then a single interaction for treatment and strain B. You'll see how this works below:

For your questions:

1.) what is the treatment effect for each tissue within each strain?

This would be, for example the treatment effect for kidney in strain A:

results(dds, name="tissuekidney.treatmenttreated")

For the treatment effect in kidney in strain B, you add the interaction term which is the additional effect of treatment in strain B.

results(dds, contrast=list(c("tissuekidney.treatmenttreated","strainB.treatmenttreated")))

2.) what is the difference in the treatment effect for each tissue within the same strain?

This is the same for strain A and strain B. You can make pairwise comparisons between tissues like so:

results(dds, contrast=list("tissueliver.treatmenttreated","tissuekidney.treatmenttreated"))

3.) what is the difference in the treatment effect for the same tissue, but between the two strains?

Likewise, this is the same for the tissues, and is the single interaction term:

results(dds, name="strainB.treatmenttreated")

ADD COMMENT • link 8.3 years ago Michael Love 41k

0

Entering edit mode

Hi Michael, thank you very much for your response and for clarifying my misunderstanding of the usage of nested approaches. I'm glad that the appropriate design is simpler than I anticipated!

I have a few follow-up questions:

Forgive my confusion, as I admit I am continuing to try to better understand the use of interaction terms, but I don't quite understand why

results(dds, name="tissuekidney.treatmenttreated")

would yield results for the treatment effect in stain A. Would this not simply give me the overall treatment effect in the kidney, regardless of strain? Similarly, I would expect that

results(dds,contrast=list(c("tissueliver.treatmenttreated","tissuekidney.treatmenttreated")))

would tell me the difference in treatment effect between kidney and liver, regardless of the strain. To find this contrast for only strainB, is it correct to use the following:

results(dds, contrast=list(c("tissueliver.treatmentBD","tissuekidney.treatmentBD", "strainCAST.treatmentBD")))

?

Also, for my original question 3), "what is the difference in the treatment effect for the same tissue, but between the two strains?" I am not sure I described it appropriately. I want to find, for example, the difference in the treatment effect in the kidney of strainA vs. that of the kidney of strainB. This is where I get a bit hung up, because I want to pull out 3 attributes for each group for the comparison (e.g. the sample is a kidney, from strainB, and is treated). For this, I considered introducing another column into my colData to merge strain and tissue, subsequently including an additional interaction term, and then pull out contrasts, something like

results(dds,contrast=list(c("tissueliverstrainA.treatmenttreated","tissueliverstrainB.treatmenttreated")))

Thanks again for your help. I will be updating my R and Bioconductor.

ADD REPLY • link 8.3 years ago chap02 • 0

0

Entering edit mode

first I want to point out I made a mistake in my earlier code for question #2 above. I have fixed it. I previously wrote c() but it should be list() to compare two terms.

"I don't quite understand why ... would yield results for treatment effect in strain A"

This is the kidney treatment effect in strain A. Why strain A only? This you'll have to either see by looking at the resultsNames(dds) and figuring out what each coefficient is doing in the model, or by talking with someone who can help explain how interaction models work, or ideally both. To bypass DESeq() and immediately see what coefficients will result from a given design, you can make the model matrix yourself with:

model.matrix(~ ..., data=colData(dds))

You can then examine which samples (rows) are given which coefficients (columns). If you have further questions, you will have to ask someone locally with a background in statistics or linear models to explain it. It's much easier to explain with a whiteboard and face-to-face.

The difference between treatment in liver and kidney can be found in my answer to #2 above. This difference in the treatment effect across tissue is not different across strain with the given design formula. Similarly for finding the difference in the treatment effect across strain specifically in the kidney.

Unless you add second-order interaction terms to the model, you are assuming these differences are consistent across strain or across tissue respectively. Adding more terms makes the model more complex, and I personally would want to find evidence to justify second-order interactions before adding them. And then I really think you should get some local quantitative help with interpreting the model if you're not familiar with interaction models, as second-order interactions make things even more difficult to interpret. The coefficients will be the same as with a standard linear model in R, nothing is specific to DESeq2 here.

ADD REPLY • link 8.3 years ago Michael Love 41k

0

Entering edit mode

Hi Michael, viewing the model matrix with the coefficients certainly helped me to make sense of things. But yes, I could certainly benefit from discussing interaction models with a statistician. Which may be necessary too if second-order interaction terms are introduced; I do not want to assume that differences are consistent across strains. Quite the opposite, I expect the treatment effect to be different between these two strains. Originally, I had created two completely separate DESeq2 analyses, one for each strain, evaluating the treatment effect across the 3 different tissues. But, under the advice of including all data from a particular biological experiment together into DESeq2, I began to try to build the more complex model, which led to this post.

I will need to continue to ponder if the design formula we have discussed answers my questions, or if I need to expand it in some way.

Thanks again for your help!

ADD REPLY • link 8.3 years ago chap02 • 0

0

Entering edit mode

Sounds reasonable. One comment:

"I do not want to assume that differences are consistent across strains. Quite the opposite, I expect the treatment effect to be different between these two strains."

This is just where it gets confusing. The model I proposed, the treatment effect is different across the two strains and across the three tissues. For a given strain x tissue pair, the total effect is additive in the log count scale.

The second-order interaction terms are for when you believe that the differences in treatment effect which you see across tissue may themselves be different across strain. Or another way to say this: that the first-order interaction terms for strain:treatment and tissue:treatment are not additive in the log count scale for strain x tissue pairs.

ADD REPLY • link 8.3 years ago Michael Love 41k