Question: DESeq2 design formula: how to account for twin data (MZ and DZ)?
gravatar for Simone
28 days ago by
Simone0 wrote:

Hi all,

I have to analyze some RNA-seq data and I am using DESeq2 for this. However, the experimental design is somewhat complicated in this case and I am unsure about how to correctly define the design matrix for DESeq2.

I am mostly following the “RNA-seq workflow: gene-level exploratory analysis and differential expression” workflow on Bioconductor and have already got a SummarizedExperiment. Now, I want to convert my data into a DESeqDataSet: For this step, I need to specify the design formula:

ddsSE <- DESeqDataSet(se, design = ~ [covariates] + status)

About my data:

There are around 400 human peripheral blood samples in the data set. The phenotype of interest is a disease (“status”). I have 10 disease samples, 17 samples of an intermediate stage, and the rest of the 400 are healthy controls. Samples are all derived from females between 40 and 86 years old, so usually I would like to add a covariate for age. Then, there are different proportions of different blood cell subtypes (proportions obtained by FACS) in each sample, which we would (usually) like to take into account by adding further covariates.

So far, so good. But in these data, I also have twins. So, I would like to take the pairs into account, of course, as I expect a strong genetic effect. The problem: I have both monozygotic (MZ, around 220 samples) and dizygotic (DZ, around 120 samples) twins, and I do not think it is a good idea to treat these pairs equally. Adding the information of MZ and DZ as a covariate will also not help, I suppose, as the effect occurs between the twin pairs, not across all individual samples within the group of MZs and DZs.

To make it even more difficult, I have singletons as well (no sibling sample in the data). I do also not have disease-discordant twins. Instead, I have a DZ twin pair who both have the disease. The rest of twins are all within the healthy control samples.

As stated before, the main phenotype of interest is disease (as compared to healthy). But we would also like to be able to look at the twin-effect on the side, in comparison to some other analyses in another data set (but still for the same study). Thus, simply removing the twins to have singletons only is not a good solution. We would not really loose power for the main comparison of disease vs healthy (I will only use matched singleton subsamples for the healthy group anyhow), but we would have no chance to do the additional analyses on twins we would like to perform.


400 samples, whole blood (blood counts/proportions of cell types available), females, 40-86 years old
220 MZ twins, 120 DZ twins, 60 singletons
3 phenotypes of interest:

  • 10 disease samples (2 DZ twins, rest singletons)
  • 20 intermediate samples (singletons)
  • 370 healthy controls (MZ twins, DZ twins, singletons)

We are interested in differential expression, but also in differential variability of expression between the phenotypes of interest.
Additionally, we would like to look at twin effects.

What I thought we could possibly do:

  • Prepare a data set with singletons only for the case-control analyses:
    design = ~ age + celltype1 + celltype2 [+ ...] + status

  • Prepare another data set with healthy MZ twins only for the complementary twin analyses
    design = ~ age + celltype1 + celltype2 [+ ...] + familyID

Actually, I would prefer to have it all in one, if there was a way of doing so. Because normalization etc. will change expression values between the separate (sub) data sets, and that might render the data and results less comparable. But using the data set as a whole without taking the different types of twins effectively into account would be even worse, in my opinion.   

Which strategy would you advice? Would there be an appropriate way of keeping all samples within the data but still model it correctly, taking the different types of twins into account?

Any recommendation on how to do best deal with these data would be highly appreciated.
Thank you!


ADD COMMENTlink modified 27 days ago • written 28 days ago by Simone0

Hi Michael,

Thank you very much for your reply. This is interesting, and I have got two more questions:

1) Regarding DESeq2 on MZ and DZ twins

Out of curiosity, if there were no singletons in my data but only MZ and DZ twins, what would be the correct way of modelling it with DESeq2?

Could I apply a strategy similar to what is described in section “Group-specific condition effects, individuals nested within groups” of the DESeq2 vignette, introducing a factor zyg with levels MZ and DZ? Or would that be incorrect? Also, if something like this was possible, couldn’t I add a third level for zyg indicating the singletons (with all unique familyIDs within the group of singletons)?

2) Regarding limma on MZ and DZ twins

Thank you also for pointing out that using limma will be better in the case of this data set. I have used limma and duplicateCorrelation() to build blocks of paired samples before. However, also here, I am not sure how I could distinguish between MZ and DZ twins correctly. Should I just build blocks of twins irrespective zygosity using the familyIDs? Wouldn't that introduce a bias? And I would probably still need an additional interaction term for the design?

Sorry for the confusion, and again, thank you very much for your help.


ADD REPLYlink modified 27 days ago • written 27 days ago by Simone0

If there were no singletons, you would control using DESeq2 the within-twin variation using the methods listed in that section, but there would be more modeling choices to make to determine the design. E.g. do you want separate disease effects for MZ and DZ? I can't say what the design would be without having a longer conversation, but I don't think this is the right analysis path anyway, because you can't add the singletons to a design that controls for within-twin differences and then make comparisons across disease vs healthy.

Regarding how to use limma with duplicateCorrelation and deal with MZ, DZ and singletons, I'd recommend making a new post and tagging 'limma' so the package authors can guide you to a solution.

ADD REPLYlink written 27 days ago by Michael Love14k

Thank you, Michael. I understand.

However, would you be able to give a hint about where to find the section about modeling the different twin effects? When I search for the word "twin" in the DESeq2 vignette, I do not get any results, so I am not sure which section you refer to. I looked at the one updated in October 2017 available at I would just like to read about it.

I have posted a new question on how to proceed with limma:
Limma: how to account for twin data (both MZ and DZ twins)?

ADD REPLYlink written 27 days ago by Simone0

By "that section" I meant the section you referred to above, "Group-specific condition effects, individuals nested within groups". But now that I re-read your question, I'm not sure this section is relevant. I'm not sure what the point of the MZ and DZ pairs are in the analysis. I actually think duplicateCorrelation() is the only way to go, regardless, because want to control for the correlation and compare across groups that have different sets of pairs (and singletons).

ADD REPLYlink written 27 days ago by Michael Love14k
gravatar for Michael Love
28 days ago by
Michael Love14k
United States
Michael Love14k wrote:

If it weren't for the singletons, you could use DESeq2, controlling for within-twin effects, and fitting a different effect for different types of twins potentially (there is a section in the vignette on how to do this controlling, while still having group-specific condition/treatment effects).

However, I don't think it's possible to also add in the singletons in the analysis, where you are also controlling for within-twin effects, using the methods of DESeq2. This structure of the data points you to using limma instead, with the duplicateCorrelation() function. This is a more flexible way to deal with correlation structure among samples, where some samples may have a correlation (twins) and others not (singletons). It's just the case that some kinds of models can be fit by both methods, others only by the duplicateCorrelation() approach. So I'd recommend taking a look at the limma User Guide on this topic.

ADD COMMENTlink written 28 days ago by Michael Love14k

Just out of curiosity: how would you model a dataset with lets sat 100 twins Michael. Would you have one coefficient per twin (aka a 100 additional coefficient to estimate) or is there some way of generalizing it with e.g. a random effect?

ADD REPLYlink written 26 days ago by k.vitting.seerup10

DESeq2 only has fixed effects. In general, how to model or whether something can be done in DESeq2 depends on the question being asked, what are the measurement and experimental design, etc.

ADD REPLYlink written 26 days ago by Michael Love14k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 249 users visited in the last hour