I have to analyze some RNA-seq data and I am using DESeq2 for this. However, the experimental design is somewhat complicated in this case and I am unsure about how to correctly define the design matrix for DESeq2.
I am mostly following the “RNA-seq workflow: gene-level exploratory analysis and differential expression” workflow on Bioconductor and have already got a SummarizedExperiment. Now, I want to convert my data into a DESeqDataSet: http://www.bioconductor.org/help/workflows/rnaseqGene/#starting-from-summarizedexperiment. For this step, I need to specify the design formula:
ddsSE <- DESeqDataSet(se, design = ~ [covariates] + status)
About my data:
There are around 400 human peripheral blood samples in the data set. The phenotype of interest is a disease (“
status”). I have 10 disease samples, 17 samples of an intermediate stage, and the rest of the 400 are healthy controls. Samples are all derived from females between 40 and 86 years old, so usually I would like to add a covariate for age. Then, there are different proportions of different blood cell subtypes (proportions obtained by FACS) in each sample, which we would (usually) like to take into account by adding further covariates.
So far, so good. But in these data, I also have twins. So, I would like to take the pairs into account, of course, as I expect a strong genetic effect. The problem: I have both monozygotic (MZ, around 220 samples) and dizygotic (DZ, around 120 samples) twins, and I do not think it is a good idea to treat these pairs equally. Adding the information of MZ and DZ as a covariate will also not help, I suppose, as the effect occurs between the twin pairs, not across all individual samples within the group of MZs and DZs.
To make it even more difficult, I have singletons as well (no sibling sample in the data). I do also not have disease-discordant twins. Instead, I have a DZ twin pair who both have the disease. The rest of twins are all within the healthy control samples.
As stated before, the main phenotype of interest is disease (as compared to healthy). But we would also like to be able to look at the twin-effect on the side, in comparison to some other analyses in another data set (but still for the same study). Thus, simply removing the twins to have singletons only is not a good solution. We would not really loose power for the main comparison of disease vs healthy (I will only use matched singleton subsamples for the healthy group anyhow), but we would have no chance to do the additional analyses on twins we would like to perform.
400 samples, whole blood (blood counts/proportions of cell types available), females, 40-86 years old
220 MZ twins, 120 DZ twins, 60 singletons
3 phenotypes of interest:
- 10 disease samples (2 DZ twins, rest singletons)
- 20 intermediate samples (singletons)
- 370 healthy controls (MZ twins, DZ twins, singletons)
We are interested in differential expression, but also in differential variability of expression between the phenotypes of interest.
Additionally, we would like to look at twin effects.
What I thought we could possibly do:
Prepare a data set with singletons only for the case-control analyses:
design = ~ age + celltype1 + celltype2 [+ ...] + status
Prepare another data set with healthy MZ twins only for the complementary twin analyses
design = ~ age + celltype1 + celltype2 [+ ...] + familyID
Actually, I would prefer to have it all in one, if there was a way of doing so. Because normalization etc. will change expression values between the separate (sub) data sets, and that might render the data and results less comparable. But using the data set as a whole without taking the different types of twins effectively into account would be even worse, in my opinion.
Which strategy would you advice? Would there be an appropriate way of keeping all samples within the data but still model it correctly, taking the different types of twins into account?
Any recommendation on how to do best deal with these data would be highly appreciated.