Question: Limma: how to account for twin data (both MZ and DZ twins)?
7 months ago, Simone0 wrote:

Hi again,

I already posted my question, but was asked to open a new one because it was suggested to use limma instead of DESeq2:
C: DESeq2 design formula: how to account for twin data (MZ and DZ)?

A summary of the relevant parts of my question:

I have to analyze RNA-seq data from 400 samples of human whole blood (cell-type proportions available from FACS blood counts), all females, 40-86 years old. My phenotype of interest is disease as compared to healthy controls. There is also an intermediate stage, i.e. there are three disease states in total, but this intermediate stage is less interesting at the moment.

The problem: There are both monozygotic (MZ, n=220) and dizygotic (DZ, n=120) twins in the data. Additionally, there are singletons (n=60 samples with no twin pair). The different groups are not evenly distributed across my phenotype of interest, and the design is quite unbalanced in general:

• 10 disease samples (2 DZ twins, rest singletons)
• 20 intermediate samples (singletons)
• 370 healthy controls (MZ twins, DZ twins, singletons)

I would like to look at both differential expression and differential variability of expression (e.g. DiffVar) between the disease samples and healthy controls (I will probably have to use subsets of the data to have similar numbers of samples in each group, as I have only 10 disease samples but 370 controls).
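For the differential-variability part, a DiffVar-style test is available through missMethyl::varFit(), which accepts (log-)expression matrices as well as methylation data. A minimal sketch, where `logCPM` and `design` are placeholder names, not objects from this thread:

```r
# Hypothetical sketch: DiffVar-style test of differential variability.
# `logCPM` is an assumed matrix of log-expression values; `design` an
# assumed design matrix whose relevant columns are given via `coef`.
library(missMethyl)

vfit <- varFit(logCPM, design = design, coef = c(1, 2))
topVar(vfit, coef = 2)  # genes ranked by evidence for differential variability
```

Whether this behaves sensibly with 10 vs. 370 samples is exactly the question raised below.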

Additionally, we would like to have the possibility to look into twin effects as well, which is one of the reasons why I do not simply want to remove one twin from each pair to end up with singletons only.

Thus I was wondering how to correctly specify the design formula for these data. Obviously - apart from covariates like cell subpopulations and perhaps age - I would like to take the twin pairs into account, as I expect a strong genetic effect. So, building blocks using the duplicateCorrelation() function could be useful, as also pointed out by Michael in my previous post. However, I would expect a much stronger effect for MZ twins than for DZ twins, so I do not think it is a good idea to treat all twin pairs equally. Adding MZ/DZ status as a covariate to the design formula will not help either, I suppose, as the effect occurs between the twin pairs, not across all individual samples within the MZ and DZ groups. Might it work with an interaction term?
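A minimal sketch of the design being discussed, with hypothetical covariate names (`disease`, `age`, and FACS cell proportions; `pheno` is an assumed sample-annotation frame). Note that the twin pairing itself would enter as a random effect through the `block` argument of duplicateCorrelation(), not as a column of the design:

```r
# Hypothetical design: adjust for age and cell composition, test disease status.
# All column names are placeholders, not taken from the actual dataset.
library(limma)

design <- model.matrix(~ disease + age + neutrophils + lymphocytes + monocytes,
                       data = pheno)

# Twin pairs are a random effect, handled via blocking rather than the design:
# duplicateCorrelation(v, design, block = pheno$twin.pair)
```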

Any advice on how to build the blocks and/or define the design formula correctly would be highly appreciated.

More details about my doubts and concerns can be found in my original post (see link above).

Thank you very much for your help.

Simone

modified 7 months ago by Aaron Lun • written 7 months ago by Simone0

This is not a direct answer, but have you done anything to quantify the difference in correlation between twins vs the correlation between unrelated individuals? Maybe it will turn out that the excess correlation between twins is not as large as you suspect. I would also try running duplicateCorrelation with only the DZ twins and only the MZ twins to see how much they differ from each other. And I guess you could randomly pair up the unrelated samples and run duplicateCorrelation as a "control" to compare the twins' duplicate correlation values to. These steps might give you information that allows you to simplify the experimental design to the point where modeling it in limma is more straightforward.
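The three checks suggested above could be sketched as follows, assuming a voom object `v`, a design matrix `design`, and hypothetical annotation columns `zygosity`, `pair` (twin-pair ID), and `singleton` in a frame `meta`:

```r
library(limma)

# Correlation within MZ pairs and within DZ pairs, estimated separately.
mz <- meta$zygosity == "MZ"
dz <- meta$zygosity == "DZ"
dupcor.mz <- duplicateCorrelation(v[, mz], design[mz, ], block = meta$pair[mz])
dupcor.dz <- duplicateCorrelation(v[, dz], design[dz, ], block = meta$pair[dz])

# "Control": pair up unrelated singletons at random and repeat.
sg <- sample(which(meta$singleton))        # shuffle, then pair consecutively
sg <- sg[seq_len(2 * (length(sg) %/% 2))]  # drop the odd one out, if any
fake.pair <- rep(seq_len(length(sg) / 2), each = 2)
dupcor.rd <- duplicateCorrelation(v[, sg], design[sg, ], block = fake.pair)

c(MZ = dupcor.mz$consensus, DZ = dupcor.dz$consensus, random = dupcor.rd$consensus)
```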

7 months ago, Aaron Lun (Cambridge, United Kingdom) wrote:

There's no reason why you need to use the same (sub)set of samples for differential expression, differential variability, and testing for twin effects. So I'm just going to address how best to perform a DE analysis with limma (assuming the counts have been voomed). Here are a few thoughts:

1. For starters, you should check whether the DZ and MZ twins actually have different correlations. This can be done by subsetting the normal samples into DZ and MZ subsets and running duplicateCorrelation on each subset separately. It may be that the correlations are very similar between the two twin types, in which case you don't have to worry.
2. If they are not similar, then you will just have to drop one twin, as duplicateCorrelation does not provide any facility for handling different random effects. I would probably drop one from each MZ pair, given that you would like to preserve as much DZ information in the few disease samples that you have.
3. Of course, duplicateCorrelation doesn't come for free. It requires more time to run and is slightly anticonservative. Given you have so many samples, I wonder whether you could obtain satisfactory results by simply ignoring one twin in any pair (DZ or MZ) and performing the DE analysis with singletons only. You can also try Ryan's suggestion, though I have no idea whether the correlation would be close to zero. I guess it depends on the contribution of genotype to expression.
4. There is no need to subset groups to equalize the sample sizes for a valid analysis with limma. Of greater concern - particularly for patient data - are hidden factors of variation in your groups. I can imagine that there is more variability within your normal samples due to genotype/lifestyle/etc., especially if disease samples are more prevalent in a subset of the population. This may require methods such as RUVseq or sva to identify and model such factors.
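If the correlations do turn out similar enough to treat all pairs equally, the standard limma workflow with a single blocking factor might look like the sketch below (`dge`, `design`, and `pair` are assumed objects; the second voom/duplicateCorrelation pass follows the usual limma advice of iterating once):

```r
library(limma)

v <- voom(dge, design)                                   # dge: a DGEList of counts
dupcor <- duplicateCorrelation(v, design, block = pair)  # pair: twin-pair IDs
v <- voom(dge, design, block = pair, correlation = dupcor$consensus)
dupcor <- duplicateCorrelation(v, design, block = pair)

fit <- lmFit(v, design, block = pair, correlation = dupcor$consensus)
fit <- eBayes(fit)
topTable(fit, coef = 2)  # coefficient for the disease contrast (position assumed)
```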
1. I ran duplicateCorrelation on filtered and voom-transformed counts. I did it for MZ, DZ, and I also created random pairs of singletons:

> dupcor.mz$consensus
[1] 0.5322475
> dupcor.dz$consensus
[1] 0.3874335
> dupcor.rd$consensus
[1] 0.1919584
This looks reasonable to me and is what I would have expected to see.

2. Agree, that's what I thought.

3. As discussed in my previous post, I was also thinking about removing one individual from every twin pair for my main analysis. For any twin effects, I could prepare a set of MZ twins only, as we are more interested in those in general. However, I have one pair of DZ twins in the disease set, which could be valuable.
And in terms of comparability across the different analyses, I thought it would be better to have only one dataset, if there were a way to handle it correctly.

Also, if I do the analyses on separate subsets of the data, should I already split everything before voom normalization? Removing twins or not will surely affect the variance estimation. Usually I prefer to have as much data as possible, but as I cannot really tell voom about the twin structure, it might be better not to?

4. With this (equalizing sample sizes), I was referring to the analysis of differential variability. In my experience, 9 samples is at the limit for any meaningful analysis of variability. I think that comparing the variability of 9 disease samples to the variability of more than 200 control samples will not work well. I will have to play around with this (of course, I will also look at SVA etc., but that in any case), probably ending up doing a more exploratory type of analysis than statistical testing. Still, I would like to do the best I can with these data, and prepare them in a suitable way.

Regarding (3): don't worry about having different data subsets for different types of analyses. Yes, we would always like to use as much of the data as we can, but sometimes this causes more trouble than it's worth. Also, give voom whatever you plan to use for the rest of the DE analysis; so you should split everything beforehand.
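Splitting before voom might look like this, so that filtering, normalization factors, and the mean-variance trend are all estimated on exactly the samples used downstream (`keep` is a hypothetical logical vector marking the retained samples):

```r
library(edgeR)
library(limma)

dge.sub <- dge[, keep]                       # subset the DGEList's samples first
design.sub <- design[keep, , drop = FALSE]   # and the design to match
dge.sub <- dge.sub[filterByExpr(dge.sub, design.sub), , keep.lib.sizes = FALSE]
dge.sub <- calcNormFactors(dge.sub)
v.sub <- voom(dge.sub, design.sub)           # trend fitted on the subset only
```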

Regarding (4): subsampling will not change the expected variance of the groups. It will only serve to reduce the precision of the variance estimate of the larger group, and thus reduce the power to detect any differences in variability between groups. I would be surprised if a statistically rigorous method for detecting differential variability could not handle differences in group size - which pops up all the time in unbalanced designs - but hey, I guess anything's possible.

Re 3: I will split the data before voom; this does indeed seem to be the better approach in this particular case.
Re 4: I will certainly play around with this and try different options: using all samples, and also comparing random subsamples of 9 to see how much the 'variability varies' within the control samples at such a small sample size. We will see what happens. Always good to explore :-)
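That subsampling exploration could be sketched as below, with `expr` a hypothetical log-expression matrix and `ctrl` the column indices of the control samples (both names assumed):

```r
set.seed(42)
# For each of 100 random draws of 9 controls, compute per-gene variances.
var.draws <- replicate(100, {
  s <- sample(ctrl, 9)
  apply(expr[, s, drop = FALSE], 1, var)
})
# How much does the per-gene variance estimate fluctuate across the draws?
summary(apply(var.draws, 1, sd))
```

A wide spread here, relative to the differences between groups, would confirm that 9 samples is too few for a formal variability test.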
Thank you for the useful discussion and feedback.