Hello,
I have been searching a looot about it, apologies if I have missed a solution.
I am trying to do a paired analysis in DESeq2, meaning having a paired samples design.
I have a dataset with two groups (which is my batch), each group has uneven number of subjects, and each subject can only be in one of the groups and have one or two samples (so one or both conditions). Note that my counts are estimated with kallisto. Just a toy example of how my colData look like:
sample batch condition subject
1 sample1 1 A S1
2 sample2 1 B S1
3 sample3 1 A S2
4 sample4 1 B S2
5 sample5 1 A S3
6 sample6 1 B S3
7 sample7 1 A S4
8 sample8 1 B S4
9 sample9 1 A S5
10 sample10 1 B S5
11 sample11 2 A S6
12 sample12 2 B S6
13 sample13 2 A S7
14 sample14 2 B S7
15 sample15 2 A S8
In this example, batch == 1 has 5 subjects with both conditions per subject, while batch == 2 has 3 subjects, one of which has only one condition. I simplified with keeping balanced paired samples with respect to the condition, so I filtered out sample15.
My goal is to test the condition effect while controlling for subject effects.
So initially I thought my model would be ~ batch + subject + condition. And the resultName I would look at to see the condition effect (while having controlled for batch and subject effects) is the 'condition_B_vs_A'. This model design leads to the "Model matrix not full rank" error.
dds = DESeqDataSetFromMatrix(countData = counts.mat, colData = sample.summary.balanced, design = ~ batch + subject + condition)
dds <- DESeq(dds)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
The problem is the linearity of batch and subject. The examples of linearity in the vignette do not exactly match the case here to my understanding.
While I have tried a few thoughts motivated of some conversations here (like my last resort was applying batch correction first, converting to positive integers and then run DESeq2 with ~ subject + condition, which gave zero DEGs), nothing I have tried works.
By the way, if I don't do a paired design and just have ~batch+condition the model works nicely. But I would like to take advantage of the fact that I have both conditions per subject.
Any insight would be greatly appreciated!
Thank you Michael for your response.
While I have seen the discussions that recommend limma with duplicateCorrelation, I was wondering if I could do it with DESeq2.
My confusion was in exactly that part of the vignette "Group-specific condition effects, individuals nested within groups" because I thought subject.n was kind of grouping two different subjects - but I have realized now what exactly it does so that's ok. But I am not looking for group-specific (or batch-specific in my case) condition effects but just condition effects while adjusting for group/batch and subject differences. So I tried the following model after I added subject.n variable:
But let me go one step before, where I got a convergence issue in the simple model, for which I tried two things that didn't resolve the issue. First to split DESeq function into steps and increase maxit, and second to filter even more low expressed genes.
How can I resolve the convergence issue? I even repeated the separate steps and added the useOptim option:
Also, to get the condition effect correcting for differences in batch and subjects, is it res1 or res2?
Jumping to the top, if you don’t want group specific condition effects then just use ~sample + condition. You don’t (and cannot) control at a higher resolution than sample.
If by group-specific we mean batch-specific (in this case), then I surely don't. I just need to correct for the batch.
When you say "You don’t (and cannot) control at a higher resolution than sample", you mean in DESeq2 or in general? If my model is ~sample + condition then how do I take into account that sample1 and sample2 comes from the same subject? Or that sample1 and sample14 are from different batches?
Sorry, above you have
subject
. Use~subject + condition
. You don't need to additionally control for anything for within which subject is nested.mmm I had this thought too, but isn't it between and within subjects? So for the between part, wouldn't I need to correct for batch? What do you mean when you say "within which subject is nested"? Is it nested now that
batch:subject.n
is not in this model?Also, still with the
~subject + condition
(which is the model I start with above) there are genes that do not converge.Subject is within batch -- perhaps discuss with a statistician how this is the case.
Re: convergence, I'd recommend the following:
For x chosen to avoid genes where you have a lot of zeros, maybe 7.
Thank you Michael, everything is clear now.
Indeed the convergence was resolved by raising
x = 2/3 of my samples
. Thanks a lot!If I may ask one more thing, why even though I reach convergence, I find no DEGs?
The only way I get some DEGs is if I filter by p-value alone. Not even FC. Here are the ranges of the log2FC, padj and p-values, and a volcano with p-values and log2FC distributions.
There are not large changes in your dataset. You cannot reject the null for these genes.
You may want to discuss with a statistician about powering future datasets.
I totally agree. But when I don't do paired analysis, and I use a simple ~batch+condition model, the power is very good. I filter by FDR and 4FC and get hundreds of DEGs. I wonder why the paired is so poor.
Sorry, I missed that part in your initial post.
Well I guess you could be on the verge of significance with the 100s, and by adding all the additional covariates for subject terms, you lose degrees of freedom.
I would note that, with accounting for patient baseline, these LFC are pretty small. So I wouldn't push to hard on the data when it's telling you the changes are small.
I understand your point and agree with you.
But when I use 1/10 of my samples for the paired analysis then I get like 1000 DEGs with p0.01 and 4FC thresholds. Still not good enough FDRs though. But I did expect having more samples would increase the power, not the opposite. It feels weird to me. But maybe you are right, maybe adding all these subject terms is minimizing any effect.
Would you think limma would be more appropriate?
I certainly don't want to push too hard the data, just want to make sure I'm doing it the right way..
On the software question side I have no more advice.
Your comments are much appreciated, thank you