I am running an experiment where we have samples from a number of individuals, pre and post treatment. These individuals were then also grouped according to features found in the DNA. So, overall the design looks like this. I named pre treatment "before" so that DESeq2 will automatically take it as the first level in the factor. All three groups are of interest to compare to the other two (none of the 3 is a control).
Sample Patient Treatment Group 1 PatientA-pre PatientA before Group3 2 PatientA-post PatientA post Group3 3 PatientB-pre PatientB before Group3 4 PatientB-post PatientB post Group3 5 PatientC-pre PatientC before Group1 6 PatientC-post PatientC post Group1 7 PatientD-pre PatientD before Group3 8 PatientD-post PatientD post Group3 9 PatientE-pre PatientE before Group1 10 PatientE-post PatientE post Group1 11 PatientF-pre PatientF before Group1 12 PatientF-post PatientF post Group1 13 PatientG-pre PatientG before Group1 14 PatientG-post PatientG post Group1 15 PatientH-pre PatientH before Group3 16 PatientH-post PatientH post Group3 17 PatientI-pre PatientI before Group1 18 PatientI-post PatientI post Group1 19 PatientJ-pre PatientJ before Group1 20 PatientJ-post PatientJ post Group1 21 PatientK-pre PatientK before Group1 22 PatientK-post PatientK post Group1 23 PatientL-pre PatientL before Group2 24 PatientL-post PatientL post Group2 25 PatientM-pre PatientM before Group2 26 PatientM-post PatientM post Group2 27 PatientN-pre PatientN before Group2 28 PatientN-post PatientN post Group2 29 PatientO-pre PatientO before Group2 30 PatientO-post PatientO post Group2 31 PatientP-pre PatientP before Group2 32 PatientP-post PatientP post Group2 33 PatientQ-pre PatientQ before Group1 34 PatientQ-post PatientQ post Group1 35 PatientR-pre PatientR before Group2 36 PatientR-post PatientR post Group2 37 PatientS-pre PatientS before Group3 38 PatientS-post PatientS post Group3
We are interested in determining the interaction between groups and treatment. In other words, do certain genes have a different change between pre and post depending on group?
I know if I did not have to control for individual (Patient), I would run like so (maybe afterward also re-running with group1 not as the first level so can test that):
dds <- DESeq(DESeqDataSetFromMatrix(countData=featureCounts_pairs,colData=design_table,design=~Treatment+Group+Treatment:Group)) results_group2 <- results(dds,name="Treatmentpost.GroupGroup2") results_group3 <- results(dds,name="Treatmentpost.GroupGroup3")
However, I would also like to control for patient differences within each group. So that if a gene goes up or down consistently between pre and post within all patients of one group but not the other, it will show up, even if there is a lot of variability between all "before" or all "post" samples within each group.
Adding "+Patient" to the design above does not solve this issue, as then I get the "model matrix not full rank" error since each of the two samples for the same patient are also in the same group.
That makes sense, although for the last part ("now this matrix
m1
can be provided to thefull
argument of DESeq") I am still unclear on the syntax. Can you take a look at my code and let me know?First, I add the patient indices to the design table.
Next, I create the model matrix, and then remove columns with all zeroes.
Now, I am unclear on the syntax of actually making the DESeq object. Here is what I tried so far. Does this make sense?
Update: Tried running the following code afterwards, but get the same p-values no matter which two groups are being compared. Also there are way too many DEGs given the expected strength of the signal. Don't think the above design is correct, but not sure how to fix it.
See the section of
?results
about the LRT. It will explain what you observed.Finally got results that make sense. This is my code.
Only issue left is I am still getting 206 rows that do not converge in beta even after strict expression filtering and increasing maxit (following the tips here: https://support.bioconductor.org/p/65091/). Any other tips on how to get them all to converge? I know I can just remove those rows if I really need to, but I would prefer to just get them all to converge.
What do the counts look like for the rows that aren't converging? The convergence is slower when there are many coefficients to fit, but often this is where only a few samples have expression. You could change from 30 counts and 3 samples to less counts but more samples, e.g. 10 normalized counts but more samples required.
There are a few genes that are DE if you do not remove these rows, and for these the counts are often high in quite a few samples.
For example, here are normalized counts for one gene that did not converge, for group1 and group3. Note, in the real data there are 4 individuals for group 1 and 9 individuals for group3.
It also seems to make sense that this gene is showing up as DE for the interaction term. In group1, 3/4 individuals have strong downregulation of this gene in post, with no change pre/post in the 4th individual. In group3, although most individuals do also have downregulation in post, the change is not nearly as severe.
Here group3 indices 2,4,5,6, and 9 have downregulation in post, but the change is way less dramatic. Plus indices 1,3,and 7 actually have upregulation in post, with index 8 being constant.
Update: Also tried a super high number of iterations (maxit=10000) with only a few additional rows converging.
I think you can safely ignore the 'not converging' warning for now, which I think is overly conservative here. what's happening is a tradeoff between the post/before effect in a group and the per-group patient effect, which gives probably very small changes in beta that very slightly increase the log likelihood at each iteration. I'd guess the LFC and test statistics aren't changing much as you increase maxit.
I also realized just now that I did not supply the model matrix to estimateDispersions. The code should be estimateDispersions(dds,modelMatrix = design_table_model_matrix). I imagine this could have affected the results as well.
Yes, you need to supply any custom model matrix to each function (estimateDispersions(), and nbinom*), if you aren't using DESeq().