Hello,
I'm trying to perform a time-course analysis of RNA-seq data to identify genes that are differentially expressed as a function of experimental condition and how they change over time. However, I have an unbalanced experimental design. I've read through the documentation, vignettes, and various posts, and tried the most straightforward workarounds (I'm not a biostatistician and am new to RNA-seq analysis), but none fully addressed my experimental set-up/analysis needs.
In the study, animals are subjected to sham or injury treatment and collected at 1, 3, 7, 10, or 15 days post-injury. There is also a "control" condition in which samples are collected before injury (time 0). Animals are age/sex/genotype matched, so the control at time 0 is meant to be a baseline measurement for both the sham and injury conditions. There are 3 biological replicates for each condition × time combination, and each replicate is N = 20 animals.
| experimental condition | time (days post-injury) |
|------------------------|-------------------------|
| control | 0 |
| sham | 1 |
| sham | 3 |
| sham | 7 |
| sham | 10 |
| sham | 15 |
| injury | 1 |
| injury | 3 |
| injury | 7 |
| injury | 10 |
| injury | 15 |
Using the time-course example at http://www.bioconductor.org/help/workflows/rnaseqGene/#time, I used the likelihood-ratio test (LRT) to identify differentially expressed genes with the following code:

```r
ddsMatrix <- DESeqDataSetFromMatrix(countData = countdata,
                                    colData = coldata,
                                    design = ~ condition + days + condition:days)
ddsTC <- DESeq(ddsMatrix, test = "LRT", reduced = ~ condition + days)
```
Not surprisingly, I ran into the "Model matrix not full rank" error because of missing samples (no control samples for days 1-15; no sham/injury samples at time 0). I tried the following:
1. Rename the control samples as sham and then delete the empty all-zero columns. This doesn't work because the sham time-0 data becomes the reference level/intercept, so there is no all-zero column to delete in the model matrix.
2. Exclude the control data from the analysis. This removes the error and allows the DE analysis to run; however, it seems like the sham vs injury day-1 data becomes the "time 0" reference, which is problematic because significantly DE genes for later days (3, 7, 10, 15) are determined relative to the sham vs injury expression on day 1. Since many genes are differentially expressed due to injury on day 1, this approach really muddles interpretation of the data.
Since what I really want is to use the control time-0 data as a baseline measurement for both the sham and injury conditions, I decided to duplicate the control data in the count matrix (3 biological replicates of control time 0 -> 2 x 3 replicates of control time 0) and rename 3 as "sham time 0" and 3 as "injury time 0". I'm comfortable with doing this in terms of the biology because I genuinely expect the sham and injury groups to be identical before the experiment. Statistically, I don't know if this is appropriate. I consulted with my local biostatistician, and his suggestion was to contact the DESeq authors to see how the program would handle this.
Questions:
Is duplicating the control data an appropriate solution or does it introduce bias during the DESeq normalization?
Is there a better way to address the problem?
Thank you!
Removed my first response because I realized it was answered in your response. I'll go ahead and drop the control_0 data from the analysis and try your first suggestion of constructing a contrast between sham and injury on each day. Thank you!
My solution for now is:
1. Remove the control time-0 data.
2. Use a `~ group` model, with levels `injury_1`, `injury_3`, `injury_7`, etc.
3. Set `design = ~ group` and run `DESeq()` (no test specification, i.e. the default Wald test).
4. Use a contrast to compare conditions on each day, e.g. `results(dds, contrast = c("group", "injury_1", "sham_1"))`.
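As a concrete sketch of steps 1-4 (the sample layout and object names are illustrative, not taken from the actual dataset):

```r
# Build the combined group factor: one level per condition/day cell
# (illustrative colData; the real one comes from the experiment's sample sheet).
coldata <- data.frame(
  condition = rep(c("sham", "injury"), each = 5),
  day       = rep(c(1, 3, 7, 10, 15), times = 2)
)
coldata$group <- factor(paste(coldata$condition, coldata$day, sep = "_"))
levels(coldata$group)  # "injury_1" ... "sham_7" (alphabetical order)

# With DESeq2 (not run here), this factor is then used directly as the design:
#   dds <- DESeqDataSetFromMatrix(countData = countdata,
#                                 colData = coldata, design = ~ group)
#   dds <- DESeq(dds)  # default Wald test; no test = "LRT" needed
#   res <- results(dds, contrast = c("group", "injury_1", "sham_1"))
```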
You seem to caution against this approach above - could you clarify? My preliminary analysis goal is to identify the genes that are DE at each time point after injury, rather than identifying which genes are different between time points. A direct pairwise comparison would give me this information and I could ask more nuanced questions in the future.
Yes, if your preliminary objective is as you say, then the group approach is best, in my opinion. My hesitancy was due to your quote, "Since many genes are differentially expressed due to injury on day 1, this approach really muddles interpretation of the data", as those genes will still persist. If you're tempted to filter later days' results by the day-1 differentials, then I'd implore you to at least try a `contrast = list(c("injury_7", "sham_1"), c("injury_1", "sham_7"))` approach, which looks at fold changes above and beyond those seen at day 1, and doesn't suffer from the double exposure to statistical errors that a Venn-diagram-like approach does. But as you say, these aren't your immediate concern.
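To make the arithmetic of that contrast concrete: it rewrites the "change beyond day 1" comparison as a sum-versus-sum of group means. A toy base-R illustration (made-up log2 means, not real results):

```r
# (injury_7 - sham_7) - (injury_1 - sham_1) rearranges to
# (injury_7 + sham_1) - (injury_1 + sham_7) -- the form used in contrast = list().
mu <- c(sham_1 = 1.0, injury_1 = 3.0,   # large injury effect already at day 1
        sham_7 = 1.2, injury_7 = 5.0)   # bigger injury effect at day 7

direct    <- (mu["injury_7"] - mu["sham_7"]) - (mu["injury_1"] - mu["sham_1"])
list_form <- (mu["injury_7"] + mu["sham_1"]) - (mu["injury_1"] + mu["sham_7"])
stopifnot(all.equal(unname(direct), unname(list_form)))

# In DESeq2 this corresponds (roughly) to
#   results(dds, contrast = list(c("injury_7", "sham_1"),
#                                c("injury_1", "sham_7")))
# with the coefficient names taken from resultsNames(dds).
```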
Another issue with the 'group' approach is shrinkage. I'd use `DESeq(..., betaPrior = FALSE)` in this instance (if it isn't already the default setting in your version of DESeq2), because you've got 45 possible pairwise comparisons between the 10 conditions, and so the moderation of the fold-change estimates is unlikely to be specific to the comparisons you're doing.
I'm using DESeq2 v1.16.1, so I believe the default is `betaPrior = FALSE` and I'll need to use `lfcShrink()`. When I do this, the number of significant genes is MUCH higher than with the unshrunken LFC results, which confused me, but after reading "New function lfcShrink() in DESeq2" and "A: DESeq2 lfcShrink() usage of coef vs. usage of contrast" it seems like I should "use the p-values from the unshrunk LFC and use the shrunken LFC for visualization or ranking of genes". It looks like the `lfcShrink()` Wald test p-value is not actually for the groups I'm comparing. Is this something I am doing wrong, or is it the default behavior? If the latter, do I go with the suggestion of using the p-values from the unshrunken LFC (calling `results()` with `contrast`)? Thanks
With `lfcShrink()`:

```r
dds <- DESeq(ddsMatrix)
res <- results(dds, alpha = 0.05)
resday10.shr <- lfcShrink(dds = dds, contrast = c("group", "injury_10", "sham_10"), res = res)
head(resday10.shr)
summary(resday10.shr)
```
```
log2 fold change (MAP): group injury_10 vs sham_10
Wald test p-value: group injury 7 vs sham 1
DataFrame with 6 rows and 5 columns
                baseMean log2FoldChange        stat    pvalue      padj
               <numeric>      <numeric>   <numeric> <numeric> <numeric>
FBgn0000003    0.2321838   -0.008192420  0.10395375 0.9172061        NA
FBgn0000008  685.2158454   -0.167364213  0.08979044 0.9284537 0.9668097
FBgn0000014    0.1849456   -0.007123067  0.24597089 0.8057048        NA
FBgn0000015    0.6192357   -0.034028635 -0.91245317 0.3615302        NA
FBgn0000017 2896.5305983    0.041048921 -1.77995270 0.0750837 0.2137181
FBgn0000018  181.4742661    0.024009145  0.35328189 0.7238771 0.8501377

out of 15414 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)   : 927, 6%
LFC < 0 (down) : 1169, 7.6%
outliers [1]   : 18, 0.12%
low counts [2] : 4154, 27%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
```
With `results()`:

```r
resday10 <- results(dds, contrast = c("group", "injury_10", "sham_10"))
head(resday10)
summary(resday10)
```
```
log2 fold change (MLE): group injury_10 vs sham_10
Wald test p-value: group injury_10 vs sham_10
DataFrame with 6 rows and 6 columns
                baseMean log2FoldChange      lfcSE       stat    pvalue      padj
               <numeric>      <numeric>  <numeric>  <numeric> <numeric> <numeric>
FBgn0000003    0.2321838    -1.09652239 6.01306430 -0.1823567 0.8553028        NA
FBgn0000008  685.2158454    -0.17680818 0.12409102 -1.4248266 0.1542073 0.8434779
FBgn0000014    0.1849456    -1.09651939 6.45179408 -0.1699557 0.8650450        NA
FBgn0000015    0.6192357    -1.06650350 2.67733931 -0.3983445 0.6903762        NA
FBgn0000017 2896.5305983     0.04195420 0.07752053  0.5412012 0.5883689 0.9769969
FBgn0000018  181.4742661     0.02602603 0.15730572  0.1654487 0.8685908 0.9926741

out of 15414 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)   : 172, 1.1%
LFC < 0 (down) : 11, 0.071%
outliers [1]   : 18, 0.12%
low counts [2] : 4451, 29%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
```
I've been following this thread as I have some similar data that I'm wrestling with. With regard to the lfcShrink Wald test p-value, I think you passed the wrong `DESeqResults` object to the `res` parameter of `lfcShrink()`. Instead of `res = res` (the results for the default contrast), try `res = resday10` (the unshrunken results for the same day-10 contrast). I suspect that's contributing to the differences in the number of significant genes. In my dataset, the number of sig genes is the same pre- and post-`lfcShrink()`.

You're absolutely right - I passed the wrong `DESeqResults` object. After correction, the number of sig genes is the same pre- and post-`lfcShrink()`. Thank you!
I am working with a similar dataset to the OP's. Is there any chance you could elaborate more on how to do the following?

> The way I'd approach this is to create a model matrix with a design of `~ condition + day + condition:day`, and then remove the column for injury at time 0 -- this makes the assumption that there would be no difference at time 0 for the injury samples. Sometimes it's useful to create these samples to control for such differences, but here it sounds like you don't have them. Then you can pass this matrix to the `full` argument of `DESeq()`.

Thank you very much, Michael, for your reply. How do you remove the column for injury at time 0? Executing the above code does not result in a column for injury at time 0. I tried to delete the row names corresponding to the samples injury01, injury02, injury03 either from `coldata` or from the matrix itself, but then I receive the "model matrix is not full rank" error. Thanks for your help.
One way to tell which column represents the effect I'm referring to: it is the one that has all zeros in `mm`.

Thanks again for your help. It must be something that I am doing wrong, because in my case there are no columns with all zeros.
So, this is my samples list (note that I replicated the control as Sham0 and Injury0)
Then, I used
Then I constructed the following model.matrix
where colData is just a data frame of the sample list above.
Is there anything obvious that I am missing?
Also, would you recommend the same strategy if we considered the control at day 0 to be different from both the Sham samples and the Injury samples? In other words, if the control is considered as 'intact' but that the Sham group has an operation and the Injury group has an operation and injury.
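For what it's worth, whether an all-zero column shows up in the model matrix depends on the factors' reference levels. A base-R sketch with a hypothetical layout (2 replicates per cell, days 0/1/3, no injury samples at day 0; not the actual colData from this thread) shows both situations:

```r
# Hypothetical layout: 2 replicates per condition/day cell, no injury at day 0.
coldata <- data.frame(
  condition = factor(rep(c("sham", "sham", "sham", "injury", "injury"), each = 2),
                     levels = c("sham", "injury")),
  day = factor(rep(c("0", "1", "3", "1", "3"), each = 2))
)

# With day 0 as the reference level, the injury-at-day-0 effect is absorbed into
# the conditioninjury column, so no column is all zeros (as observed above),
# but the matrix is rank deficient ("model matrix is not full rank"):
mm0 <- model.matrix(~ condition + day + condition:day, coldata)
qr(mm0)$rank < ncol(mm0)   # TRUE

# Relevel day so day 0 is NOT the reference; the missing injury/day-0 cell now
# shows up as an all-zero interaction column, which can then be dropped:
coldata$day <- relevel(coldata$day, ref = "1")
mm <- model.matrix(~ condition + day + condition:day, coldata)
all.zero <- apply(mm, 2, function(x) all(x == 0))
colnames(mm)[all.zero]              # "conditioninjury:day0"
mm.full <- mm[, !all.zero]
qr(mm.full)$rank == ncol(mm.full)   # TRUE: full rank after dropping the column

# The reduced matrix would then be passed to DESeq() (not run here):
#   dds <- DESeq(dds, full = mm.full)
```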
What threw me off is when you described your dataset as similar to the OP. Yours is just like the example in the workflow and you can follow that code, and ignore this thread.
The OP is different in a critical way for analysis in that: “control at time 0 is meant to be a baseline measurement for both the sham and injury conditions”.
Well, that is exactly what I have. Control (time 0) is the baseline measurement for both conditions. I don't have two controls (I just duplicated it as Sham0 and Injury0). Then Sham is treated with surgery, and Injury is treated with surgery and injury.
Code them both as sham day 0 and follow the advice in the thread then.
Many thanks, Michael, once again for your help.
I have been enjoying some background reading to avoid asking further questions (without any clue) and here is the complete code which I arrived at.
By the way, if I do `design(dds)`, then it gives `~Condition + Day` and not the custom model matrix for `~Condition + Day + Condition:Day`. I hope this is how we can go about solving this unbalanced experiment (at day 0).
But I am still wondering how you would do the `group = p ∗ e (condition,day)` which Gavin suggested above. What do `p` and `e` mean here? I would love to try the other suggestions as well. :)

It's ok that `design(dds)` gives a formula. `DESeq()` prioritizes what's given to `full` and only uses `design(dds)` if nothing is provided in the function call. FWIW, you can also provide a matrix to `design(dds)` if you like.

As far as the choice among various designs for your experiment, that goes beyond the level of software support I have time to provide, and I'd recommend collaborating with a statistician to work through these questions. Here on the support site I'm primarily trying to solve software-related questions and issues, whereas there are limitless statistical analysis and design questions which I can't commit myself to helping solve.
I am sure that everyone appreciates both the time and the efforts you're putting here. So, thank you for this. I totally understand that you might not have the time to comment on specific questions beyond the level of software support.
For me personally, I really think that it is useful to discuss different approaches to solving a problem, and by leaving the question open I was hoping for Gavin (or anyone else) to give some hints.
It turns out that the Bioconductor support site's rendering of LaTeX equations, at least on my system, is mangling the formatting. What appears to be `group = p ∗ e (condition,day)` is actually `group = paste(condition, day)`.
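Decoded, that suggestion is just the usual way of building the combined group factor; a minimal base-R sketch (toy vectors, not the actual samples):

```r
# paste() the two variables together, then convert to a factor;
# sep = "_" gives the "injury_1"-style level names used earlier in the thread.
condition <- c("sham", "sham", "injury", "injury")
day       <- c(1, 7, 1, 7)
group <- factor(paste(condition, day, sep = "_"))
levels(group)  # "injury_1" "injury_7" "sham_1" "sham_7"
```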