I'm trying to perform a time course analysis of RNAseq data to identify genes that are differentially expressed as a function of experimental condition/how they change over time however, I have an unbalanced experimental design. I've read through the documentation, vignettes and various posts and tried the most straightforward work arounds (not a biostatistician; new to RNAseq analysis) but none fully addressed my experimental set-up/analysis needs.
In the study, animals are subjected to sham or injury and collected at 1, 3, 7, 10 or 15 days post-injury. There is also a "control" condition in which samples are collected before injury (time 0). Animals are age/sex/genotype matched so the control at time 0 is meant to be a baseline measurement for both the sham and injury conditions. There are 3 biological replicates for each condition x time and each replicate is N = 20 animals.
|experimental condition||time (days post-injury)|
Using the example at: http://www.bioconductor.org/help/workflows/rnaseqGene/#time, I used the LRT to identify differentially expressed genes using the following code:
ddsMatrix <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ condition + days + condition:days)
ddsTC <- DESeq(ddsMatrix, test="LRT", reduced = ~ condition + days)
Not surprisingly, I ran into the “Model matrix not full rank” error because of missing samples (no control for time 1-15 days; no sham/injury for time 0). I tried the following:
1. re-name the control samples as sham then delete the empty zero columns - doesn't work because the sham time 0 data is the reference level/intercept so there is no zeros column to delete in the model matrix.
2. exclude the control data from the analysis - this removes the error and allows for DE analysis however, it seems like the sham vs injury day 1 data becomes the "time 0" reference which is problematic because significantly DE genes for later days (3, 7, 10, 15) are determined based on the sham vs injury expression on day 1. Since many genes are differentially expressed due to injury on day 1, this approach really muddles interpretation of the data.
Since what I really want is to use the control time 0 data as a baseline measurement for both the sham and injury conditions I decided to duplicate the control data in the count matrix (3 biological replicates of control time 0 -> 2 x 3 replicates of control time 0) and re-name 3 as "sham time 0" and 3 as "injury time 0". I'm comfortable with doing this in terms of the biology because I genuinely expect the sham and injury groups to be identical before the experiment. Statistically, I don't know if this is appropriate. I consulted with my local biostatistician and his suggestion was to contact DESeq authors to see how the program would handle this.
Is duplicating the control data an appropriate solution or does it introduce bias during the DESeq normalization?
Is there a better way to address the problem?