Time course analysis in DESeq2: handling common baseline for two experimental conditions
1
4
Entering edit mode
c.nicole ▴ 50
@cnicole-14045
Last seen 5.8 years ago
Philadelphia

Hello,

I'm trying to perform a time course analysis of RNAseq data to identify genes that are differentially expressed as a function of experimental condition/how they change over time however, I have an unbalanced experimental design. I've read through the documentation, vignettes and various posts and tried the most straightforward work arounds (not a biostatistician; new to RNAseq analysis) but none fully addressed my experimental set-up/analysis needs.

In the study, animals are subjected to sham or injury and collected at 1, 3, 7, 10 or 15 days post-injury. There is also a "control" condition in which samples are collected before injury (time 0). Animals are age/sex/genotype matched so the control at time 0 is meant to be a baseline measurement for both the sham and injury conditions. There are 3 biological replicates for each condition x time and each replicate is N = 20 animals.

experimental condition time (days post-injury)
control 0
sham 1
sham 3
sham 7
sham 10
sham 15
injury 1
injury 3
injury 7
injury 10
injury 15

Using the example at: http://www.bioconductor.org/help/workflows/rnaseqGene/#time, I used the LRT to identify differentially expressed genes using the following code:

ddsMatrix <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata, design = ~ condition + days + condition:days)
ddsTC <- DESeq(ddsMatrix, test="LRT", reduced = ~ condition + days)

Not surprisingly, I ran into the “Model matrix not full rank” error because of missing samples (no control for time 1-15 days; no sham/injury for time 0). I tried the following:

1. re-name the control samples as sham then delete the empty zero columns - doesn't work because the sham time 0 data is the reference level/intercept so there is no zeros column to delete in the model matrix.

2. exclude the control data from the analysis - this removes the error and allows for DE analysis however, it seems like the sham vs injury day 1 data becomes the "time 0" reference which is problematic because significantly DE genes for later days (3, 7, 10, 15) are determined based on the sham vs injury expression on day 1. Since many genes are differentially expressed due to injury on day 1, this approach really muddles interpretation of the data.

Since what I really want is to use the control time 0 data as a baseline measurement for both the sham and injury conditions I decided to duplicate the control data in the count matrix (3 biological replicates of control time 0 -> 2  x 3 replicates of control time 0) and re-name 3 as "sham time 0" and 3 as "injury time 0". I'm comfortable with doing this in terms of the biology because I genuinely expect the sham and injury groups to be identical before the experiment. Statistically, I don't know if this is appropriate. I consulted with my local biostatistician and his suggestion was to contact DESeq authors to see how the program would handle this. 

Questions:

Is duplicating the control data an appropriate solution or does it introduce bias during the DESeq normalization?

Is there a better way to address the problem? 

 

Thank you!

deseq2 timecourse multiple factor design RNAseq analysis differential gene expression • 4.2k views
ADD COMMENT
1
Entering edit mode
Gavin Kelly ▴ 690
@gavin-kelly-6944
Last seen 4.7 years ago
United Kingdom / London / Francis Crick…

My approach would be to drop the 'control' for the purposes of detecting differential expression - your second approach.  You can construct a contrast that will detect changes at the later timepoints between sham vs injury, and you can do this either in absolute terms (so injury_15 vs sham_15), or over and above the earliest common time point (injury15:sham15 - injury1:sham1).   It's not quite clear which you want, but duplicating the control to be a proxy time0 in both sham and injury would lead to it cancelling out in any of the obvious comparisons between injury and sham.

As you say the many differences at day 1 confuses things, it seems you're probably wanting the interaction approach, so setting up a wald test with a model of  ~ condition + days +condition:days, and looking at the interaction terms using 'results' (and assuming day1 is the baseline level) will give you genes that are 'more differential' at later days that they are at time 1 (I put 'more' in quotes, as it could be less differential, or differential in the opposite direction...) - and the day coefficient would give you the problematic differential-at-day-0.

One use of the control group might be as a baseline for effect sizes - so if you set up a `group=paste(condition,day)` condition, and relevelled it so control was the baseline, you could use a ~group  model to estimate lfc of all day_condition combinations against the control.  So you could then say (using the 'drop control' approach)  gene X is differential at day7 over and above the day1 effect, and (using the 'group' approach') it shows a fold change against the control of +1.6 in the sham and +2.9 in the injury.

You could use the 'group' approach for direct pairwise comparisons, but you'd need to think carefully about how this was compensating for the muddling day1 changes.

If you duplicate the control group (which probably has lower within-group variance than the other conditions), you'll be biasing the pooled variance estimates towards this, which is not a good thing.  You may want to point this out to your local biostatistician ;)

 

ADD COMMENT
0
Entering edit mode

Removed my first response because I realized it was answered in your response. I'll go ahead and drop the control_0 data from the analysis and try your first suggestion of constructing a contrast between sham and injury on each day. Thank you!

ADD REPLY
0
Entering edit mode

My solution for now is:

1. Remove the control 0 data

2. Use a ~group model --> injury_1, injury_3, injury_7, etc

3. Use design = ~group and run DESeq() [no test specification]

4. Use contrast to compare conditions on each day, ie: results(dds, contrast=c("group", "injury_1", "sham_1")

You seem to caution against this approach above - could you clarify? My preliminary analysis goal is to identify the genes that are DE at each time point after injury, rather than identifying which genes are different between time points. A direct pairwise comparison would give me this information and I could ask more nuanced questions in the future. 

ADD REPLY
0
Entering edit mode

Yes, if you're preliminary objective is as you say, then the group approach is best, in my opinion.  My hesitancy was due to your Since many genes are differentially expressed due to injury on day 1, this approach really muddles interpretation of the data quote, as these will still persist.  If you're tempted to filter later days results by the day1 differentials, then I'd implore you to at least try a 'contrast=list(c("injury_7", "sham_1"), c("injury1", "sham_7"))' approach, which looks at fold-changes above and beyond those seen at day1, and doesn't suffer from the double exposure to statistical errors that a venn diagram-like approach does.  But as you say, these aren't your immediate concern. 

Another issue in the 'group' approach is shrinkage.  I'd use DESeq(..., betaPrior=FALSE) in this instance (if it isn't already the default setting in your version of DESeq2), because you've got 45 possible pairwise comparisons between the 10 conditions, and so the moderation of the estimation of fold-changes is unlikely to be specific to the comparisons you're doing.

ADD REPLY
0
Entering edit mode

I'm using DESeq2 v 1.16.1 so I believe the default is betaPrior = FALSE so I'll need to use the lfcShrink(). When I do this, the number of significant genes is MUCH higher than results using the unshrunken LFC and I was confused but after reading New function lfcShrink() in DESeq2 and A: DESeq2 lfcShrink() usage of coef vs. usage of contrast it seems like I should "use the p-values from the unshrunk LFC and use the shrunken LFC for visualization or ranking of genes". It looks like the lfcshrink() Wald test p-value is not actually for the groups I'm comparing - is this something I am doing wrong or the default in which case, do I go with the suggestion of using the p-values from unshrunk LFC [calling results with contrast]? Thanks

With lfcshrink():

dds <- DESeq(ddsMatrix) 
res <-results(dds, alpha = 0.05)

resday10.shr <- lfcShrink(dds=dds, contrast=c("group", "injury_10", "sham_10"), res=res)
head(resday10.shr)
summary(resday10.shr)

log2 fold change (MAP): group injury_10 vs sham_10 
Wald test p-value: group injury 7 vs sham 1 
DataFrame with 6 rows and 5 columns
                baseMean log2FoldChange        stat    pvalue      padj
               <numeric>      <numeric>   <numeric> <numeric> <numeric>
FBgn0000003    0.2321838   -0.008192420  0.10395375 0.9172061        NA
FBgn0000008  685.2158454   -0.167364213  0.08979044 0.9284537 0.9668097
FBgn0000014    0.1849456   -0.007123067  0.24597089 0.8057048        NA
FBgn0000015    0.6192357   -0.034028635 -0.91245317 0.3615302        NA
FBgn0000017 2896.5305983    0.041048921 -1.77995270 0.0750837 0.2137181
FBgn0000018  181.4742661    0.024009145  0.35328189 0.7238771 0.8501377

out of 15414 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 927, 6% 
LFC < 0 (down)   : 1169, 7.6% 
outliers [1]     : 18, 0.12% 
low counts [2]   : 4154, 27% 
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

With results():

resday10 <- results(dds, contrast=c("group", "injury_10", "sham_10"))
head(resday10.shr)
summary(resday10)

log2 fold change (MLE): group injury_10 vs sham_10 
Wald test p-value: group injury_10 vs sham_10 
DataFrame with 6 rows and 6 columns
                baseMean log2FoldChange      lfcSE       stat    pvalue      padj
               <numeric>      <numeric>  <numeric>  <numeric> <numeric> <numeric>
FBgn0000003    0.2321838    -1.09652239 6.01306430 -0.1823567 0.8553028        NA
FBgn0000008  685.2158454    -0.17680818 0.12409102 -1.4248266 0.1542073 0.8434779
FBgn0000014    0.1849456    -1.09651939 6.45179408 -0.1699557 0.8650450        NA
FBgn0000015    0.6192357    -1.06650350 2.67733931 -0.3983445 0.6903762        NA
FBgn0000017 2896.5305983     0.04195420 0.07752053  0.5412012 0.5883689 0.9769969
FBgn0000018  181.4742661     0.02602603 0.15730572  0.1654487 0.8685908 0.9926741

out of 15414 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 172, 1.1% 
LFC < 0 (down)   : 11, 0.071% 
outliers [1]     : 18, 0.12% 
low counts [2]   : 4451, 29% 
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

ADD REPLY
0
Entering edit mode

I've been following this thread as I have some similar data that I'm wrestling with. With regards to the lfcShrink Wald test p-value, I think you passed the wrong DESeqResults object to the res parameter of lfcShrink.

Instead of

resday10.shr <- lfcShrink(dds=dds, contrast=c("group", "injury_10", "sham_10"), res=res)

try

resday10.shr <- lfcShrink(dds=dds, contrast=c("group", "injury_10", "sham_10"), res=resday10)

I suspect that's contributing to the differences in the number of significant genes. In my dataset, the number of sig genes is the same pre- and post-lfcShrink().

ADD REPLY
1
Entering edit mode

You're absolutely right - I passed the wrong DESeqResults object. After correction, the number of sig genes is the same pre and post lfcShrink(). Thank you!

ADD REPLY
0
Entering edit mode

I am working with a similar dataset to the one of the OP. Is there any chance to elaborate more on how to do:

One use of the control group might be as a baseline for effect sizes - so if you set up a group = p ∗ e (condition,day) condition, and relevelled it so control was the baseline, you could use a ~group model to estimate lfc of all day_condition combinations against the control.

ADD REPLY
0
Entering edit mode

The way I'd approach this is to create a model matrix with design of ~condition + day + condition:day, and then remove the column for injury at time 0 -- this makes the assumption that there would be no difference at time 0 for the injury samples. Sometimes it's useful to create these samples to control for such differences, but here it sounds like you don't have them.

Then you can pass this matrix to the full argument of DESeq().

ADD REPLY
0
Entering edit mode

Thank you very much, Michael, for your reply.

How do you remove the column for injury at time 0?

mm <- model.matrix(~ condition + day + condition:day)
# or
mm <- model.matrix(~ condition + day + condition:day, coldata)

Executing the above code does not result in a column for injury at time 0.

I tried to delete the row names corresponding to the samples injury01, injury02, injury02 either from coldata or the matrix itself, but then I am receiving the the model matrix is not full rank error.

Thanks for your help.

ADD REPLY
0
Entering edit mode

One way to tell which column represents the effect I’m referring to is the one that has all zeros in mm.

ADD REPLY
0
Entering edit mode

Thanks again for your help. It must be something that I am doing wrong. Because in my case there are no columns with all zeros.

So, this is my samples list (note that I replicated the control as Sham0 and Injury0)

Sample Replicate Condition Day
Sham_0_1         1    Sham  0
Sham_0_2         2    Sham  0
Sham_0_3         3    Sham  0
Sham_1_1         1    Sham  1
Sham_1_2         2    Sham  1
Sham_1_3         3    Sham  1
Sham_3_1         1    Sham  3
Sham_3_2         2    Sham  3
Sham_3_3         3    Sham  3
Sham_7_1         1    Sham  7
Sham_7_2         2    Sham  7
Sham_7_3         3    Sham  7
Injury_0_1         1    Injury  0
Injury_0_2         2    Injury  0
Injury_0_3         3    Injury  0
Injury_1_1         1    Injury  1
Injury_1_2         2    Injury  1
Injury_1_3         3    Injury  1
Injury_3_1         1    Injury  3
Injury_3_2         2    Injury  3
Injury_3_3         3    Injury  3
Injury_7_1         1    Injury  7
Injury_7_2         2    Injury  7
Injury_7_3         3    Injury  7

Then, I used

Condition = samples$Conditon
Day = samples$Day

Then I constructed the following model.matrix

mm <- model.matrix(~ Condition + Day + Condition:Day, colData)

where colData is just a data frame of the sample list above.

Is there anything obvious that I am missing?

Also, would you recommend the same strategy if we considered the control at day 0 to be different from both the Sham samples and the Injury samples? In other words, if the control is considered as 'intact' but that the Sham group has an operation and the Injury group has an operation and injury.

ADD REPLY
0
Entering edit mode

What threw me off is when you described your dataset as similar to the OP. Yours is just like the example in the workflow and you can follow that code, and ignore this thread.

The OP is different in a critical way for analysis in that: “control at time 0 is meant to be a baseline measurement for both the sham and injury conditions”.

ADD REPLY
0
Entering edit mode

Well, that is exactly what I have. Control (time 0) is the baseline measurement for both conditions. I don't have two controls (I just duplicated it as Sham0 and Injury0). Then Sham is treated with surgery, and Injury is treated with surgery and injury.

Day0               Control
                 /          \
         surgery only   surgery + injury
Day1         Sham           Injury
Day3         Sham           Injury
Day7         Sham           Injury
ADD REPLY
0
Entering edit mode

Code them both as sham day 0 and follow the advice in the thread then.

ADD REPLY
0
Entering edit mode

Many thanks, Michael, once again for your help.

I have been enjoying some background reading to avoid asking further questions (without any clue) and here is the complete code which I arrived at.

# construct the samples list
Replicate  <- factor(rep(1:3,times=8))

Day    <- factor(rep(c("Day00","Day03","Day07","Day15"),each=3))
Day    <- relevel(Day,"Day03")

Condition  <- factor(rep(c("Sham","Injury"),each=12))
Condition <- relevel(Condition, "Sham")

Sample  <- paste(Condition, Day, Replicate, sep="_")

samples_df <- data.frame(Sample, Replicate, Condition, Day)
as.data.frame(samples_df)

# remove samples related to Injury 0 while keeping Sham 0
samples_df_red <- samples_df[-c(13:15),]
as.data.frame(samples_df_red)

# construct a DESeqDataSet object
at_dds = DESeqDataSetFromMatrix(read_counts,
                samples_df_red,
                design=~Condition + Day,
                                rowRanges = read_ranges)

# construct a custom matrix
mm <- model.matrix(~Condition + Day + Condition:Day, samples_df_red)

# check which column in the matrix is all zeros
all.zero <- apply(mm, 2, function(x) all(x==0))
# get its index
idx <- which(all.zero)
# delete the column with all zeros from the matrix
mm = mm[, -idx]
# run DESeq()
dds <- DESeq(at_dds, full=mm)

By the way, if I do design(dds) then it gives ~Condition + Day and not the custom matrix model ~Condition + Day + Condition:Day.

I hope this is how we can go about to solve this unbalanced experiment (at day0).

But I am still wondering how would you do a group = p ∗ e (condition,day) condition which Gavin suggested above. What does p and e mean here? I would love to try the other suggestions as well. :)

ADD REPLY
0
Entering edit mode

It's ok that design(dds) gives a formula. DESeq prioritizes what's given to full and only uses design(dds) if nothing is provided in the function call. FWIW You can also provide a matrix to design(dds) if you like.

As far as choice among various designs for your experiment, that goes beyond the level of software support I have time to provide and I'd recommend collaborating with a statistician to work through these questions. Here on the support site I'm primarily trying to solve software related questions and issues, whereas there are limitless statistical analysis and design questions which I can't commit myself to helping to solve.

ADD REPLY
0
Entering edit mode

I am sure that everyone appreciates both the time and the efforts you're putting here. So, thank you for this. I totally understand that you might not have the time to comment on specific questions beyond the level of software support.

For me personally, I really think that it is useful to discuss different approaches of solving a problem and by leaving the question open, I was hoping for Gavin (or anyone else) to give some hints.

ADD REPLY
0
Entering edit mode

It turns out that the rendering system of the Bioconductor support site for the LaTex equations, at least on my system, is messing up the format.

What appears to be group = p *e (condition,day) is actually group=paste(condition,day).

ADD REPLY

Login before adding your answer.

Traffic: 644 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6