Question: Time course analysis in DESeq2: handling common baseline for two experimental conditions
c.nicole wrote:

Hello,

I'm trying to perform a time course analysis of RNAseq data to identify genes that are differentially expressed as a function of experimental condition and how they change over time; however, I have an unbalanced experimental design. I've read through the documentation, vignettes, and various posts and tried the most straightforward workarounds (I'm not a biostatistician and am new to RNAseq analysis), but none fully addressed my experimental set-up/analysis needs.

In the study, animals are subjected to sham or injury and collected at 1, 3, 7, 10, or 15 days post-injury. There is also a "control" condition in which samples are collected before injury (time 0). Animals are age/sex/genotype matched, so the control at time 0 is meant to be a baseline measurement for both the sham and injury conditions. There are 3 biological replicates for each condition x time combination, and each replicate is N = 20 animals.

experimental condition    time (days post-injury)
control                   0
sham                      1
sham                      3
sham                      7
sham                      10
sham                      15
injury                    1
injury                    3
injury                    7
injury                    10
injury                    15

Using the example at: http://www.bioconductor.org/help/workflows/rnaseqGene/#time, I used the LRT to identify differentially expressed genes using the following code:

# full model with the condition:days interaction; LRT against the reduced model without it
ddsMatrix <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata,
                                    design = ~ condition + days + condition:days)
ddsTC <- DESeq(ddsMatrix, test = "LRT", reduced = ~ condition + days)

Not surprisingly, I ran into the “Model matrix not full rank” error because of missing samples (no control for time 1-15 days; no sham/injury for time 0). I tried the following:

1. Re-name the control samples as sham, then delete the empty (all-zero) columns. This doesn't work because the sham time 0 data becomes the reference level/intercept, so there is no all-zero column to delete in the model matrix.

2. Exclude the control data from the analysis. This removes the error and allows for DE analysis; however, it seems like the sham vs. injury day 1 data then becomes the "time 0" reference, which is problematic because significantly DE genes for later days (3, 7, 10, 15) are determined relative to the sham vs. injury expression on day 1. Since many genes are differentially expressed due to injury on day 1, this approach really muddles interpretation of the data.

Since what I really want is to use the control time 0 data as a baseline measurement for both the sham and injury conditions, I decided to duplicate the control data in the count matrix (3 biological replicates of control time 0 -> 2 x 3 replicates of control time 0) and re-name 3 as "sham time 0" and 3 as "injury time 0". I'm comfortable doing this in terms of the biology because I genuinely expect the sham and injury groups to be identical before the experiment. Statistically, I don't know whether this is appropriate. I consulted my local biostatistician, and his suggestion was to contact the DESeq2 authors to see how the program would handle this.
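For concreteness, the duplication I'm describing would look roughly like this (object and column names here are illustrative, not my actual code):

ctrl <- which(coldata$condition == "control")          # the 3 control replicates at time 0

countdata_dup <- cbind(countdata, countdata[, ctrl])   # append copies of the control counts
coldata_dup   <- rbind(coldata, coldata[ctrl, ])

coldata_dup$condition[ctrl] <- "sham"                                     # originals re-labelled as sham, time 0
coldata_dup$condition[(nrow(coldata) + 1):nrow(coldata_dup)] <- "injury"  # duplicates re-labelled as injury, time 0

colnames(countdata_dup) <- rownames(coldata_dup) <-
  make.unique(c(colnames(countdata), colnames(countdata)[ctrl]))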

Questions:

Is duplicating the control data an appropriate solution or does it introduce bias during the DESeq normalization?

Is there a better way to address the problem? 

 

Thank you!

Gavin Kelly wrote:

My approach would be to drop the 'control' for the purposes of detecting differential expression - your second approach. You can construct a contrast that will detect changes at the later timepoints between sham vs injury, and you can do this either in absolute terms (so injury_15 vs sham_15) or over and above the earliest common time point ((injury_15 vs sham_15) - (injury_1 vs sham_1)). It's not quite clear which you want, but duplicating the control to be a proxy time 0 in both sham and injury would lead to it cancelling out in any of the obvious comparisons between injury and sham.

As you say, the many differences at day 1 confuse things, so it seems you probably want the interaction approach: setting up a Wald test with a model of ~ condition + days + condition:days and looking at the interaction terms using 'results' (and assuming day 1 is the baseline level) will give you genes that are 'more differential' at later days than they are at day 1 (I put 'more' in quotes, as it could be less differential, or differential in the opposite direction...) - and the condition coefficient would give you the problematic differential at the baseline day.
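A minimal sketch of that set-up, assuming your coldata has factors called condition and days with sham and day 1 as the reference levels; the interaction-term names below are only what DESeq2 would typically generate, so check resultsNames() for the actual ones:

library(DESeq2)

# condition and days as factors, with sham and day 1 as the reference levels
coldata$condition <- relevel(factor(coldata$condition), ref = "sham")
coldata$days      <- relevel(factor(coldata$days), ref = "1")

dds_int <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata,
                                  design = ~ condition + days + condition:days)
dds_int <- DESeq(dds_int)     # default Wald tests
resultsNames(dds_int)         # interaction terms look something like "conditioninjury.days7"

# genes whose injury-vs-sham difference at day 7 differs from that at day 1
res_int_day7 <- results(dds_int, name = "conditioninjury.days7")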

One use of the control group might be as a baseline for effect sizes - so if you set up a `group = paste(condition, day)` factor and relevelled it so that control was the baseline, you could use a ~ group model to estimate the LFC of all day/condition combinations against the control. So you could then say (using the 'drop control' approach) that gene X is differential at day 7 over and above the day 1 effect, and (using the 'group' approach) that it shows a fold change against the control of +1.6 in the sham and +2.9 in the injury.
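A sketch of that, assuming coldata columns called condition and day (names are illustrative):

# combine condition and day into one factor, with the time-0 control as reference
coldata$group <- relevel(factor(paste(coldata$condition, coldata$day, sep = "_")),
                         ref = "control_0")

dds_grp <- DESeqDataSetFromMatrix(countData = countdata, colData = coldata,
                                  design = ~ group)
dds_grp <- DESeq(dds_grp)

# log2 fold change of injury day 7 relative to the time-0 control
res_inj7_vs_ctrl <- results(dds_grp, contrast = c("group", "injury_7", "control_0"))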

You could use the 'group' approach for direct pairwise comparisons, but you'd need to think carefully about how this was compensating for the muddling day1 changes.

If you duplicate the control group (which probably has lower within-group variance than the other conditions), you'll be biasing the pooled variance estimates towards this, which is not a good thing.  You may want to point this out to your local biostatistician ;)

 


Removed my first response because I realized it was answered in your response. I'll go ahead and drop the control_0 data from the analysis and try your first suggestion of constructing a contrast between sham and injury on each day. Thank you!

written 7 weeks ago by c.nicole

My solution for now is:

1. Remove the control 0 data

2. Combine condition and day into a single group factor --> injury_1, injury_3, injury_7, etc.

3. Use design = ~group and run DESeq() [no test specification]

4. Use contrast to compare conditions on each day, i.e. results(dds, contrast=c("group", "injury_1", "sham_1")) - a consolidated sketch of steps 1-4 follows below.
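Putting those four steps together, roughly (a sketch; object names are illustrative):

keep       <- coldata$condition != "control"            # 1. drop the time-0 control samples
countdata2 <- countdata[, keep]
coldata2   <- droplevels(coldata[keep, ])

coldata2$group <- factor(paste(coldata2$condition, coldata2$day, sep = "_"))   # 2. injury_1, injury_3, ...

ddsMatrix <- DESeqDataSetFromMatrix(countData = countdata2, colData = coldata2,
                                    design = ~ group)                          # 3. design = ~group
dds <- DESeq(ddsMatrix)                                                        # default Wald test

resday1 <- results(dds, contrast = c("group", "injury_1", "sham_1"))           # 4. per-day contrast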

You seem to caution against this approach above - could you clarify? My preliminary analysis goal is to identify the genes that are DE at each time point after injury, rather than identifying which genes are different between time points. A direct pairwise comparison would give me this information and I could ask more nuanced questions in the future. 

written 7 weeks ago by c.nicole

Yes, if your preliminary objective is as you say, then the group approach is best, in my opinion. My hesitancy was due to your "Since many genes are differentially expressed due to injury on day 1, this approach really muddles interpretation of the data" quote, as these day 1 differences will still persist. If you're tempted to filter later days' results by the day 1 differentials, then I'd implore you to at least try a 'contrast=list(c("injury_7", "sham_1"), c("injury_1", "sham_7"))' approach, which looks at fold changes above and beyond those seen at day 1 and doesn't suffer from the double exposure to statistical errors that a Venn-diagram-like approach does. But as you say, these aren't your immediate concern.
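One way that contrast could be spelled out against a ~group fit; the coefficient names below are hypothetical (they depend on which group level is the reference), so take the real ones from resultsNames(dds):

resultsNames(dds)    # e.g. "group_injury_7_vs_injury_1", "group_sham_7_vs_injury_1", ... if injury_1 is the reference

# (injury_7 - sham_7) - (injury_1 - sham_1): fold changes over and above those at day 1
res_d7_over_d1 <- results(dds,
                          contrast = list(c("group_injury_7_vs_injury_1", "group_sham_1_vs_injury_1"),
                                          c("group_sham_7_vs_injury_1")))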

Another issue in the 'group' approach is shrinkage.  I'd use DESeq(..., betaPrior=FALSE) in this instance (if it isn't already the default setting in your version of DESeq2), because you've got 45 possible pairwise comparisons between the 10 conditions, and so the moderation of the estimation of fold-changes is unlikely to be specific to the comparisons you're doing.

written 7 weeks ago by Gavin Kelly

I'm using DESeq2 v1.16.1, so I believe the default is betaPrior = FALSE and I'll need to use lfcShrink(). When I do this, the number of significant genes is MUCH higher than the results using the unshrunken LFC, which confused me, but after reading "New function lfcShrink() in DESeq2" and "A: DESeq2 lfcShrink() usage of coef vs. usage of contrast" it seems like I should "use the p-values from the unshrunk LFC and use the shrunken LFC for visualization or ranking of genes". It looks like the lfcShrink() Wald test p-value is not actually for the groups I'm comparing - is this something I am doing wrong, or is it the default? If it's the default, do I go with the suggestion of using the p-values from the unshrunk LFC [calling results() with contrast]? Thanks

With lfcshrink():

dds <- DESeq(ddsMatrix) 
res <-results(dds, alpha = 0.05)

resday10.shr <- lfcShrink(dds=dds, contrast=c("group", "injury_10", "sham_10"), res=res)
head(resday10.shr)
summary(resday10.shr)

log2 fold change (MAP): group injury_10 vs sham_10 
Wald test p-value: group injury 7 vs sham 1 
DataFrame with 6 rows and 5 columns
                baseMean log2FoldChange        stat    pvalue      padj
               <numeric>      <numeric>   <numeric> <numeric> <numeric>
FBgn0000003    0.2321838   -0.008192420  0.10395375 0.9172061        NA
FBgn0000008  685.2158454   -0.167364213  0.08979044 0.9284537 0.9668097
FBgn0000014    0.1849456   -0.007123067  0.24597089 0.8057048        NA
FBgn0000015    0.6192357   -0.034028635 -0.91245317 0.3615302        NA
FBgn0000017 2896.5305983    0.041048921 -1.77995270 0.0750837 0.2137181
FBgn0000018  181.4742661    0.024009145  0.35328189 0.7238771 0.8501377

out of 15414 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 927, 6% 
LFC < 0 (down)   : 1169, 7.6% 
outliers [1]     : 18, 0.12% 
low counts [2]   : 4154, 27% 
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

With results():

resday10 <- results(dds, contrast=c("group", "injury_10", "sham_10"))
head(resday10)
summary(resday10)

log2 fold change (MLE): group injury_10 vs sham_10 
Wald test p-value: group injury_10 vs sham_10 
DataFrame with 6 rows and 6 columns
                baseMean log2FoldChange      lfcSE       stat    pvalue      padj
               <numeric>      <numeric>  <numeric>  <numeric> <numeric> <numeric>
FBgn0000003    0.2321838    -1.09652239 6.01306430 -0.1823567 0.8553028        NA
FBgn0000008  685.2158454    -0.17680818 0.12409102 -1.4248266 0.1542073 0.8434779
FBgn0000014    0.1849456    -1.09651939 6.45179408 -0.1699557 0.8650450        NA
FBgn0000015    0.6192357    -1.06650350 2.67733931 -0.3983445 0.6903762        NA
FBgn0000017 2896.5305983     0.04195420 0.07752053  0.5412012 0.5883689 0.9769969
FBgn0000018  181.4742661     0.02602603 0.15730572  0.1654487 0.8685908 0.9926741

out of 15414 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 172, 1.1% 
LFC < 0 (down)   : 11, 0.071% 
outliers [1]     : 18, 0.12% 
low counts [2]   : 4451, 29% 
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

written 7 weeks ago by c.nicole

I've been following this thread as I have some similar data that I'm wrestling with. With regards to the lfcShrink Wald test p-value, I think you passed the wrong DESeqResults object to the res parameter of lfcShrink.

Instead of

resday10.shr <- lfcShrink(dds=dds, contrast=c("group", "injury_10", "sham_10"), res=res)

try

resday10.shr <- lfcShrink(dds=dds, contrast=c("group", "injury_10", "sham_10"), res=resday10)

I suspect that's contributing to the differences in the number of significant genes. In my dataset, the number of sig genes is the same pre- and post-lfcShrink().

written 7 weeks ago by Russ Fraser

You're absolutely right - I passed the wrong DESeqResults object. After correction, the number of sig genes is the same pre and post lfcShrink(). Thank you!

written 7 weeks ago by c.nicole