Hi all,
I am having a hard time identifying confounding variables in my experimental design. As soon as I add the strain:time interaction, the model does not work anymore.
Any advice would be appreciated!
Edit to include more information about the experimental setup:
- There are two different yeast strains. Strain 1 is grown in reactors 1-4, and Strain 2 is grown in reactors 5-8.
- There are three possible treatments options (0, 2.5, and 5) and samples were taken at different time points.
- Reactors 1 and 2 were control reactors for strain 1 and reactors 5 and 6 were control reactors for strain 2. Treatment 2.5 was added to reactor 3 for strain 1 and reactor 7 for strain 2. Treatment 5 was added to reactor 4 for strain 1 and reactor 8 for strain 2.
Ultimately, we want to compare just the two strains, so I started with model = formula(~ strain). This worked, so I added in time, and then treatment (model = formula(~ time + strain + treatment)) which worked too, but I think there is an interaction between strain and time. Once I added +strain:time in, it broke.
> model = formula(~ time + strain + treatment + strain:time) > dds <- DESeqDataSetFromTximport(data, colData = metadata, design = model) Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. Levels or combinations of levels without any samples have resulted in column(s) of zeros in the model matrix. Please read the vignette section 'Model matrix not full rank': vignette('DESeq2')
> metadata strain growth treatment reactor time type 3002329 strain1 exp 0 1 15.1 pretreatment 3002330 strain1 exp 0 1 15.5 5 min post 3002331 strain1 exp 0 1 15.7 15 min post 3002332 strain1 exp 0 1 15.98 30 min post 3002333 strain1 exp 0 1 17.5 2hr post 3002334 strain1 exp 0 1 22.7 7hr10min post 3002335 strain1 slow 0 1 112 pretreatment 3002336 strain1 slow 0 1 135.1 pretreatment 3002337 strain1 exp 0 2 15.1 pretreatment 3002338 strain1 exp 0 2 15.5 5 min post 3002339 strain1 exp 0 2 15.7 15 min post 3002340 strain1 exp 0 2 15.98 30 min post 3002341 strain1 exp 0 2 17.5 2hr post 3002342 strain1 slow 0 2 112 pretreatment 3002343 strain1 slow 0 2 135.1 pretreatment 3002344 strain1 slow 0 2 135.3 5 min post 3002345 strain1 exp 0 3 15.1 pretreatment 3002346 strain1 exp 2.5 3 15.5 5 min post 3002347 strain1 exp 2.5 3 15.7 15 min post 3002348 strain1 exp 2.5 3 15.98 30 min post 3002349 strain1 exp 2.5 3 17.5 2hr post 3002350 strain1 exp 2.5 3 22.7 7hr10min post 3002351 strain1 exp 2.5 3 112 late time 3002352 strain1 exp 2.5 3 160.4 late time 3002353 strain1 exp 0 4 15.1 pretreatment 3002354 strain1 exp 5 4 15.5 5 min post 3002355 strain1 exp 5 4 15.7 15 min post 3002356 strain1 exp 5 4 15.98 30 min post 3002357 strain1 exp 5 4 17.5 2hr post 3002358 strain1 exp 5 4 22.7 7hr10min post 3002359 strain1 exp 5 4 112 late time 3002360 strain1 exp 5 4 160.4 late time 3002361 strain2 exp 0 5 15.1 pretreatment 3002362 strain2 exp 0 5 15.5 5 min post 3002363 strain2 exp 0 5 15.7 15 min post 3002364 strain2 exp 0 5 15.98 30 min post 3002365 strain2 exp 0 5 17.5 2hr post 3002366 strain2 exp 0 5 22.7 7hr10min post 3002367 strain2 slow 0 5 112 pretreatment 3002368 strain2 slow 0 5 135.1 pretreatment 3002369 strain2 exp 0 6 15.1 pretreatment 3002370 strain2 exp 0 6 15.5 5 min post 3002371 strain2 exp 0 6 15.7 15 min post 3002372 strain2 exp 0 6 15.98 30 min post 3002373 strain2 exp 0 6 17.5 2hr post 3002374 strain2 exp 0 6 22.7 7hr10min post 3002375 strain2 slow 0 6 112 pretreatment 3002376 strain2 slow 0 6 135.1 pretreatment 3002377 strain2 exp 0 7 15.1 pretreatment 3002378 strain2 exp 2.5 7 15.5 5 min post 3002379 strain2 exp 2.5 7 15.7 15 min post 3002380 strain2 exp 2.5 7 15.98 30 min post 3002381 strain2 exp 2.5 7 17.5 2hr post 3002382 strain2 exp 2.5 7 22.7 7hr10min post 3002383 strain2 exp 2.5 7 112 late time 3002384 strain2 exp 2.5 7 160.4 late time 3002385 strain2 exp 0 8 15.1 pretreatment 3002386 strain2 exp 5 8 15.5 5 min post 3002387 strain2 exp 5 8 15.7 15 min post 3002388 strain2 exp 5 8 15.98 30 min post 3002389 strain2 exp 5 8 17.5 2hr post 3002390 strain2 exp 5 8 22.7 7hr10min post 3002391 strain2 exp 5 8 112 late time 3002392 strain2 exp 5 8 160.4 late time
Thanks for the comment! I added more information about the experimental setup above. I wasn't involved in the experimental design but I will ask the guy who designed it for more information about the irregular time points. I believe it might have to do with the growth phase (either exponential or slow growing). So if the strain was in the slow growing phase, I think they started collecting samples at extra time points.
I don't understand what is represented in the "time" column, which is critical to doing the modeling right. What is the correspondence between the values in these two columns below? Why does 15.1 correspond to pretreatment, etc.
...
The time column is in hours (so 15.1 hrs, 15.5 hrs, etc).
I have been ignoring the "type" column for now because it seemed a bit arbitrary to me. As you can see, sometimes "pretreatment" seems to correspond to 15.1 hrs and sometimes it corresponds with 112 hours.
For the purpose of the design, do you think it's a good approach to ignore it for now?
Can you check with the people who performed the experiment why the times don't correspond to the information in type. The 112,135,etc look like they are mislabeled in type, but who knows what's correct and what's wrong without checking further. It's best to clear this up before you dive into trying to do the modeling.
Thanks for the feedback! I went back to the experimental guys and they said I should ignore the "type" information for now because it has been assigned in a somewhat arbitrary way. Instead, the time variable is more informative (because someone actually looked at a clock to take this measurement).
I was able to run the model by separating the data into two datasets by strain and then simplifying it in the following way:
I also ended up treating time as a continuous variable (not a factor). Is this an acceptable approach?
I am hoping that by removing the strain variable the data will be easier to interpret, but ultimately I would like to understand how to incorporate strain back into the model.
This is quite a complex experimental design, and I would recommend that you partner with a statistician to do the analysis. One reason for this is that you will have to make some decisions about how to model expression over time, because of the irregular grid. With DESeq2, you can incorporate any kind of R functions of numeric variables, e.g. splines, within the design formula. This is how I would approach it, but I or anyone else would have to sit down and spend some time exploring the data to figure out what family of functions are reasonable, and I unfortunately can't offer more detailed advice here. In addition it's not trivial how to account for the random effect of reactor which is nested within the combination of strain x treatment. There are some approaches to this but it's something someone would have to sit down for a while and work out, and I don't have enough time to answer here (there is some discussion of this in the DESeq2 vignette section on designs that produce reduced rank model matrices).
Some other things to work on:
The treatment group in the pre-treatment time point should still be given a treatment value. E.g. the first sample in reactor 3 should have treatment = 2.5
I don't have any idea on how to incorporate growth into a meaningful model, as it is a post-hoc designation of some the samples within a time series, rather than part of the fixed experimental design.