Question

DESeq2: how to identify confounding variable in the design matrix

0

Entering edit mode

lina.faller ▴ 10

@linafaller-9082

Last seen 7.6 years ago

Hi all,

I am having a hard time identifying confounding variables in my experimental design. As soon as I add the strain:time interaction, the model does not work anymore.

Any advice would be appreciated!

Edit to include more information about the experimental setup:

There are two different yeast strains. Strain 1 is grown in reactors 1-4, and Strain 2 is grown in reactors 5-8.
There are three possible treatments options (0, 2.5, and 5) and samples were taken at different time points.
Reactors 1 and 2 were control reactors for strain 1 and reactors 5 and 6 were control reactors for strain 2. Treatment 2.5 was added to reactor 3 for strain 1 and reactor 7 for strain 2. Treatment 5 was added to reactor 4 for strain 1 and reactor 8 for strain 2.

Ultimately, we want to compare just the two strains, so I started with model = formula(~ strain). This worked, so I added in time, and then treatment (model = formula(~ time + strain + treatment)) which worked too, but I think there is an interaction between strain and time. Once I added +strain:time in, it broke.

> model = formula(~ time + strain + treatment + strain:time)
> dds <- DESeqDataSetFromTximport(data, colData = metadata, design = model)
Error in checkFullRank(modelMatrix) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  Levels or combinations of levels without any samples have resulted in
  column(s) of zeros in the model matrix.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')

> metadata
         strain growth treatment reactor  time          type
3002329 strain1    exp         0       1  15.1  pretreatment
3002330 strain1    exp         0       1  15.5    5 min post
3002331 strain1    exp         0       1  15.7   15 min post
3002332 strain1    exp         0       1 15.98   30 min post
3002333 strain1    exp         0       1  17.5      2hr post
3002334 strain1    exp         0       1  22.7 7hr10min post
3002335 strain1   slow         0       1   112  pretreatment
3002336 strain1   slow         0       1 135.1  pretreatment
3002337 strain1    exp         0       2  15.1  pretreatment
3002338 strain1    exp         0       2  15.5    5 min post
3002339 strain1    exp         0       2  15.7   15 min post
3002340 strain1    exp         0       2 15.98   30 min post
3002341 strain1    exp         0       2  17.5      2hr post
3002342 strain1   slow         0       2   112  pretreatment
3002343 strain1   slow         0       2 135.1  pretreatment
3002344 strain1   slow         0       2 135.3    5 min post
3002345 strain1    exp         0       3  15.1  pretreatment
3002346 strain1    exp       2.5       3  15.5    5 min post
3002347 strain1    exp       2.5       3  15.7   15 min post
3002348 strain1    exp       2.5       3 15.98   30 min post
3002349 strain1    exp       2.5       3  17.5      2hr post
3002350 strain1    exp       2.5       3  22.7 7hr10min post
3002351 strain1    exp       2.5       3   112     late time
3002352 strain1    exp       2.5       3 160.4     late time
3002353 strain1    exp         0       4  15.1  pretreatment
3002354 strain1    exp         5       4  15.5    5 min post
3002355 strain1    exp         5       4  15.7   15 min post
3002356 strain1    exp         5       4 15.98   30 min post
3002357 strain1    exp         5       4  17.5      2hr post
3002358 strain1    exp         5       4  22.7 7hr10min post
3002359 strain1    exp         5       4   112     late time
3002360 strain1    exp         5       4 160.4     late time
3002361 strain2    exp         0       5  15.1  pretreatment
3002362 strain2    exp         0       5  15.5    5 min post
3002363 strain2    exp         0       5  15.7   15 min post
3002364 strain2    exp         0       5 15.98   30 min post
3002365 strain2    exp         0       5  17.5      2hr post
3002366 strain2    exp         0       5  22.7 7hr10min post
3002367 strain2   slow         0       5   112  pretreatment
3002368 strain2   slow         0       5 135.1  pretreatment
3002369 strain2    exp         0       6  15.1  pretreatment
3002370 strain2    exp         0       6  15.5    5 min post
3002371 strain2    exp         0       6  15.7   15 min post
3002372 strain2    exp         0       6 15.98   30 min post
3002373 strain2    exp         0       6  17.5      2hr post
3002374 strain2    exp         0       6  22.7 7hr10min post
3002375 strain2   slow         0       6   112  pretreatment
3002376 strain2   slow         0       6 135.1  pretreatment
3002377 strain2    exp         0       7  15.1  pretreatment
3002378 strain2    exp       2.5       7  15.5    5 min post
3002379 strain2    exp       2.5       7  15.7   15 min post
3002380 strain2    exp       2.5       7 15.98   30 min post
3002381 strain2    exp       2.5       7  17.5      2hr post
3002382 strain2    exp       2.5       7  22.7 7hr10min post
3002383 strain2    exp       2.5       7   112     late time
3002384 strain2    exp       2.5       7 160.4     late time
3002385 strain2    exp         0       8  15.1  pretreatment
3002386 strain2    exp         5       8  15.5    5 min post
3002387 strain2    exp         5       8  15.7   15 min post
3002388 strain2    exp         5       8 15.98   30 min post
3002389 strain2    exp         5       8  17.5      2hr post
3002390 strain2    exp         5       8  22.7 7hr10min post
3002391 strain2    exp         5       8   112     late time
3002392 strain2    exp         5       8 160.4     late time

deseq2 confounders design matrix rnaseq • 2.7k views

ADD COMMENT • link 7.6 years ago lina.faller ▴ 10

score 0 · Answer 1 · 2017-05-01

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

Can you explain a bit about the irregular time grid? There may be better ways to model this than treating time as a factor here, because of the irregular grid. (This is also causing the problem, you have t=160.4 measurements for the treated series but not for untreated.)

Also, can you explain "reactor"? What is common among the samples with the same reactor?

ADD COMMENT • link 7.6 years ago Michael Love 43k

0

Entering edit mode

Thanks for the comment! I added more information about the experimental setup above. I wasn't involved in the experimental design but I will ask the guy who designed it for more information about the irregular time points. I believe it might have to do with the growth phase (either exponential or slow growing). So if the strain was in the slow growing phase, I think they started collecting samples at extra time points.

ADD REPLY • link 7.6 years ago lina.faller ▴ 10

0

Entering edit mode

I don't understand what is represented in the "time" column, which is critical to doing the modeling right. What is the correspondence between the values in these two columns below? Why does 15.1 correspond to pretreatment, etc.

 15.1  pretreatment
 15.5    5 min post
 15.7   15 min post
15.98   30 min post
 17.5      2hr post
 22.7 7hr10min post
  112  pretreatment
135.1  pretreatment
 15.1  pretreatment

...

ADD REPLY • link 7.6 years ago Michael Love 43k

0

Entering edit mode

The time column is in hours (so 15.1 hrs, 15.5 hrs, etc).

I have been ignoring the "type" column for now because it seemed a bit arbitrary to me. As you can see, sometimes "pretreatment" seems to correspond to 15.1 hrs and sometimes it corresponds with 112 hours.

For the purpose of the design, do you think it's a good approach to ignore it for now?

ADD REPLY • link 7.6 years ago lina.faller ▴ 10

0

Entering edit mode

Can you check with the people who performed the experiment why the times don't correspond to the information in type. The 112,135,etc look like they are mislabeled in type, but who knows what's correct and what's wrong without checking further. It's best to clear this up before you dive into trying to do the modeling.

ADD REPLY • link 7.6 years ago Michael Love 43k

0

Entering edit mode

Thanks for the feedback! I went back to the experimental guys and they said I should ignore the "type" information for now because it has been assigned in a somewhat arbitrary way. Instead, the time variable is more informative (because someone actually looked at a clock to take this measurement).

I was able to run the model by separating the data into two datasets by strain and then simplifying it in the following way:

model = formula(~ growth + vanillin + time)

I also ended up treating time as a continuous variable (not a factor). Is this an acceptable approach?

I am hoping that by removing the strain variable the data will be easier to interpret, but ultimately I would like to understand how to incorporate strain back into the model.

ADD REPLY • link 7.6 years ago lina.faller ▴ 10

0

Entering edit mode

This is quite a complex experimental design, and I would recommend that you partner with a statistician to do the analysis. One reason for this is that you will have to make some decisions about how to model expression over time, because of the irregular grid. With DESeq2, you can incorporate any kind of R functions of numeric variables, e.g. splines, within the design formula. This is how I would approach it, but I or anyone else would have to sit down and spend some time exploring the data to figure out what family of functions are reasonable, and I unfortunately can't offer more detailed advice here. In addition it's not trivial how to account for the random effect of reactor which is nested within the combination of strain x treatment. There are some approaches to this but it's something someone would have to sit down for a while and work out, and I don't have enough time to answer here (there is some discussion of this in the DESeq2 vignette section on designs that produce reduced rank model matrices).

Some other things to work on:

The treatment group in the pre-treatment time point should still be given a treatment value. E.g. the first sample in reactor 3 should have treatment = 2.5

I don't have any idea on how to incorporate growth into a meaningful model, as it is a post-hoc designation of some the samples within a time series, rather than part of the fixed experimental design.

ADD REPLY • link 7.6 years ago Michael Love 43k