Question: With DESeq2 "Not full rank" Error with design ~ line + time + condition
0
4.7 years ago by
Netherlands
c.lieftink@nki.nl20 wrote:

Hi all,

I would like to use DESeq2 for the analysis of a pooled shRNA screen to identify mutation specific lethals. The design of the experiment is a number of mutant cell lines and a number of wildtype cell lines. For every cell line there is a t0 measurement and t1 measurement. Interesting hairpins/genes are those that give significant more killing in the mutant cell lines, compare to the wt cell lines. Because each cell line has it's own t0, the data need to be paired wise analysed.

I tried to do it with DESeq2 in the following design:
formula(~ line + time + condition)

A simple example:
d<-data.frame(w_l1_t0=c(4,8),w_l2_t0=c(5,9),w_l1_t1=c(4,8),w_l2_t1=c(5,9),m_l3_t0=c(4,8),m_l4_t0=c(5,9),m_l3_t1=c(1,2), m_l4_t1=c(1,2))
row.names(d) <- c("hp1","hp2")

countData <- as.matrix(d)
colData = data.frame(condition=c("w","w","w","w","m","m","m","m"),
time=c("0","0","1","1","0","0","1","1"),
line=c("1","2","1","2","3","4","3","4"))

dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ line + time + condition)

However I then get the error:
Error in DESeqDataSet(se, design = design, ignoreRank) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed

I assume this is because cell line and condition are linked. Is there a way to do the above analysis in DESeq2? I don't know the exact meaning of the parameter ignoreRank, but would setting it to TRUE be an option?

Best regards,

Cor Lieftink

P.s.  I thought it was solved, after using an extended dataset with 15 mt and 15 wt cell lines the error was gone. However I had there numbered the lines within the subset of wildtype and mutant, which should be (I think) continuous over the two sets. The error than re-emerge.

deseq deseq2 • 10.0k views
modified 4.6 years ago by Michael Love25k • written 4.7 years ago by c.lieftink@nki.nl20
Answer: With DESeq2 "Not full rank" Error with design ~ line + time + condition
1
4.6 years ago by
Michael Love25k
United States
Michael Love25k wrote:

hi Cor,

You're right, the issue is that the lines are nested within condition. Here's a similar post which I gave a solution for: A: DESeq2 paired multifactor test

essentially, you should do the same steps of creating a vector "nested.line", which just distinguishes the lines within a condition. So for the above, it would be c("1","2","1","2","1","2","1","2").

Then, I think you want to use a design of ~ condition + time + condition:nested.line + condition:time. This will control for the line specific effects. And it sounds like you want to test the condition:time effect. This tests if the mutant cell lines experience a change at time 1 compared to time 0 which is different than the one for wildtype. you can use the following call:

dds = DESeq(dds, modelMatrixType="standard")
results(dds, name="conditionm.time1")

(the modelMatrixType part is only necessary so that the interaction can be tested by a single coefficient, even if you add more cell lines.)

Hello Michael,

Thank you for your quick response!

Best regards,

Cor

Hi Michael,

Sorry to bother you again. Your solution works fine when I have equal number of mutant and wildtype cells. However if not, I get  again the error "the model matrix is not full rank". Do you know of any alternative design to deal with that as well?

Best regards,

Cor

hi Cor,

Can you post the as.data.frame(colData(dds)), including the nested.line column, and the design you are using, for an example when you are not full rank?

Hi Michael,

It runs fine with 3 replicates in each status condition, however adding a fourth in only the 0 (=wildtype) condition gives the error not full rank.

> as.data.frame(colData(dds))
status rep time
m0.EFO21.t0         0   1    0
m0.EFO21.t1         0   1    1
m0.LN215.t0         0   3    0
m0.LN215.t1         0   3    1
m0.LN382.t0         0   4    0
m0.LN382.t1         0   4    1
m1.A2780.t0         1   1    0
m1.A2780.t1         1   1    1
m1.CAL51.t0         1   2    0
m1.CAL51.t1         1   2    1
m1.EFO27.t0         1   3    0
m1.EFO27.t1         1   3    1

Design used:

design(dds) <- formula(~ status + time + status:rep + status:time)

Best regards,

Cor

I see now what is causing the problem, which is something i've been working to make things a bit easier in the development branch of DESeq2. Basically, R's model.matrix() function is great when it helps the user easily express experiment designs, but annoying when it breaks down. This is why i've added support for users to tweak the model matrix and give this to the DESeq() function.

What's happening is the extra level of rep=4 creates a column of all zeros in the model matrix, because model.matrix() function tries to make the interaction of rep=4 and status=1.

There are two solutions:

1) use DESeq2 version >= 1.7, which is in the development branch, so would require getting the development version of R/Bioconductor. Here, I've allowed the user to supply custom a model matrix to DESeq(). You would form the model.matrix yourself:

mm = model.matrix(~ status + time + status:rep + status:time, colData(dds))

And then remove the column which has all zeros, and provide this modified 'mm' to DESeq():

dds = DESeq(dds, full=mm, betaPrior=FALSE)

2) or add variables accounting for the cell line effects to the colData(dds) manually. For the above colData, you would need to add some variables like so:

dds$status0rep2 <- as.numeric(dds$status == 0 & dds\$rep == 2)

You'd need to do this for status0rep2, status0rep3, status0rep4, status1rep2, status1rep3. And then add all of these to the design: ~status + time + status0rep2 + ... + status:time

Hi Michael,

Thanks again for your quick response! I downloaded the development version, and was now able to process the data also in case of different number of replicates for the two status conditions!

Best regards,

Cor

Hi Michael,

Yesterday I ran an analysis of 15 mutant cell lines against 226 wildtype/unknown cell lines for 1200 hairpins. It took 8 hours on my home computer, but I have the results! :-) Thanks again.

Cor

If possible could you send the dataset to me (maintainer("DESeq2")). That's slower than I've seen before.

For a 2 group comparison with simulated data, 250 samples takes 40 seconds on my laptop:

> dds = makeExampleDESeqDataSet(m=250,n=1200)
> system.time({ dds = DESeq(dds) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
user  system elapsed
39.901   0.036  39.903

hi Cor,

I'm continuing our conversation on the site, but won't mention any specifics.

I was confused about why it would take so long, but I should have paid attention in your comment that you want to account for 100s of cell lines in the model with terms for each. With 100s of cell lines, the model matrix is growing to have many hundreds of columns. And the GLM solution requires iterations to converge to the solution, so this involves large matrix multiplications.

I would actually recommend using voom and limma if you have 100s of columns in the model matrix. This would speed things up a lot, because the solution in the linear case can be easily calculated, even when the model matrix has 100s of columns . You can also directly use normalized CPM (counts per million) directly with voom and limma, without having to try to convert back to raw counts.

Hi Michael,

Thanks a lot!

Best regards,

Cor