Question

With DESeq2 "Not full rank" Error with design ~ line + time + condition

2

Entering edit mode

c.lieftink@nki.nl ▴ 40

@clieftinknkinl-3029

Last seen 9.2 years ago

Netherlands

Hi all,

I would like to use DESeq2 for the analysis of a pooled shRNA screen to identify mutation specific lethals. The design of the experiment is a number of mutant cell lines and a number of wildtype cell lines. For every cell line there is a t0 measurement and t1 measurement. Interesting hairpins/genes are those that give significant more killing in the mutant cell lines, compare to the wt cell lines. Because each cell line has it's own t0, the data need to be paired wise analysed.

I tried to do it with DESeq2 in the following design:
formula(~ line + time + condition)

A simple example:
d<-data.frame(w_l1_t0=c(4,8),w_l2_t0=c(5,9),w_l1_t1=c(4,8),w_l2_t1=c(5,9),m_l3_t0=c(4,8),m_l4_t0=c(5,9),m_l3_t1=c(1,2), m_l4_t1=c(1,2))   
row.names(d) <- c("hp1","hp2")

countData <- as.matrix(d)   
colData = data.frame(condition=c("w","w","w","w","m","m","m","m"),
        time=c("0","0","1","1","0","0","1","1"),
        line=c("1","2","1","2","3","4","3","4"))

dds <- DESeqDataSetFromMatrix(countData = countData,
    colData = colData,
    design = ~ line + time + condition)

However I then get the error:
Error in DESeqDataSet(se, design = design, ignoreRank) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed

I assume this is because cell line and condition are linked. Is there a way to do the above analysis in DESeq2? I don't know the exact meaning of the parameter ignoreRank, but would setting it to TRUE be an option?

Thank you in advance for your reply!

Best regards,

Cor Lieftink

P.s. I thought it was solved, after using an extended dataset with 15 mt and 15 wt cell lines the error was gone. However I had there numbered the lines within the subset of wildtype and mutant, which should be (I think) continuous over the two sets. The error than re-emerge.

deseq deseq2 • 26k views

ADD COMMENT • link updated 8 months ago by Erika • 0 • written 9.2 years ago by c.lieftink@nki.nl ▴ 40

score 2 · Answer 1 · 2015-01-28

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 13 hours ago

United States

hi Cor,

You're right, the issue is that the lines are nested within condition. Here's a similar post which I gave a solution for: A: DESeq2 paired multifactor test

essentially, you should do the same steps of creating a vector "nested.line", which just distinguishes the lines within a condition. So for the above, it would be c("1","2","1","2","1","2","1","2").

Then, I think you want to use a design of ~ condition + time + condition:nested.line + condition:time. This will control for the line specific effects. And it sounds like you want to test the condition:time effect. This tests if the mutant cell lines experience a change at time 1 compared to time 0 which is different than the one for wildtype. you can use the following call:

dds = DESeq(dds, modelMatrixType="standard")
results(dds, name="conditionm.time1")

(the modelMatrixType part is only necessary so that the interaction can be tested by a single coefficient, even if you add more cell lines.)

ADD COMMENT • link 9.2 years ago Michael Love 41k

0

Entering edit mode

Hello Michael,

Thank you for your quick response!

Best regards,

Cor

ADD REPLY • link 9.2 years ago c.lieftink@nki.nl ▴ 40

0

Entering edit mode

Hi Michael,

Sorry to bother you again. Your solution works fine when I have equal number of mutant and wildtype cells. However if not, I get again the error "the model matrix is not full rank". Do you know of any alternative design to deal with that as well?

Best regards,

Cor

ADD REPLY • link 9.2 years ago c.lieftink@nki.nl ▴ 40

0

Entering edit mode

hi Cor,

Can you post the as.data.frame(colData(dds)), including the nested.line column, and the design you are using, for an example when you are not full rank?

ADD REPLY • link 9.2 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

It runs fine with 3 replicates in each status condition, however adding a fourth in only the 0 (=wildtype) condition gives the error not full rank.

> as.data.frame(colData(dds))
               status rep time
m0.EFO21.t0         0   1    0
m0.EFO21.t1         0   1    1
m0.JHESOAD1.t0      0   2    0
m0.JHESOAD1.t1      0   2    1
m0.LN215.t0         0   3    0
m0.LN215.t1         0   3    1
m0.LN382.t0         0   4    0
m0.LN382.t1         0   4    1
m1.A2780.t0         1   1    0
m1.A2780.t1         1   1    1
m1.CAL51.t0         1   2    0
m1.CAL51.t1         1   2    1
m1.EFO27.t0         1   3    0
m1.EFO27.t1         1   3    1

Design used:

design(dds) <- formula(~ status + time + status:rep + status:time)

Best regards,

Cor

ADD REPLY • link 9.2 years ago c.lieftink@nki.nl ▴ 40

2

Entering edit mode

I see now what is causing the problem, which is something i've been working to make things a bit easier in the development branch of DESeq2. Basically, R's model.matrix() function is great when it helps the user easily express experiment designs, but annoying when it breaks down. This is why i've added support for users to tweak the model matrix and give this to the DESeq() function.

What's happening is the extra level of rep=4 creates a column of all zeros in the model matrix, because model.matrix() function tries to make the interaction of rep=4 and status=1.

There are two solutions:

1) use DESeq2 version >= 1.7, which is in the development branch, so would require getting the development version of R/Bioconductor. Here, I've allowed the user to supply custom a model matrix to DESeq(). You would form the model.matrix yourself:

mm = model.matrix(~ status + time + status:rep + status:time, colData(dds))

And then remove the column which has all zeros, and provide this modified 'mm' to DESeq():

dds = DESeq(dds, full=mm, betaPrior=FALSE)

2) or add variables accounting for the cell line effects to the colData(dds) manually. For the above colData, you would need to add some variables like so:

dds$status0rep2 <- as.numeric(dds$status == 0 & dds$rep == 2)

You'd need to do this for status0rep2, status0rep3, status0rep4, status1rep2, status1rep3. And then add all of these to the design: ~status + time + status0rep2 + ... + status:time

You can ignore a message about the design including numeric variables, if that comes up.

ADD REPLY • link 9.2 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thanks again for your quick response! I downloaded the development version, and was now able to process the data also in case of different number of replicates for the two status conditions!

Best regards,

Cor

ADD REPLY • link 9.2 years ago c.lieftink@nki.nl ▴ 40

0

Entering edit mode

Hi Michael,

Yesterday I ran an analysis of 15 mutant cell lines against 226 wildtype/unknown cell lines for 1200 hairpins. It took 8 hours on my home computer, but I have the results! :-) Thanks again.

Cor

ADD REPLY • link 9.2 years ago c.lieftink@nki.nl ▴ 40

0

Entering edit mode

If possible could you send the dataset to me (maintainer("DESeq2")). That's slower than I've seen before.

For a 2 group comparison with simulated data, 250 samples takes 40 seconds on my laptop:

> dds = makeExampleDESeqDataSet(m=250,n=1200)
> system.time({ dds = DESeq(dds) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
user system elapsed
39.901 0.036 39.903

ADD REPLY • link 9.2 years ago Michael Love 41k

0

Entering edit mode

hi Cor,

I'm continuing our conversation on the site, but won't mention any specifics.

I was confused about why it would take so long, but I should have paid attention in your comment that you want to account for 100s of cell lines in the model with terms for each. With 100s of cell lines, the model matrix is growing to have many hundreds of columns. And the GLM solution requires iterations to converge to the solution, so this involves large matrix multiplications.

I would actually recommend using voom and limma if you have 100s of columns in the model matrix. This would speed things up a lot, because the solution in the linear case can be easily calculated, even when the model matrix has 100s of columns . You can also directly use normalized CPM (counts per million) directly with voom and limma, without having to try to convert back to raw counts.

ADD REPLY • link 9.2 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thanks a lot!

Best regards,

Cor

ADD REPLY • link 9.2 years ago c.lieftink@nki.nl ▴ 40