Search
Question: DESeq2, error: inv(): matrix appears to be singular
0
4.0 years ago by
Sweden
sandramg820 wrote:

Hi,
I am finding an error with estimateDispersions() in DESeq2.

In my experimental design I have 3 factors: Treatment, Time and Depth

I start with a phyloseq object (PHYLUM) that I convert to DESeq (Phylum)
easily, including two of the factors in the design.

> Phylum=phyloseq_to_deseq2(PHYLUM,~ Time + Treatment)
> Phylum
class: DESeqDataSet
dim: 57 30
exptData(0):
assays(1): counts
rownames(57): OTU1 OTU2 ... OTU56 OTU57
colnames(30): T2_SF5C T25_SF1B ... T2_DP1A T2_DP1C
colData names(3): Treatment Time Depth

> colData(Phylum)
DataFrame with 30 rows and 3 columns
Treatment     Time    Depth
<factor> <factor> <factor>
T2_SF5C      DOMNP      02d      25m
T25_SF1B         C      25d      25m
T25_SF5B     DOMNP      25d      25m
T25_SF4A       DOM      25d      25m
T2_SF4A        DOM      02d      25m
...            ...      ...      ...
T25_DP5A     DOMNP      25d     125m
T25_DP4B       DOM      25d     125m
T25_DP5B     DOMNP      25d     125m
T2_DP1A          C      02d     125m
T2_DP1C          C      02d     125m

All contrasts work fine with this design (~ Time + Treatment).

The problem comes when I try to perform interactions. I find and error when estimating dispersions (see below). Do you know how can I solve it? Is it related to the nature of my factors?
I have seen a previous post in which the same error was discussed but in that case the problem was the opposite: the error occurred with the normal contrast and was solved in the interaction. I am using DESeq 1.4.5.

Thanks so much!!!!

> design(Phylum) <- ~ Time + Treatment + Time:Treatment
> Phylum<-estimateSizeFactors(Phylum)
> Phylum<-estimateDispersions(Phylum,fitType="local")
gene-wise dispersion estimates

error: inv(): matrix appears to be singular

Error: inv(): matrix appears to be singular

modified 4.0 years ago • written 4.0 years ago by sandramg820
0
4.0 years ago by
Michael Love19k
United States
Michael Love19k wrote:

Can you email me a dataset which reproduces the problem? e.g. save(Phylum, file="phylum.rda")

I'll see if it still gives an error using the development branch, which will be released shortly.

My email is maintainer("DESeq2")

Thanks for reporting this. The issue is that your model cannot be fit as specified (with the interaction term) because you only have control samples at the first time point. The reason you get this not-so-nice error message is that DESeq2 runs a test during dds object construction on the design, and then prints a nice message explaining this problem. I've now added another test for this problem at the estimateDispersions() step with a nice message (rather than "matrix appears to be singular"), in case the design has changed since object construction.

Here's how you can explore your model and some explanation on why it cannot be fit as specified:

# first, rearrange the samples so that we can easily see the groups
Phylum <- Phylum[ , order(Phylum$Time, Phylum$Treatment) ]
as.data.frame(colData(Phylum))
# this is the model matrix which needs to be fit within estimateDispersions
mm <- model.matrix(~ Time + Treatment + Time:Treatment, colData(Phylum))
# we can view it in the plot window
image(t(mm[nrow(mm):1,]),col=c("white","black"))
# note that these three columns are linearly dependent
all(mm[,"TreatmentDOM"] == mm[,"Time02d:TreatmentDOM"] + mm[,"Time25d:TreatmentDOM"])
[1] TRUE


The sum of the interaction terms for treatment DOM with time 2d and 25d are equal to the main effect for treatment DOM. This would not be the case if you had samples at time 0d for treatment DOM. Note that these interaction terms can be thought of as "how is the effect of DOM different at time 2d than at time 0d", so it makes sense that the model cannot be fit with these interaction terms, because the experiment didn't measure the effect of DOM at time 0d. A solution is to remove the interaction terms from the design.

Thanks Mike,

First, thanks so much for your help. This website is really good for every DESEQ user.

Regarding my problem: Now I understand.

Following your explanation, if I take out day 0, my dataset would allow to make comparisons between C, DOM and DOMNP in 02 and 25d right?

I do not have samples at time 0d for the two addition treatments because they are supposed (by definition of our experimental design) to be similar to the control (at least that is our assumption). Maybe what I can do is to "create" one sample DOM_00d and another DOMNP_00d and copy the data I have for C_00d. These samples will not be real but will represent our ecological  assumptions. in our statistical test.

If I remove the interaction term as you suggest I will not be able to separate the general effect of treatments between the different sampling points (the effect of treatment is supposed to change at 25d compared to the effect at 02d), and that is important for us.

Also, I would like to perform a "time-series" analysis (similar to a RMANOVA test) so I think I do need to keep the interaction on my design...

What do you think?

Easiest is to just remove the two time 0 samples. You don't get much from them, only one degree of freedom for estimating dispersion, whereas you already have many residual degrees of freedom. And these two samples don't help you investigate the treatment effect or how it changes from 2d to 25d.

(The problem with using a design Time + Time:Treatment is that R's base function model.matrix -- which we make use of inside the DESeq2 functions -- will add a column of 0's to the model matrix. I can try to work on this but not in this release cycle.)

Thanks Mike,

I am going to try and figure out how to perform the RMANOVA, any advises?

For genes which vary over time due to treatment, use the likelihood ratio test (see vignette):

design(dds) = ~ Time + Treatment + Time:Treatment
​dds = DESeq(dds, test="LRT", reduced= ~ Time + Treatment)
results(dds)

Hi Mike, thanks so much. I have now tried 2 different approaches: the Time-series approach (OPCION A) and the Time+Treatment interaction approach (OPTION B). I am not sure I understand completely the difference in the results I get. Maybe I do not need a Time-series design. Hope you can help me.

My main questions for this dataset are:
Which is the effect of the different treatments in general (all time points included)?
Which is the effect of the different treatments at each different time?
Which is the difference between DOM and DOMNP in general and in each time point?
Which is the effect of time?

This was my dataset:

Phylum=phyloseq_to_deseq2(PHYLUM,~ Time + Treatment + Time:Treatment)

OPCION A: TME-SERIES DESIGN

> Phylum= DESeq(Phylum, fitType="local", test="LRT", reduced= ~ Time + Treatment)

> results(Phylum)
log2 fold change: Time25d.TreatmentDOMNP
LRT p-value: '~ Time + Treatment + Time:Treatment' vs '~ Time + Treatment'
DataFrame with 231 rows and 6 columns
baseMean log2FoldChange     lfcSE      stat    pvalue      padj
<numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
OTU1      6.671043       1.284642  2.088933 1.6123348 0.8065735 0.9994033
OTU2      2.203649       1.425796  1.793131 2.2630540 0.6875035 0.9994033

Does the results() above test all times after Time 0 at once (that is the general effect of treatment during the complete incubation)?
Why does it say just time 25d?
How can I change it to test for the other level of Treatment (DOM)?

And to test DOM vs DOMNP?

Now I call resultsNames()
>resultsNames(Phylum)
#[1] "Intercept"              "Time_02d_vs_00d"        "Time_25d_vs_00d"        "Treatment_DOM_vs_C"     "Treatment_DOMNP_vs_C"
#[6] "Time02d.TreatmentDOM"   "Time25d.TreatmentDOM"   "Time02d.TreatmentDOMNP" "Time25d.TreatmentDOMNP"

- On one hand I have the main results:

The effects of the different times over the initial time, e.j:
results(Phylum, name="Time_02d_vs_00d")

and the effect for treatments over Control at time 0, ej:
results(Phylum, name="Treatment_DOM_vs_C")

Is this correct?

- On the other hand the following will return a results table with Wald tests of the additional treatment effect in time 25 (additional beyond the treatment effect at time 0)

Phylum <- nbinomWaldTest(Phylum, betaPrior=FALSE)
results(Phylum, name="Time25d.TreatmentDOM")

Is this similar to the first LRT results above, except now we are asking for a different effect of Treatment in  a specific time, not in all times?

The problem is that it does not give me an option to test the difference between DOM and DOMNP, right?

I thought that a time series approach would be appropriate since they are the same individuals sampled at different times...but I am starting to think that making a interaction design would be more appropriate to get all the specific interactions between the different treatments

OPTION B: INTERACTION DESIGN

Phylum = DESeq(Phylum, fitType="local")

results(Phylum)

log2 fold change (MAP): Time25d.TreatmentDOMNP
Wald test p-value: Time25d.TreatmentDOMNP
DataFrame with 231 rows and 6 columns
baseMean log2FoldChange     lfcSE        stat    pvalue      padj
<numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
OTU1      6.671043     0.26544913 0.2541302  1.04453979 0.2962357 0.9216456
OTU2      2.203649     0.25048700 0.2572755  0.97361374 0.3302484 0.9418194

This test above tests all times after Time 0 at once (that is the general effect of treatment during the complete incubation)?
Why does it say just time 25d?
How can I change it to test for the other level of Treatment (DOM)?

If I call resultsNames() I can make much more combinations:

resultsNames(Phylum)
#[1] "Intercept"              "Time00d"                "Time02d"                "Time25d"                "TreatmentC"
#[6] "TreatmentDOM"           "TreatmentDOMNP"         "Time00d.TreatmentC"     "Time02d.TreatmentC"     "Time25d.TreatmentC"
#[11] "Time00d.TreatmentDOM"   "Time02d.TreatmentDOM"   "Time25d.TreatmentDOM"   "Time00d.TreatmentDOMNP" "Time02d.TreatmentDOMNP"
#[16] "Time25d.TreatmentDOMNP"

Can I test for the general effect of a treatment (DOM) in all time points like this?
res <- results(Phylum, contrast =list("TreatmentC","TreatmentDOM"))

Can I test for the effect of a treatment (DOM) in a specific time point (25d) like this?
res <- results(Phylum, contrast =list("Time25d.TreatmentC","Time25d.TreatmentDOM"))

And for the effect of time in an specific treatment(DOM) like this?
res <- results(Phylum, contrast =list("Time02d.TreatmentDOM","Time25d.TreatmentDOM"))

I am not sure that the time-series approach (if I am using it correctly) is adding too much info to my analysis, in fact, it decreases the amount of contrasts I can make. Am I right?

Thanks so much for your help and for all you patience

Sandra

I'd prefer to just give recommendations for how to investigate your biological questions of interest. There are many ways to code up and run an analysis.

Firstly, you ask "Why does it say just time 25d?". The answer to this is in the help page for results(). Help for R functions is easily accessible by typing ?results in the R console:

"For analyses using the likelihood ratio test (using nbinomLRT), the p-values are determined solely by the difference in deviance between the full and reduced model formula. A log2 fold change is included, which can be controlled using the name argument"

"My main questions for this dataset are:
Which is the effect of the different treatments in general (all time points included)?
Which is the effect of the different treatments at each different time?
Which is the difference between DOM and DOMNP in general and in each time point?
Which is the effect of time?"

The easiest way for you to investigate these various comparisons is using the likelihood ratio test. If you are not familiar with the purpose or mechanics of the likelihood ratio test, I highly recommend talking to a local statistician. We explain it briefly in the vignette and help files, but it's best to have a person who you can talk to face-to-face.

I thought of a way for you to code your variables so that you can be specific about which interaction terms you want, and so that you can include the time 0 samples which didn't have treatment.

Using the full dataset which includes time 0, recode the treatment variables like so (I'll use dds as the dataset object):

dds$DOM = factor(ifelse(dds$Treatment == "DOM", "t", "f"), levels=c("f","t"))
dds$DOMP = factor(ifelse(dds$Treatment == "DOMP", "t", "f"), levels=c("f","t"))

and do the same for the timepoints after time 0:

dds$day02 = factor(ifelse(dds$Time == "02d", "t", "f"), levels=c("f","t"))
dds$day25 = factor(ifelse(dds$Time == "25d", "t", "f"), levels=c("f","t"))

Now set the design as:

design(dds) = ~ DOM + DOMP + day02 + day25 + DOM:day25 + DOMP:day25

You can then estimate size factors and dispersions:

dds = estimateSizeFactors(dds)
dds = estimateDispersions(dds)

Now you can remove terms in order to investigate different questions. To find genes which change due to treatment of DOM or DOMP at time 25, remove the interaction terms for the treatments at day 25. Remember: interaction terms, for example DOM:day25, model an additional effect for DOM at day 25 which is different than the effect at day 2.

dds = nbinomLRT(dds, reduced = ~ DOM + DOMP + day02 + day25)
results(dds)

​If you want to find genes which change due to treatment of DOMP, remove the terms containing DOMP:

dds = nbinomLRT(dds, reduced= ~ DOM + day02 + day25 + DOM:day25)
​results(dds)

To find genes which change due to either DOM or DOMP treatment, remove all the terms with DOM or DOMP in them (including the interaction terms, e.g. DOM:day25). Similarly for genes which change due to time, remove the terms with day02 or day25 in them.

0
4.0 years ago by
Sweden
sandramg820 wrote:

Sorry Mike....I ma new on this, not sure how to mail to you...is there an application in the web? which is you complete address for external mail??