Question

DESeq2 , model matrix with 4 factors, error: inv(): matrix appears to be singular

1

Entering edit mode

cdemel • 0

@cdemel-10707

Last seen 8.6 years ago

Hi,

I have a experimental setup with 2 different strains, 3 different time points, activation/non-activation and 2 replicates. I've been trying to design the model matrix and write the design formula for some time now, but I still get the "error: inv(): matrix appears to be singular" error.

Here is my experimental design matrix:

strain time replicates activation
AS    0          1      minus
AS  2.5          1      minus
AS  2.5          1       plus
AS  4.5          1      minus
AS  4.5          1       plus
AS    0          2      minus
AS  2.5          2      minus
AS  2.5          2       plus
AS  4.5          2      minus
AS  4.5          2       plus
WT    0          1      minus
WT  2.5          1      minus
WT  2.5          1       plus
WT  4.5          1      minus
WT  4.5          1       plus
WT    0          2      minus
WT  2.5          2      minus
WT  2.5          2       plus
WT  4.5          2      minus
WT  4.5          2       plus

my original idea for the model matrix was

mm <- model.matrix(~ strain + activation + replicates + time:activation + strain:activation + strain:time:activation, expDesign)

because I believe the strain, activation, and replicates could have effects themselves, but there are also interaction effects for activation with time and strain.

I know, the matrix is not full rank, because there are no samples for timepoint 0 with activation. So after reading the vignette section "Levels without samples" I thought I could drop the corresponding column in the model matrix and run DESeq2:

> w <- which(colSums(mm)==0)
> mm <- mm[,-w]
>   dds = DESeq(deseq.matrix, full=mm, betaPrior=FALSE)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
using supplied model matrix

error: inv(): matrix seems singular
Error: inv(): matrix seems singular
In addition: Warning message:
In fitNbinomGLMs(objectNZ[fitidx, , drop = FALSE], alpha_hat = alpha_hat[fitidx],  :
  114rows had non-positive estimates of variance for coefficients

But I still get the "matrix seems singular" error. Can someone tell me what I am doing wrong or how my design should look like?

Thanks in advance,

Carina

deseq2 multiple factor design model.matrix • 1.7k views

ADD COMMENT • link updated 8.6 years ago by Michael Love 43k • written 8.6 years ago by cdemel • 0

score 0 · Answer 1 · 2016-05-16

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 6 days ago

United States

Can you please post your sessionInfo() ?

Maybe we can take a step back. What is the biological question you want to answer with this data?

ADD COMMENT • link 8.6 years ago Michael Love 43k

0

Entering edit mode

Hi,

this im my sessionInfo:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.1 (Nitrogen)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] rtracklayer_1.32.0         doParallel_1.0.10
 [3] iterators_1.0.8            foreach_1.4.3
 [5] DESeq2_1.12.1              LSD_3.0
 [7] GenomicAlignments_1.8.0    Rsamtools_1.24.0
 [9] Biostrings_2.40.0          XVector_0.12.0
[11] SummarizedExperiment_1.2.1 Biobase_2.32.0
[13] GenomicRanges_1.24.0       GenomeInfoDb_1.8.1
[15] IRanges_2.6.0              S4Vectors_0.10.0
[17] BiocGenerics_0.18.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4.5        RColorBrewer_1.1-2   plyr_1.8.3
 [4] tools_3.3.0          bitops_1.0-6         zlibbioc_1.18.0
 [7] rpart_4.1-10         RSQLite_1.0.0        annotate_1.50.0
[10] nlme_3.1-128         gtable_0.2.0         lattice_0.20-33
[13] mgcv_1.8-12          Matrix_1.2-6         DBI_0.4-1
[16] gridExtra_2.2.1      genefilter_1.54.1    cluster_2.0.4
[19] locfit_1.5-9.1       grid_3.3.0           nnet_7.3-12
[22] data.table_1.9.6     AnnotationDbi_1.34.1 XML_3.98-1.4
[25] survival_2.39-4      BiocParallel_1.6.1   foreign_0.8-66
[28] latticeExtra_0.6-28  Formula_1.2-1        geneplotter_1.50.0
[31] ggplot2_2.1.0        codetools_0.2-14     Hmisc_3.17-4
[34] scales_0.4.0         splines_3.3.0        xtable_1.8-2
[37] colorspace_1.2-6     acepack_1.3-3.3      RCurl_1.95-4.8
[40] munsell_0.4.3        chron_2.3-47

I am interested if there is a significant difference in WT and AS strains (without activation, so at 0h time point), but also in the differentially expressed genes upon activation over time in each of the strains.

ADD REPLY • link 8.6 years ago cdemel • 0

0

Entering edit mode

This is a complex experimental design because it involves second-order interaction terms and a number of implicit assumptions (no difference between plus and minus samples at time=0, such that the plus samples at time=0 were not generated). If it seems confusing and you aren't sure how to interpret things, you will need to find a local statistician who can help explain things. It really deserves a face-to-face meeting to go over the interpretations of terms here. Another complication is that you can't use fixed effects to account for the correlation within replicate within strain, and then simultaneously compare across strain, because these are collinear terms. You can account for replicate correlation with other software packages, such as the duplicateCorrelation function in limma, but DESeq2 does not have this kind of functionality. So my solution below doesn't account for correlation within replicate within strain.

You'll want to make sure "WT" is the reference level of strain (see vignette), "plus" is the reference level for activation, and that time is a factor --- otherwise you are assuming constant changes across time which for many genes could not be the case (both saturation or up-down/down-up expression can occur).

I would use a design of ~ time + strain + activation:time + activation:strain:time, but then you need to remove a number of terms, because of the peculiarities of your experimental design. I would keep the terms: (Intercept), time2.5, time4.5, strainAS, time2.5:activationplus, time4.5:activationplus, time2.5:strainAS:activationplus, time4.5:strainAS:activationplus. You can then provide the model matrix to the 'full' argument of DESeq() as you were before. This will be a full rank matrix.

The strainAS term is the difference AS vs WT without activation. You can pull this out with results and name="strainAS".

The DE genes upon activation for WT are time2.5:activationplus and time4.5:activationplus and again you would pull these out, e.g. for time=2.5, with name="time2.5:activationplus".

The DE genes upon activation for AS are the above terms plus the interaction of strain, time and activation. You would pull this out, e.g. for time=2.5, using results with contrast=list(c("time2.5:activationplus", "time2.5:strainAS:activationplus")).

Again, I'd recommend speaking to someone local who can help with further questions and writing up the interpretation if things are not clear.

ADD REPLY • link 8.5 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

thank you very much for you informative answer.

By now, I correlated replicate counts as they are very similar, there is very like no replicate effect, so I decided to leave out the replicate term. This also fits with your answer. I will try to make my design as you suggested.

Thanks again,

Best, Carina

ADD REPLY • link 8.5 years ago cdemel • 0