DESeq2 , model matrix with 4 factors, error: inv(): matrix appears to be singular
1
1
Entering edit mode
cdemel • 0
@cdemel-10707
Last seen 8.6 years ago

Hi,

I have a experimental setup with 2 different strains, 3 different time points, activation/non-activation and 2 replicates. I've been trying to design the model matrix and write the design formula for some time now, but I still get the "error: inv(): matrix appears to be singular" error.

Here is my experimental design matrix:

strain time replicates activation
AS    0          1      minus
AS  2.5          1      minus
AS  2.5          1       plus
AS  4.5          1      minus
AS  4.5          1       plus
AS    0          2      minus
AS  2.5          2      minus
AS  2.5          2       plus
AS  4.5          2      minus
AS  4.5          2       plus
WT    0          1      minus
WT  2.5          1      minus
WT  2.5          1       plus
WT  4.5          1      minus
WT  4.5          1       plus
WT    0          2      minus
WT  2.5          2      minus
WT  2.5          2       plus
WT  4.5          2      minus
WT  4.5          2       plus

 

my original idea for the model matrix was

mm <- model.matrix(~ strain + activation + replicates + time:activation + strain:activation + strain:time:activation, expDesign)

because I believe the strain, activation, and replicates could have effects themselves, but there are also interaction effects for activation with time and strain.

I know, the matrix is not full rank, because there are no samples for timepoint 0 with activation. So after reading the vignette section "Levels without samples" I thought I could drop the corresponding column in the model matrix and run DESeq2:

> w <- which(colSums(mm)==0)
> mm <- mm[,-w]
>   dds = DESeq(deseq.matrix, full=mm, betaPrior=FALSE)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
using supplied model matrix

error: inv(): matrix seems singular
Error: inv(): matrix seems singular
In addition: Warning message:
In fitNbinomGLMs(objectNZ[fitidx, , drop = FALSE], alpha_hat = alpha_hat[fitidx],  :
  114rows had non-positive estimates of variance for coefficients

But I still get the "matrix seems singular" error. Can someone tell me what I am doing wrong or how my design should look like?

Thanks in advance,

Carina

 

 

deseq2 multiple factor design model.matrix • 1.7k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 6 days ago
United States

Can you please post your sessionInfo() ?

Maybe we can take a step back. What is the biological question you want to answer with this data?

ADD COMMENT
0
Entering edit mode

Hi,

this im my sessionInfo:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.1 (Nitrogen)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] rtracklayer_1.32.0         doParallel_1.0.10
 [3] iterators_1.0.8            foreach_1.4.3
 [5] DESeq2_1.12.1              LSD_3.0
 [7] GenomicAlignments_1.8.0    Rsamtools_1.24.0
 [9] Biostrings_2.40.0          XVector_0.12.0
[11] SummarizedExperiment_1.2.1 Biobase_2.32.0
[13] GenomicRanges_1.24.0       GenomeInfoDb_1.8.1
[15] IRanges_2.6.0              S4Vectors_0.10.0
[17] BiocGenerics_0.18.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4.5        RColorBrewer_1.1-2   plyr_1.8.3
 [4] tools_3.3.0          bitops_1.0-6         zlibbioc_1.18.0
 [7] rpart_4.1-10         RSQLite_1.0.0        annotate_1.50.0
[10] nlme_3.1-128         gtable_0.2.0         lattice_0.20-33
[13] mgcv_1.8-12          Matrix_1.2-6         DBI_0.4-1
[16] gridExtra_2.2.1      genefilter_1.54.1    cluster_2.0.4
[19] locfit_1.5-9.1       grid_3.3.0           nnet_7.3-12
[22] data.table_1.9.6     AnnotationDbi_1.34.1 XML_3.98-1.4
[25] survival_2.39-4      BiocParallel_1.6.1   foreign_0.8-66
[28] latticeExtra_0.6-28  Formula_1.2-1        geneplotter_1.50.0
[31] ggplot2_2.1.0        codetools_0.2-14     Hmisc_3.17-4
[34] scales_0.4.0         splines_3.3.0        xtable_1.8-2
[37] colorspace_1.2-6     acepack_1.3-3.3      RCurl_1.95-4.8
[40] munsell_0.4.3        chron_2.3-47

 

I am interested if there is a significant difference in WT and AS strains (without activation, so at 0h time point), but also in the differentially expressed genes upon activation over time in each of the strains.

 

ADD REPLY
0
Entering edit mode

This is a complex experimental design because it involves second-order interaction terms and a number of implicit assumptions (no difference between plus and minus samples at time=0, such that the plus samples at time=0 were not generated). If it seems confusing and you aren't sure how to interpret things, you will need to find a local statistician who can help explain things. It really deserves a face-to-face meeting to go over the interpretations of terms here. Another complication is that you can't use fixed effects to account for the correlation within replicate within strain, and then simultaneously compare across strain, because these are collinear terms. You can account for replicate correlation with other software packages, such as the duplicateCorrelation function in limma, but DESeq2 does not have this kind of functionality. So my solution below doesn't account for correlation within replicate within strain.

You'll want to make sure "WT" is the reference level of strain (see vignette), "plus" is the reference level for activation, and that time is a factor --- otherwise you are assuming constant changes across time which for many genes could not be the case (both saturation or up-down/down-up expression can occur).

I would use a design of ~ time + strain + activation:time + activation:strain:time, but then you need to remove a number of terms, because of the peculiarities of your experimental design. I would keep the terms: (Intercept), time2.5, time4.5, strainAS, time2.5:activationplus, time4.5:activationplus, time2.5:strainAS:activationplus, time4.5:strainAS:activationplus. You can then provide the model matrix to the 'full' argument of DESeq() as you were before. This will be a full rank matrix.

The strainAS term is the difference AS vs WT without activation. You can pull this out with results and name="strainAS".

The DE genes upon activation for WT are time2.5:activationplus and time4.5:activationplus and again you would pull these out, e.g. for time=2.5, with name="time2.5:activationplus".

The DE genes upon activation for AS are the above terms plus the interaction of strain, time and activation. You would pull this out, e.g. for time=2.5, using results with contrast=list(c("time2.5:activationplus", "time2.5:strainAS:activationplus")).

Again, I'd recommend speaking to someone local who can help with further questions and writing up the interpretation if things are not clear.

ADD REPLY
0
Entering edit mode

Hi Michael,

thank you very much for you informative answer.

By now, I correlated replicate counts as they are very similar, there is very like no replicate effect, so I decided to leave out the replicate term. This also fits with your answer. I will try to make my design as you suggested.

Thanks again,

Best, Carina

ADD REPLY

Login before adding your answer.

Traffic: 1050 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6