Question

DESEq2 paired-sample comparison

0

Entering edit mode

alifar76 • 0

@alifar76-8934

Last seen 9.1 years ago

United States

Hello,

So, this is yet another question about the model-matrix-is-not-full-rank-error. I have looked through many posts but have not been able to resolve this error. I have 6 samples in the dataset. The data-set is as follows:

               subject_grp               Disease          Time-point
Sample1            1                       Yes                B
Sample2            1                       Yes                T
Sample3            4                        No                B
Sample4            4                        No                T
Sample5            9                       Yes                B
Sample6            9                       Yes                T

I just want to check the impact of Disease variable on count data (of 16S amplicons) keeping in the mind that the 6 samples are in paired by subject_grp variable.

I don't care about the Time-point variable but if it helps in resolving the problem, please let me know.

Anyway, when I use the design as subject_grp + Disease, I get the dreaded matrix not full rank error. I have tried a bunch of different designs including A: DESeq2 paired multifactor test but so far have not been successful. I'd greatly appreciate if some help can be provided. Thanks a lot.

Ali

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.6.3              RcppArmadillo_0.5.200.1.0 Rcpp_0.11.6              
 [4] GenomicRanges_1.18.4      GenomeInfoDb_1.2.4        IRanges_2.0.1            
 [7] S4Vectors_0.4.0           BiocGenerics_0.12.1       plyr_1.8.3               
[10] koRpus_0.05-6             secondgenomeR_0.2.64      pander_0.5.2             
[13] knitr_1.10.5              ggplot2_1.0.1             phyloseq_1.10.0          

loaded via a namespace (and not attached):
 [1] nlme_3.1-121         bitops_1.0-6         matrixStats_0.14.2   RColorBrewer_1.1-2  
 [5] tools_3.2.1          vegan_2.3-0          rpart_4.1-10         KernSmooth_2.23-15  
 [9] Hmisc_3.16-0         DBI_0.3.1            mgcv_1.8-6           colorspace_1.2-6    
[13] permute_0.8-4        ade4_1.7-2           nnet_7.3-10          gridExtra_2.0.0     
[17] chron_2.3-47         sendmailR_1.2-1      Biobase_2.26.0       ggdendro_0.1-15     
[21] caTools_1.17.1       scales_0.2.5         checkmate_1.6.1      BatchJobs_1.6       
[25] genefilter_1.48.1    stringr_1.0.0        digest_0.6.8         foreign_0.8-65      
[29] XVector_0.6.0        base64enc_0.1-2      limma_3.22.4         RSQLite_1.0.0       
[33] BBmisc_1.9           BiocParallel_1.0.2   gtools_3.5.0         acepack_1.3-3.3     
[37] RCurl_1.95-4.7       magrittr_1.5         Formula_1.2-1        Matrix_1.2-2        
[41] munsell_0.4.2        ape_3.3              proto_0.3-10         stringi_0.5-5       
[45] edgeR_3.8.5          MASS_7.3-43          RJSONIO_1.3-0        zlibbioc_1.12.0     
[49] gplots_2.17.0        fail_1.2             grid_3.2.1           gdata_2.17.0        
[53] crayon_1.3.1         lattice_0.20-33      Biostrings_2.34.1    splines_3.2.1       
[57] multtest_2.22.0      annotate_1.44.0      locfit_1.5-9.1       igraph_1.0.1        
[61] geneplotter_1.44.0   reshape2_1.4.1       codetools_0.2-14     XML_3.98-1.3        
[65] latticeExtra_0.6-26  biom_0.3.12          data.table_1.9.4     foreach_1.4.2       
[69] testthat_0.10.0      gtable_0.1.2         xtable_1.7-4         metagenomeSeq_1.10.0
[73] survival_2.38-3      pheatmap_1.0.7       iterators_1.0.7      AnnotationDbi_1.28.1
[77] memoise_0.2.1        cluster_2.0.2        brew_1.0-6

rnaseq deseq2 formulafullmodel • 2.5k views

ADD COMMENT • link 9.1 years ago alifar76 • 0

score 1 · Answer 1 · 2015-10-06

1

Entering edit mode

Michael Love 42k

@mikelove

Last seen 3 days ago

United States

Check out the vignette for DESeq2 version >= 1.8:

http://bioconductor.org/packages/release/bioc/html/DESeq2.html

We have a section "Model matrix not full rank" which talks about this error and this design specifically.

ADD COMMENT • link 9.1 years ago Michael Love 42k

score 0 · Answer 2 · 2015-10-06

0

Entering edit mode

alifar76 • 0

@alifar76-8934

Last seen 9.1 years ago

United States

Hi Michael,

Thank you so much for your answer and pointing towards the vignette for DESeq2. I have gone through the section of "Model matrix is not full rank."

If I have understood correctly, the design of my analysis is similar to the second example you have provided in the vignette (Page 34), which is as follows:

              batch               Condition          
##1            1                       A                
##2            1                       A                
##3            1                       B                
##4            1                       B                
##5            2                       C                
##6            2                       C

For comparison, my design is as follows:

               subject_grp               Disease          Time-point
Sample1            1                       Yes                B
Sample2            1                       Yes                T
Sample3            4                        No                B
Sample4            4                        No                T
Sample5            9                       Yes                B
Sample6            9                       Yes                T

In my case, the batch variable is equivalent to the subject_grp variable and the Disease variable is equivalent to Condition variable. Since the Disease is not balanced across the subject_grp i.e., subject_grp 1 should have both Yes and No, subject_grp 4 should have both Yes and No, so on, we should remove the subject_grp from model forumula (as your vignette suggests on Page 34). Am I correct?

Indeed, I have noticed that when the subject_grp variable is removed from the model formula, it runs fine and I do get some significant features out of the dataset. But my boss insists that since the samples are paired, according to the subject_grp variable, I must include it in the model formula.

Please let me know how this can be done, since I can't seem to find a solution for this in the vignette. Thanks again for all your help.

Best,

Ali

ADD COMMENT • link 9.1 years ago alifar76 • 0

1

Entering edit mode

If you wanted to compare across time point, even testing time-point differences within or across disease would be possible. That's because you would be comparing within subject.

What is not possible with a fixed effects model, is control for any subject effects and at the same time isolate the main disease effect. As an example, suppose that the 1 and 9 subjects have 2x the counts for a gene compared to the 4 subject. Is this because disease induces 2x fold change? Or is this something to do with subjects 1 and 9, that is not related to disease? There is no way to tell from this experiment which is the case. The way to say this in statistical jargon is that these two variables are confounded.

Note that, in the limma package, you can inform the model that samples 1 and 2, 3 and 4, etc. are expected to be correlated, using the duplicateCorrelation function.

ADD REPLY • link 9.1 years ago Michael Love 42k

0

Entering edit mode

Okay, perfect. I think I've got it. The Time-point variable is not of immediate interest to our analysis, so we are not that concerned about it.

So, just to sum up this discussion (and to make sure I've got this absolutely clear in my head), I've got two solutions if I have to use the original design proposed:

1) Remove subject_grp from the formula and run analysis with just ~ Disease, making the unrealistic assumption that there is no subject_grp effect (Page 34 of vignette).

2) Explore limma package (instead of DESeq2) to inform model about sample correlations.

Please let me know this is fine and thanks a lot again for all of your help so far. I really, really appreciate it.

Best,

Ali

ADD REPLY • link 9.1 years ago alifar76 • 0

1

Entering edit mode

That's correct

ADD REPLY • link 9.1 years ago Michael Love 42k

0

Entering edit mode

Awesome! Thanks, again!

ADD REPLY • link 9.1 years ago alifar76 • 0