Question

Not full rank design using Deseq2

1

Entering edit mode

Rainy Gorilla ▴ 30

@rainy-gorilla-5284

Last seen 7.0 years ago

Netherlands

Dear all,

I have count data from healthy and diseased patients. I have 6 healthy samples vs 14 samples. The diseased patient underwent different treatment types and had biomarkers measured. Because the healthy doner did not undergo any treatment and were not measured for biomarkers I have problems creating a design because of this.

I am interested in genes deferentially expressed between healthy and diseased but also looking at the treatment type effect and the biomarker signature type vs healthy. I also have a classification in disease types.

Should I try ignoreRank=TRUE or would it be best to create a new design without the healthy group and then define a new control group in diseased patients based on treatment and biomarkers? But then, how can I do the comparison with the healthy group afterwards?

> colDATA

      row.names    condition    type    treatment    biomarkers
01    Healthy1     Healthy    nonDiseased    non    non
02    Healthy2     Healthy    nonDiseased    non    non
03    Healthy3     Healthy    nonDiseased    non    non
04    Healthy4     Healthy    nonDiseased    non    non
05    Healthy5     Healthy    nonDiseased    non    non
06    Healthy6     Healthy    nonDiseased    non    non
07    Diseased1    Diseased    Affected      1      2
08    Diseased2    Diseased    Affected      2      2
09    Diseased3    Diseased    Affected      1      1
10    Diseased4    Diseased    Affected      3      2
11    Diseased5    Diseased    Affected      2      0
12    Diseased6    Diseased    Affected      1      2
13    Diseased7    Diseased    Affected      1      2
14    Diseased8    Diseased    nonAffected   1      1
15    Diseased9    Diseased    Affected      2      2
16    Diseased10   Diseased    nonAffected   1      1
17    Diseased11   Diseased11  Affected      3      1
18    Diseased12   Diseased12  Affected      1      2
19    Diseased13   Diseased13  nonAffected   2      1
20    Diseased14   Diseased14  nonAffected   1      2

If I try to create a design using only condition and type or try and other cobination, I get the following error:

dds <- DESeqDataSetFromMatrix(countData = countTABLE,
+                               colData = colDATA,
+                               design = ~ condition + type)
Error in DESeqDataSet(se, design = design, ignoreRank) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  one or more variables or interaction terms in the design formula
  are linear combinations of the others and must be removed

I think because type, treatment and biomarkers are nested within the diseased condition this will not work?

How can I make the correct design to get the comparisons?

Many thanks for your help.

Session info:

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] limma_3.22.1            RColorBrewer_1.1-2      ggplot2_1.0.0           gplots_2.16.0           Biobase_2.26.0          pasilla_0.5.1          
 [7] DESeq2_1.6.3            RcppArmadillo_0.4.600.0 Rcpp_0.11.3             GenomicRanges_1.18.4    GenomeInfoDb_1.2.4      IRanges_2.0.1          
[13] S4Vectors_0.4.0         BiocGenerics_0.12.1    

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.1 base64enc_0.1-2      BatchJobs_1.5        BBmisc_1.8           BiocParallel_1.0.0  
 [8] bitops_1.0-6         brew_1.0-6           caTools_1.17.1       checkmate_1.5.1      cluster_1.15.3       codetools_0.2-9      colorspace_1.2-4    
[15] DBI_0.3.1            digest_0.6.8         fail_1.2             foreach_1.4.2        foreign_0.8-62       Formula_1.1-2        gdata_2.13.3        
[22] genefilter_1.48.1    geneplotter_1.44.0   grid_3.1.2           gtable_0.1.2         gtools_3.4.1         Hmisc_3.14-6         iterators_1.0.7     
[29] KernSmooth_2.23-13   lattice_0.20-29      latticeExtra_0.6-26  locfit_1.5-9.1       MASS_7.3-35          munsell_0.4.2        nnet_7.3-8          
[36] plyr_1.8.1           proto_0.3-10         reshape2_1.4.1       rpart_4.1-8          RSQLite_1.0.0        scales_0.2.4         sendmailR_1.2-1     
[43] splines_3.1.2        stringr_0.6.2        survival_2.37-7      tools_3.1.2          XML_3.98-1.1         xtable_1.7-4         XVector_0.6.0

deseq2 design counts • 6.2k views

ADD COMMENT • link updated 10.9 years ago by Michael Love 43k • written 10.9 years ago by Rainy Gorilla ▴ 30

score 0 · Answer 1 · 2015-02-12

hi Varshna,

It's tricky to include all your variables in one design, especially given that biomarkers and treatment look somewhat collinear (they are often equal).

One thing I'd recommend to start is to perform EDA, by using the rlog or variance stabilizing transformation (see vignette) and then making a PCA plot coloring with different groups. This will give you some ideas about the large differences across samples.

If you want to compare diseased vs healthy, but also affected vs healthy, nonaffected vs healthy etc., I would use a design of ~ type. Then for the results, use the 'contrast' argument to define which two groups to compare. A comparison of disease vs healthy can be made by averaging the affected and nonaffected effect (see resultsNames(dds) for the proper names to use):

results(dds, contrast=list(c("typeaffected","typenonaffected"), "typenondiseased"), listValues=c(1/2, -1))

The 1/2 means, average over these two groups, and the -1 means, compare to non-diseased.

This will be similar (but not identical, because the above weighs each group equally) to the result if you just used ~ condition with two levels: disease vs healthy (because the type affected/nonaffected with larger sample size will have more influence on the disease effect).

(The ignoreRank argument won't help, this is only for DEXSeq developers. You would immediately get an error upon estimating dispersions.)

For examining the treatment effect, I suggest using alternate designs (e.g. ~ treatment). But I think the fact that treatment and biomarkers are equal for what looks like half of the diseased samples (called "confounding") means that it is difficult to say whether you are seeing a treatment difference or a biomarker difference.