Question: replicates and groups of samples: what's the right DESeq2 model?
gravatar for Bob Thurman
6 months ago by
Seattle, WA, USA
Bob Thurman0 wrote:


I’m having some difficulty with what would seem to be a pretty basic set-up.  I have a panel of, say, 20 cell lines, c1, c2, …,c20, with two replicates for each cell line, so 40 samples total.  The cell lines can be divided into groups based on different phenotypic properties and I’d like to do comparisons between groups.  As an example, based on one phenotype the cell lines segregate into 4 groups, with c1-c5 in g1, c6-10 in g2, c11-15 in g3, c16-20 in g4.  And I’d like to compare, say, g1 with g2, which would be 10 samples (2 replicates each of 5 cell lines) in each group; then compare g1 with g3, etc.

What is the best way to model this in DESeq2?

The approach I’ve been taking is to use all samples in the model and capture the replicate relationships using cell line labels,

cold <- data.frame(cells = c(“c1”, “c1”, “c2”, “c2”, …”c20”, “c20”))
dds <- DESeqDataSetFromMatrix(counts, cold, formula(~ cells))
DEOut <- DESeq(dds)

...and then use contrasts to group the cell lines for a given comparison.  Since resultsNames(DEOut) gives elements like


to compare g2 with g3, I’d do

results(DEOut, contrast = list(c(“cells_c6_vs_c1”, “cells_c7_vs_c1”,...“cells_c10_vs_c1”),
                               c(“cells_c11_vs_c1”, “cells_c12_vs_c1”,...“cells_c15_vs_c1”)))

Is there something not recommended here?  The reasons I’m suspicious:

  1. A particular comparison like the above turns up 7000+ regulated genes at an adjusted p-value threshold of 0.01, with extreme p-values and fold changes: returned adjusted p-values actually reach 0, and absolute log2FoldChanges go as high as 60.  If I instead model by using group labels (g1, g2, g3, g4) for the samples (and therefore ignore the cell line replicate relationships), then comparing g2 with g3 results in fewer than 500 regulated genes, and absolute log2FoldChanges going as high as 25.
  2. Spot-checking the regulated genes seems to indicate a dependence on the expression of the reference cell line c1 that I don’t want.  That is, the genes that are up-regulated (log2FoldChange > 0) not only roughly have g2 average expression higher than g3 average expression (which is what I want), but even more apparent is that the average expression of the g2 cell lines is higher than the individual cell line expression for c1.  Likewise, the trend for down-regulated genes (log2FoldChange < 0) is that the average expression of the g2 cell lines is lower than for c1.

Should I perhaps be using a model without an Intercept term?

Thanks in advance for the help!


R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DESeq2_1.16.1              SummarizedExperiment_1.6.3
 [3] DelayedArray_0.2.7         matrixStats_0.52.2        
 [5] Biobase_2.36.2             GenomicRanges_1.28.4      
 [7] GenomeInfoDb_1.12.2        IRanges_2.10.2            
 [9] S4Vectors_0.14.3           BiocGenerics_0.22.0       
ADD COMMENTlink modified 6 months ago by Gavin Kelly550 • written 6 months ago by Bob Thurman0
gravatar for Gavin Kelly
6 months ago by
Gavin Kelly550
United Kingdom / London / Francis Crick Institute
Gavin Kelly550 wrote:

That formulation appears to be estimating (when you call DESeq) the replicate variability, but not the between-lines-within-phenotype variability. The contrast is then comparing the averages (across lines within the phenotype), but is ignoring the variability of the lines within the phenotype.  It's an ingenious approach, but flawed because it ignores this important source of variability.

What I believe is the 'approved' approach is described in the DESeq2 vignette, under the section "Group-specific condition effects, individuals nested within groups" (but you don't have a 'condition').  You'd need to relabel each line with an index within the particular phenotype (so you'd have line A within group G1,  line A within group G2... line J within group G1, lineJ within group G2 ), and a design of ~group + group:line_within_group (the 'pairing' you might worry about is arbitrary, and has no effect because of this design).  When the groups were unbalanced in terms of number of lines, you're going to get un-estimatable coefficients.  I don't think there's a way around that - you'd either sacrifice them or have an arbitrary scheme of pooling them with other lines (increasing the 'replicate' variability).

Alternatively you could use voom + limma, where you have the ability to have a random-effect due to cell-to-cell variation, by using the duplicateCorrelation function therein.



ADD COMMENTlink modified 6 months ago • written 6 months ago by Gavin Kelly550

I'd second the recommendations here from Gavin.

To compare directly across groups and control for nested structure isn't possible with fixed effects models, and hence I often point users to limma's duplicateCorrelation() function which is designed for this.

If you were making within-cell-line comparisons (which it sounds like you're not), then it would be possible to account for the cell line differences and compare those within-cell-line effects across groups (and this is the section of the vignette Gavin mentions).

ADD REPLYlink written 6 months ago by Michael Love17k

Very helpful, Gavin and Michael; thank you!

ADD REPLYlink written 6 months ago by Bob Thurman0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 213 users visited in the last hour