Question

replicates and groups of samples: what's the right DESeq2 model?

0

Entering edit mode

Bob Thurman • 0

@bob-thurman-9316

Last seen 4.5 years ago

Seattle, WA, USA

Hi,

I’m having some difficulty with what would seem to be a pretty basic set-up. I have a panel of, say, 20 cell lines, c1, c2, …,c20, with two replicates for each cell line, so 40 samples total. The cell lines can be divided into groups based on different phenotypic properties and I’d like to do comparisons between groups. As an example, based on one phenotype the cell lines segregate into 4 groups, with c1-c5 in g1, c6-10 in g2, c11-15 in g3, c16-20 in g4. And I’d like to compare, say, g1 with g2, which would be 10 samples (2 replicates each of 5 cell lines) in each group; then compare g1 with g3, etc.

What is the best way to model this in DESeq2?

The approach I’ve been taking is to use all samples in the model and capture the replicate relationships using cell line labels,

cold <- data.frame(cells = c(“c1”, “c1”, “c2”, “c2”, …”c20”, “c20”))
dds <- DESeqDataSetFromMatrix(counts, cold, formula(~ cells))
DEOut <- DESeq(dds)

...and then use contrasts to group the cell lines for a given comparison. Since resultsNames(DEOut) gives elements like

“Intercept”
“cells_c2_vs_c1” 
“cells_c3_vs_c1” 
…

to compare g2 with g3, I’d do

results(DEOut, contrast = list(c(“cells_c6_vs_c1”, “cells_c7_vs_c1”,...“cells_c10_vs_c1”),
                               c(“cells_c11_vs_c1”, “cells_c12_vs_c1”,...“cells_c15_vs_c1”)))

Is there something not recommended here? The reasons I’m suspicious:

A particular comparison like the above turns up 7000+ regulated genes at an adjusted p-value threshold of 0.01, with extreme p-values and fold changes: returned adjusted p-values actually reach 0, and absolute log2FoldChanges go as high as 60. If I instead model by using group labels (g1, g2, g3, g4) for the samples (and therefore ignore the cell line replicate relationships), then comparing g2 with g3 results in fewer than 500 regulated genes, and absolute log2FoldChanges going as high as 25.
Spot-checking the regulated genes seems to indicate a dependence on the expression of the reference cell line c1 that I don’t want. That is, the genes that are up-regulated (log2FoldChange > 0) not only roughly have g2 average expression higher than g3 average expression (which is what I want), but even more apparent is that the average expression of the g2 cell lines is higher than the individual cell line expression for c1. Likewise, the trend for down-regulated genes (log2FoldChange < 0) is that the average expression of the g2 cell lines is lower than for c1.

Should I perhaps be using a model without an Intercept term?

Thanks in advance for the help!

Bob

sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DESeq2_1.16.1              SummarizedExperiment_1.6.3
 [3] DelayedArray_0.2.7         matrixStats_0.52.2        
 [5] Biobase_2.36.2             GenomicRanges_1.28.4      
 [7] GenomeInfoDb_1.12.2        IRanges_2.10.2            
 [9] S4Vectors_0.14.3           BiocGenerics_0.22.0

deseq2 rnaseq differential gene expression • 1.3k views

ADD COMMENT • link updated 6.5 years ago by Gavin Kelly ▴ 680 • written 6.5 years ago by Bob Thurman • 0

score 0 · Answer 1 · 2017-10-19

That formulation appears to be estimating (when you call DESeq) the replicate variability, but not the between-lines-within-phenotype variability. The contrast is then comparing the averages (across lines within the phenotype), but is ignoring the variability of the lines within the phenotype. It's an ingenious approach, but flawed because it ignores this important source of variability.

What I believe is the 'approved' approach is described in the DESeq2 vignette, under the section "Group-specific condition effects, individuals nested within groups" (but you don't have a 'condition'). You'd need to relabel each line with an index within the particular phenotype (so you'd have line A within group G1, line A within group G2... line J within group G1, lineJ within group G2 ), and a design of ~group + group:line_within_group (the 'pairing' you might worry about is arbitrary, and has no effect because of this design). When the groups were unbalanced in terms of number of lines, you're going to get un-estimatable coefficients. I don't think there's a way around that - you'd either sacrifice them or have an arbitrary scheme of pooling them with other lines (increasing the 'replicate' variability).

Alternatively you could use voom + limma, where you have the ability to have a random-effect due to cell-to-cell variation, by using the duplicateCorrelation function therein.