Question: DESeq2 design formula - how to account for donor and /or sample?
gravatar for C. Janssen
6 days ago by
C. Janssen0
C. Janssen0 wrote:

Hi everyone,

I would like to use DESeq2 to analyse RNAseq data from 2 different cell populations from 8 donors (positive vs negative T-cells).

Please note, I’m a beginner, so my apologies for any mistakes!

I searched through different posts and the DESeq2 vignette - and I think I'm on the right track, but I’d like to ask for some advice on my project, in particular the design formula.


My coldata includes a unique code for each sample, the donor number, the cell population and patient info:

 head.DataTable(samplesheet, n=16)
   Sample Donor Population RNAseqBatch     Study        Number   Gender  Birthdate
1      1A    D1        CD8           1     03/0174      166779      f    07-06-1989
2      2A    D2        CD8           1     03/0620      168739      f    14-11-1991
3      3A    D3        CD8           1     03/0293      167373      f    15-sep-87
4      4A    D4        CD8           1     01/0839      168505      f    18-okt-89
5      5A    D5        CD8           1     01/0202      166296      f    08-dec-88
6      6A    D6        CD8           1     03/0105      166322      f    18-okt-89
7      7A    D7        CD8           1     01/0894      168770      m    21-jan-90
8      8A    D8        CD8           1     01/0781      168366      f    08-okt-89
9      1B    D1        CD8pos        1     03/0174      166779      f    07-06-1989
10     2B    D2        CD8pos        1     03/0620      168739      f    14-11-1991
11     3B    D3        CD8pos        1     03/0293      167373      f    15-sep-87
12     4B    D4        CD8pos        1     01/0839      168505      f    18-okt-89
13     5B    D5        CD8pos        1     01/0202      166296      f    08-dec-88
14     6B    D6        CD8pos        1     03/0105      166322      f    18-okt-89
15     7B    D7        CD8pos        1     01/0894      168770      m    21-jan-90
16     8B    D8        CD8pos        1     01/0781      168366      f    08-okt-89


We want to detect the differentially expressed genes in CD8pos vs CD8.

So I made my DESeqDataSet:

counts <- read.delim("FG1703_All_mapped_reads2.txt", header = TRUE, row.names = 1)
samplesheet <- read.delim("S1samplesheetanalysis3.txt", header = TRUE, row.names = NULL)

ddsPD  <- DESeqDataSetFromMatrix(countData = counts, colData = samplesheet,
                              design = ~ Population + Donor )

My reasoning here is that I can now calculate the effect of cell population, accounting for the fact that each donor has sample pairing (2 or 3 cell populations per donor). But is this design correct this way? We get quite a lot of genes this way:


ddsPD <- DESeq(ddsPD)

resPD <- results(ddsPD)



 [1] "Intercept"             "Population_CD8pos2_vs_CD8" "Population_CD8pos_vs_CD8"

[4] "Donor_D2_vs_D1"        "Donor_D3_vs_D1"        "Donor_D4_vs_D1"      

[7] "Donor_D5_vs_D1"        "Donor_D6_vs_D1"        "Donor_D7_vs_D1"      

[10] "Donor_D8_vs_D1"

# Then contrasting the cell populations

resPD2 <- results(ddsPD, contrast=c("Population", "CD8pos", "CD8"), alpha = 0.05, lfcThreshold=1)

table(resPD2$padj < 0.05)


out of 33246 with nonzero total read count

adjusted p-value < 0.05

LFC > 0 (up)     : 353, 1.1%

LFC < 0 (down)   : 1, 0.003%

#outliers [1]     : 0, 0%

#low counts [2]   : 14181, 43%

#(mean count < 7)


log2 fold change (MLE): Population CD8pos vs CD8

wald test p-value: Population CD8pos vs CD8


Is my design correct using:  formula(~ Population + Donor)? Do I need to include ‘Sample’ in the formula as well? Or is this not necessary? I’m still in doubt.

If yes, when I add ‘Sample’ to the formula, I get the ‘Model matrix not full rank’ error. Should I then use the ‘Model matrix not full rank’ method?


Thank you for any advice you can provide!

Kind regards,

C. Janssen



> sessionInfo()

R version 3.5.0 (2018-04-23)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default


attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets

[8] methods   base    


other attached packages:

 [1] ggplot2_2.2.1               dplyr_0.7.4               

 [3] DESeq2_1.20.0               SummarizedExperiment_1.10.1

 [5] DelayedArray_0.6.0          BiocParallel_1.14.1       

 [7] matrixStats_0.53.1          Biobase_2.40.0            

 [9] GenomicRanges_1.32.2        GenomeInfoDb_1.16.0        

[11] IRanges_2.14.9              S4Vectors_0.18.1          

[13] BiocGenerics_0.26.0         BiocInstaller_1.30.0 

ADD COMMENTlink modified 6 days ago by swbarnes210 • written 6 days ago by C. Janssen0
gravatar for swbarnes2
6 days ago by
swbarnes210 wrote:

Do not add "sample" to the design.  You can't add anything to the design if it's different in every single sample.

ADD COMMENTlink written 6 days ago by swbarnes210

Ah I didn't know that! Thanks for your response.

ADD REPLYlink written 6 days ago by C. Janssen0

For this I'd use a design of ~donor + population (it's typical in R/Bioconductor to put the variable of interest at the end of the design formula, although the models are mathematically equivalent). This means, each donor has a baseline level, and then there is a coefficient (an LFC) which will be estimated for the difference between populations, controlling for the donor baselines.

ADD REPLYlink modified 5 days ago • written 5 days ago by Michael Love17k

Thanks Michael! I'm glad to have confirmation of the design. 

ADD REPLYlink written 4 days ago by C. Janssen0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 315 users visited in the last hour