Question

DESeq2 design formula - how to account for donor and /or sample?

0

Entering edit mode

C. Janssen • 0

@c-janssen-15831

Last seen 5.9 years ago

Hi everyone,

I would like to use DESeq2 to analyse RNAseq data from 2 different cell populations from 8 donors (positive vs negative T-cells).

Please note, I’m a beginner, so my apologies for any mistakes!

I searched through different posts and the DESeq2 vignette - and I think I'm on the right track, but I’d like to ask for some advice on my project, in particular the design formula.

My coldata includes a unique code for each sample, the donor number, the cell population and patient info:

 head.DataTable(samplesheet, n=16)
   Sample Donor Population RNAseqBatch     Study        Number   Gender  Birthdate
1      1A    D1        CD8           1     03/0174      166779      f    07-06-1989
2      2A    D2        CD8           1     03/0620      168739      f    14-11-1991
3      3A    D3        CD8           1     03/0293      167373      f    15-sep-87
4      4A    D4        CD8           1     01/0839      168505      f    18-okt-89
5      5A    D5        CD8           1     01/0202      166296      f    08-dec-88
6      6A    D6        CD8           1     03/0105      166322      f    18-okt-89
7      7A    D7        CD8           1     01/0894      168770      m    21-jan-90
8      8A    D8        CD8           1     01/0781      168366      f    08-okt-89

9      1B    D1        CD8pos        1     03/0174      166779      f    07-06-1989
10     2B    D2        CD8pos        1     03/0620      168739      f    14-11-1991
11     3B    D3        CD8pos        1     03/0293      167373      f    15-sep-87
12     4B    D4        CD8pos        1     01/0839      168505      f    18-okt-89
13     5B    D5        CD8pos        1     01/0202      166296      f    08-dec-88
14     6B    D6        CD8pos        1     03/0105      166322      f    18-okt-89
15     7B    D7        CD8pos        1     01/0894      168770      m    21-jan-90
16     8B    D8        CD8pos        1     01/0781      168366      f    08-okt-89

We want to detect the differentially expressed genes in CD8pos vs CD8.

So I made my DESeqDataSet:

counts <- read.delim("FG1703_All_mapped_reads2.txt", header = TRUE, row.names = 1)

samplesheet <- read.delim("S1samplesheetanalysis3.txt", header = TRUE, row.names = NULL)


ddsPD  <- DESeqDataSetFromMatrix(countData = counts, colData = samplesheet,
                              design = ~ Population + Donor )

My reasoning here is that I can now calculate the effect of cell population, accounting for the fact that each donor has sample pairing (2 or 3 cell populations per donor). But is this design correct this way? We get quite a lot of genes this way:

#Results 

ddsPD <- DESeq(ddsPD)

resPD <- results(ddsPD)

>head(resPD)

>resultsNames(ddsPD)

 [1] "Intercept"             "Population_CD8pos2_vs_CD8" "Population_CD8pos_vs_CD8"

[4] "Donor_D2_vs_D1"        "Donor_D3_vs_D1"        "Donor_D4_vs_D1"      

[7] "Donor_D5_vs_D1"        "Donor_D6_vs_D1"        "Donor_D7_vs_D1"      

[10] "Donor_D8_vs_D1"


# Then contrasting the cell populations

resPD2 <- results(ddsPD, contrast=c("Population", "CD8pos", "CD8"), alpha = 0.05, lfcThreshold=1)

table(resPD2$padj < 0.05)

>summary(resPD2)

out of 33246 with nonzero total read count

adjusted p-value < 0.05

LFC > 0 (up)     : 353, 1.1%

LFC < 0 (down)   : 1, 0.003%

#outliers [1]     : 0, 0%

#low counts [2]   : 14181, 43%

#(mean count < 7)


>resPD2

log2 fold change (MLE): Population CD8pos vs CD8

wald test p-value: Population CD8pos vs CD8

Is my design correct using: formula(~ Population + Donor)? Do I need to include ‘Sample’ in the formula as well? Or is this not necessary? I’m still in doubt.

If yes, when I add ‘Sample’ to the formula, I get the ‘Model matrix not full rank’ error. Should I then use the ‘Model matrix not full rank’ method?

Thank you for any advice you can provide!

Kind regards,

C. Janssen

> sessionInfo()

R version 3.5.0 (2018-04-23)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

attached base packages:

[1] parallel stats4 stats graphics grDevices utils datasets

[8] methods base

other attached packages:

[1] ggplot2_2.2.1 dplyr_0.7.4

[3] DESeq2_1.20.0 SummarizedExperiment_1.10.1

[5] DelayedArray_0.6.0 BiocParallel_1.14.1

[7] matrixStats_0.53.1 Biobase_2.40.0

[9] GenomicRanges_1.32.2 GenomeInfoDb_1.16.0

[11] IRanges_2.14.9 S4Vectors_0.18.1

[13] BiocGenerics_0.26.0 BiocInstaller_1.30.0

deseq2 multiple factor design rna-seq • 1.6k views

ADD COMMENT • link updated 5.9 years ago by swbarnes2 ★ 1.3k • written 5.9 years ago by C. Janssen • 0

score 1 · Answer 1 · 2018-05-16

1

Entering edit mode

swbarnes2 ★ 1.3k

@swbarnes2-14086

Last seen 3 hours ago

San Diego

Do not add "sample" to the design. You can't add anything to the design if it's different in every single sample.

ADD COMMENT • link 5.9 years ago swbarnes2 ★ 1.3k

0

Entering edit mode

Ah I didn't know that! Thanks for your response.

ADD REPLY • link 5.9 years ago C. Janssen • 0

0

Entering edit mode

For this I'd use a design of ~donor + population (it's typical in R/Bioconductor to put the variable of interest at the end of the design formula, although the models are mathematically equivalent). This means, each donor has a baseline level, and then there is a coefficient (an LFC) which will be estimated for the difference between populations, controlling for the donor baselines.