Search
Question: DESeq2 design formula - how to account for donor and /or sample?
0
gravatar for C. Janssen
8 weeks ago by
C. Janssen0
C. Janssen0 wrote:

Hi everyone,

I would like to use DESeq2 to analyse RNAseq data from 2 different cell populations from 8 donors (positive vs negative T-cells).

Please note, I’m a beginner, so my apologies for any mistakes!

I searched through different posts and the DESeq2 vignette - and I think I'm on the right track, but I’d like to ask for some advice on my project, in particular the design formula.

 

My coldata includes a unique code for each sample, the donor number, the cell population and patient info:

 head.DataTable(samplesheet, n=16)
   Sample Donor Population RNAseqBatch     Study        Number   Gender  Birthdate
1      1A    D1        CD8           1     03/0174      166779      f    07-06-1989
2      2A    D2        CD8           1     03/0620      168739      f    14-11-1991
3      3A    D3        CD8           1     03/0293      167373      f    15-sep-87
4      4A    D4        CD8           1     01/0839      168505      f    18-okt-89
5      5A    D5        CD8           1     01/0202      166296      f    08-dec-88
6      6A    D6        CD8           1     03/0105      166322      f    18-okt-89
7      7A    D7        CD8           1     01/0894      168770      m    21-jan-90
8      8A    D8        CD8           1     01/0781      168366      f    08-okt-89
9      1B    D1        CD8pos        1     03/0174      166779      f    07-06-1989
10     2B    D2        CD8pos        1     03/0620      168739      f    14-11-1991
11     3B    D3        CD8pos        1     03/0293      167373      f    15-sep-87
12     4B    D4        CD8pos        1     01/0839      168505      f    18-okt-89
13     5B    D5        CD8pos        1     01/0202      166296      f    08-dec-88
14     6B    D6        CD8pos        1     03/0105      166322      f    18-okt-89
15     7B    D7        CD8pos        1     01/0894      168770      m    21-jan-90
16     8B    D8        CD8pos        1     01/0781      168366      f    08-okt-89

 

We want to detect the differentially expressed genes in CD8pos vs CD8.

So I made my DESeqDataSet:

counts <- read.delim("FG1703_All_mapped_reads2.txt", header = TRUE, row.names = 1)
samplesheet <- read.delim("S1samplesheetanalysis3.txt", header = TRUE, row.names = NULL)


ddsPD  <- DESeqDataSetFromMatrix(countData = counts, colData = samplesheet,
                              design = ~ Population + Donor )

My reasoning here is that I can now calculate the effect of cell population, accounting for the fact that each donor has sample pairing (2 or 3 cell populations per donor). But is this design correct this way? We get quite a lot of genes this way:

#Results 

ddsPD <- DESeq(ddsPD)

resPD <- results(ddsPD)

>head(resPD)

>resultsNames(ddsPD)

 [1] "Intercept"             "Population_CD8pos2_vs_CD8" "Population_CD8pos_vs_CD8"

[4] "Donor_D2_vs_D1"        "Donor_D3_vs_D1"        "Donor_D4_vs_D1"      

[7] "Donor_D5_vs_D1"        "Donor_D6_vs_D1"        "Donor_D7_vs_D1"      

[10] "Donor_D8_vs_D1"


# Then contrasting the cell populations

resPD2 <- results(ddsPD, contrast=c("Population", "CD8pos", "CD8"), alpha = 0.05, lfcThreshold=1)

table(resPD2$padj < 0.05)

>summary(resPD2)

out of 33246 with nonzero total read count

adjusted p-value < 0.05

LFC > 0 (up)     : 353, 1.1%

LFC < 0 (down)   : 1, 0.003%

#outliers [1]     : 0, 0%

#low counts [2]   : 14181, 43%

#(mean count < 7)


>resPD2

log2 fold change (MLE): Population CD8pos vs CD8

wald test p-value: Population CD8pos vs CD8

 

Is my design correct using:  formula(~ Population + Donor)? Do I need to include ‘Sample’ in the formula as well? Or is this not necessary? I’m still in doubt.

If yes, when I add ‘Sample’ to the formula, I get the ‘Model matrix not full rank’ error. Should I then use the ‘Model matrix not full rank’ method?

 

Thank you for any advice you can provide!

Kind regards,

C. Janssen

 

 

> sessionInfo()

R version 3.5.0 (2018-04-23)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

 

attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets

[8] methods   base    

 

other attached packages:

 [1] ggplot2_2.2.1               dplyr_0.7.4               

 [3] DESeq2_1.20.0               SummarizedExperiment_1.10.1

 [5] DelayedArray_0.6.0          BiocParallel_1.14.1       

 [7] matrixStats_0.53.1          Biobase_2.40.0            

 [9] GenomicRanges_1.32.2        GenomeInfoDb_1.16.0        

[11] IRanges_2.14.9              S4Vectors_0.18.1          

[13] BiocGenerics_0.26.0         BiocInstaller_1.30.0 

ADD COMMENTlink modified 8 weeks ago by swbarnes230 • written 8 weeks ago by C. Janssen0
1
gravatar for swbarnes2
8 weeks ago by
swbarnes230
swbarnes230 wrote:

Do not add "sample" to the design.  You can't add anything to the design if it's different in every single sample.

ADD COMMENTlink written 8 weeks ago by swbarnes230

Ah I didn't know that! Thanks for your response.

ADD REPLYlink written 8 weeks ago by C. Janssen0

For this I'd use a design of ~donor + population (it's typical in R/Bioconductor to put the variable of interest at the end of the design formula, although the models are mathematically equivalent). This means, each donor has a baseline level, and then there is a coefficient (an LFC) which will be estimated for the difference between populations, controlling for the donor baselines.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Michael Love18k

Thanks Michael! I'm glad to have confirmation of the design. 

ADD REPLYlink written 8 weeks ago by C. Janssen0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 274 users visited in the last hour