Hello,

I am new to RNAseq, new to DESeq2 and new to writing design formulas/matrix (not a statistician)...

I am trying to understand the meaning of the different examples I find on the internet and so far I am far from a clear understanding. Could someone help please? Is there a documentation for dummies that I could look. I have looked at DESeq2 and Limma documentation so far. I do get that the formula to use really depends on the question i want to answer but still not clear... Here are some of my questions:

1 - I have 27 samples. Two different treatments + control and 3 different times. I have 3 replicates for each. I would like to know what are the genes differentially expressed at each time for each treatment versus control (e.g. treated versus control at time 1, treated versus control at time 4 ...etc). I also would like to know what are the genes differentially expressed at each time for one treatment versus the other.

So far I have used: ~time*treatment

From DESeq2, I get the following column names for the design:

`resultsNames(dds)`

[1] "Intercept" "time1"

[3] "time2" "time3"

[5] "treatmentControl" "treatment1"

[7] "treatment2" "time1.treatmentControl"

[9] "time2.treatmentControl" "time3.treatmentControl"

[11] "time1.treatment1" "time2.treatment1"

[13] "time3.treatment1" "time1.treatment2"

[15] "time2.treatment2" "time3.treatment2"

In order to get more insight on what was happening I tried the same formula but using` design<- model.matrix(~time*treatment,meta )`

and the colnames I have are completely different, I just get:

`colnames(design)`

[1] "(Intercept)" "time2"

[3] "time3" "treatment1"

[5] "treatment2" "time3:treatment1"

[7] "time3:treatment1" "time2:treatment2"

[9] "time3:treatment2"

I am missing every column related to control and time 1.

I am lost. What DESeq2 is doing in the background please? Is it changing the formula? Is it possible to access the design matrix used by DESeq2?

2 - In the Limma documentation (http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf, p413), there is an example for a 2 x 2 factorial design.

`FileName Strain Treatment`

`File1 WT U`

`File2 WT S`

`File3 Mu U`

`File4 Mu S`

`File5 Mu S`

`TS <- paste(targets$Strain, targets$Treatment, sep = ".") `

`TS <- factor(TS, levels = c("WT.U", "WT.S", "Mu.U", "Mu.S")) `

`design <- model.matrix(~0 + TS)`

`colnames(design) <- levels(TS)`

`fit <- lmFit(eset, design)`

`cont.matrix <- makeContrasts(`

`+ WT.SvsU = WT.S - WT.U,`

`+ Mu.SvsU = Mu.S - Mu.U,`

`+ Diff = (Mu.S - Mu.U) - (WT.S - WT.U),`

`+ levels = design)`

` fit2 <- contrasts.`

Is it possible to do the same thing in DESeq2 (i.e. to merge the two factors in one)? It does seem more intuitive this way for the subsequent comparisons. But does it really gives the same results as ~strain * Treatment as they seems to explain? Could I do the same with my time*treatment experiment?

3 - Looking at this example for instance: Yet another DESeq2 nested design contrast matrix question

sample |
stage |
region |
animal |

E1_1 |
E |
Region1 |
1 |

E1_2 |
E |
Region1 |
2 |

E1_3 |
E |
Region1 |
3 |

E2_1 |
E |
Region2 |
1 |

E2_2 |
E |
Region2 |
2 |

E2_3 |
E |
Region2 |
7 |

P1_1 |
P |
Region1 |
4 |

P1_2 |
P |
Region1 |
5 |

P1_3 |
P |
Region1 |
6 |

P2_1 |
P |
Region2 |
4 |

P2_2 |
P |
Region2 |
5 |

P2_3 |
P |
Region2 |
8 |

Darya tried `design = ~ Animals + Stage + Region and design = ~ Animals + Stage + Region + Stage:Region`

But Mickeal Love suggested

`dds$animal.nested <- factor(c(1,2,3,1,2,4,1,2,3,1,2,4))`

`design =~ stage + stage:animal + stage:region`

What was wrong with Darya ones please?

Why not use design=~stage *animal *region so that all possible comparisons are available and then only use the constrasts needed?

How does the formula impact on what is calculated in DESeq2 before doing any contrast please?

Thanks in advance for any help!

Thank you very much!

Hello,

Following what I understood from the example your were pointing at, I tried to group my time and treatment in a single factor "group". e.g.:

`sample treatment time group`

s1 treatment1 1 treatment1.1

s2 treatment2 1 treatment2.1

s3 control 1 control.1

`...`

(in triplicate and for different time points)

My understanding from the Limma doc and from the discussion here, is that it would give me the same thing as ~ time*treatment. But it really does not:

- with ~group the DESeq function converges with the default parameters. with ~time*treatment, it does not, I have this error:"

`4 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest." So I have use maxit=200.`

-The LogFold changes are really different when I do the contrasts. It is nearly as if there is no shrinkage at all when I use ~time*treatment : the Log Fold change range is much wider (compared to ~group)

-I systematically find more differentially expressed genes with ~ time*treatment.

-the adjusted p-values are well correlated but not exactly the same.

I am probably doing something horribly wrong here. Would you have any advice/ideas please?

Here is my code for both:

`# Per group`

#---------------

`meta$group<-as.factor(meta$group)`

`# dds object`

dds<-DESeqDataSetFromMatrix (

countData = counts,

colData = meta,

design = ~group)

`dds <- DESeq(dds) # converge !!`

resultsNames(dds)

`res=results(dds, contrast=c("group",treatment1.1,control.1),addMLE=TRUE)`

`# ~time*treatment`

#------------------

meta$treatment <- factor(meta$treatment, levels= c("control","treatment1","treatment2"))

meta$time <- factor(meta$time, levels= c(1,2,3))

`# dds object`

dds<-DESeqDataSetFromMatrix (

countData = counts,

colData = meta,

design = formula(~time*treatment))

`#dds <- DESeq(dds) #-->Warning: "4 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest"`

dds <- estimateSizeFactors(dds)

dds <- estimateDispersions(dds)

dds <- nbinomWaldTest(dds, maxit=200)

resultsNames(dds)

`comp<- list(c("treatmenttreatment1" ,"time1.treatmenttreatment1" ), c("treatmentcontrol", "time1.treatmentcontrol"))`

res = results(dds, contrast=comp)

--> are my contrasts wrong for instance? What I am trying to learn is what are the genes that are differentially expressed at time 1 in the treated sample versus the control.

Many thanks in advance for any advice or comment !

Just use the ~group design.

The shrinkage differences on designs with interaction terms is actually detailed in both our help files, vignette and in the Methods section of our paper, but I'd prefer if you follow the advice I directed you to earlier. This makes constructing contrasts easier.

Many thanks for your answer. I think you are touching the core of my question here. I did read your paper and the vignettes. I see which paragraph you are talking about. But for me both formulas, ~group and ~time*treatment (and the associated contrasts I am doing) seem to take the interactions into account... But here, it seems to me that you are saying that it is not the case and that the way they are treated are different ...? I really apologize if I am missing the obvious.

So then you know from those sections that main effects are not moderated in designs with interactions. In your contrast, you are adding the interaction effects (moderated) to main effects (not moderated) and that is why the resulting LFC are wider than with ~ group where all effects are moderated. So the inference is not exactly the same with the two different designs. And I recommend you use ~group. With betaPrior=FALSE, the inference for the two designs is identical.

This is very helpful. Thank you very much for your time.