Question: Design formula and Design matrix in DESeq
1
4.1 years ago by
AurelieMLB40
United Kingdom
AurelieMLB40 wrote:

Hello,

I am new to RNAseq, new to DESeq2 and new to writing design formulas/matrix (not a statistician)...

I am trying to understand the meaning of the different examples I find on the internet and so far I am far from a clear understanding. Could someone help please? Is there a documentation for dummies that I could look. I have looked at DESeq2 and Limma documentation so far. I do get that the formula to use really depends on the question i want to answer but still not clear... Here are some of my questions:

1 - I have 27 samples. Two different treatments + control and 3 different times. I have 3 replicates for each. I would like to know what are the genes differentially expressed at each time for each treatment versus control (e.g. treated versus control at time 1, treated versus control at time 4 ...etc). I also would like to know what are the genes differentially expressed at each time for one treatment versus the other.

So far I have used: ~time*treatment

From DESeq2, I get the following column names for the design:

resultsNames(dds)  [1] "Intercept"                  "time1"  [3] "time2"                      "time3"  [5] "treatmentControl"              "treatment1"  [7] "treatment2"        "time1.treatmentControl"  [9] "time2.treatmentControl"        "time3.treatmentControl" [11] "time1.treatment1"  "time2.treatment1" [13] "time3.treatment1" "time1.treatment2" [15] "time2.treatment2"  "time3.treatment2"

In order to get more insight on what was happening I tried the same formula but using design<- model.matrix(~time*treatment,meta ) and the colnames I have are completely different, I just get:

colnames(design) [1] "(Intercept)"                "time2" [3] "time3"                     "treatment1" [5] "treatment2"        "time3:treatment1" [7] "time3:treatment1" "time2:treatment2" [9] "time3:treatment2"

I am missing every column related to control and time 1.

I am lost. What DESeq2 is doing in the background please? Is it changing the formula? Is it possible to access the design matrix used by DESeq2?

2 - In the Limma documentation (http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf, p413), there is an example for a 2 x 2 factorial design.

FileName   Strain   Treatment

File1          WT         U

File2    WT    S

File3   Mu     U

File4   Mu     S

File5   Mu     S

TS <- paste(targets$Strain, targets$Treatment, sep = ".")

TS <- factor(TS, levels = c("WT.U", "WT.S", "Mu.U", "Mu.S"))

design <- model.matrix(~0 + TS)

colnames(design) <- levels(TS)

fit <- lmFit(eset, design)

cont.matrix <- makeContrasts(

+ WT.SvsU = WT.S - WT.U,

+ Mu.SvsU = Mu.S - Mu.U,

+ Diff = (Mu.S - Mu.U) - (WT.S - WT.U),

+ levels = design)

 fit2 <- contrasts.

Is it possible to do the same thing in DESeq2 (i.e. to merge the two factors in one)? It does seem more intuitive this way for the subsequent comparisons. But does it really gives the same results as ~strain * Treatment as they seems to explain? Could I do the same with my time*treatment experiment?

3 - Looking at this example for instance: Yet another DESeq2 nested design contrast matrix question

 sample stage region animal E1_1 E Region1 1 E1_2 E Region1 2 E1_3 E Region1 3 E2_1 E Region2 1 E2_2 E Region2 2 E2_3 E Region2 7 P1_1 P Region1 4 P1_2 P Region1 5 P1_3 P Region1 6 P2_1 P Region2 4 P2_2 P Region2 5 P2_3 P Region2 8

Darya tried design = ~ Animals + Stage + Region and design = ~ Animals + Stage + Region + Stage:Region

But Mickeal Love suggested

dds$animal.nested <- factor(c(1,2,3,1,2,4,1,2,3,1,2,4)) design =~ stage + stage:animal + stage:region What was wrong with Darya ones please? Why not use design=~stage *animal *region so that all possible comparisons are available and then only use the constrasts needed? How does the formula impact on what is calculated in DESeq2 before doing any contrast please? Thanks in advance for any help! rnaseq limma deseq2 formula • 12k views ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by AurelieMLB40 Answer: Design formula and Design matrix in DESeq 2 4.1 years ago by AurelieMLB40 United Kingdom AurelieMLB40 wrote: Hello, For those like me who needs to familiarise themselves with what formulas in R means, I found those links: I found this helpful. cheers ADD COMMENTlink written 4.1 years ago by AurelieMLB40 Answer: Design formula and Design matrix in DESeq 1 4.1 years ago by Michael Love24k United States Michael Love24k wrote: "I have 3 replicates for each. I would like to know what are the genes differentially expressed at each time for each treatment versus control (e.g. treated versus control at time 1, treated versus control at time 4 ...etc). I also would like to know what are the genes differentially expressed at each time for one treatment versus the other." see this answer I wrote for a similar question: A: DESEq2 comparison with mulitple cell types under 2 conditions ADD COMMENTlink written 4.1 years ago by Michael Love24k Thank you very much! ADD REPLYlink written 4.1 years ago by AurelieMLB40 Hello, Following what I understood from the example your were pointing at, I tried to group my time and treatment in a single factor "group". e.g.: sample treatment time group s1 treatment1 1 treatment1.1 s2 treatment2 1 treatment2.1 s3 control 1 control.1 ... (in triplicate and for different time points) My understanding from the Limma doc and from the discussion here, is that it would give me the same thing as ~ time*treatment. But it really does not: - with ~group the DESeq function converges with the default parameters. with ~time*treatment, it does not, I have this error:"4 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest." So I have use maxit=200.

-The LogFold changes are really different when I do the contrasts. It is nearly as if there is no shrinkage at all when I use ~time*treatment : the Log Fold change range is much wider (compared to ~group)

-I systematically find more differentially expressed genes with  ~ time*treatment.

-the adjusted p-values are well correlated but not exactly the same.

I am probably doing something horribly wrong here. Would you have any advice/ideas please?

Here is my code for both:

# Per group #---------------

meta$group<-as.factor(meta$group)

 # dds object dds<-DESeqDataSetFromMatrix (   countData = counts,   colData = meta,   design = ~group) 

dds <- DESeq(dds) # converge !! resultsNames(dds)

res=results(dds, contrast=c("group",treatment1.1,control.1),addMLE=TRUE)

# ~time*treatment #------------------ meta$treatment <- factor(meta$treatment, levels= c("control","treatment1","treatment2")) meta$time <- factor(meta$time, levels= c(1,2,3))

 # dds object dds<-DESeqDataSetFromMatrix (   countData = counts,   colData = meta,   design = formula(~time*treatment)) 

#dds <- DESeq(dds) #-->Warning: "4 rows did not converge in beta, labelled in mcols(object)\$betaConv. Use larger maxit argument with nbinomWaldTest" dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dds <- nbinomWaldTest(dds, maxit=200) resultsNames(dds)

comp<- list(c("treatmenttreatment1" ,"time1.treatmenttreatment1" ), c("treatmentcontrol", "time1.treatmentcontrol"))   res = results(dds, contrast=comp)

--> are my contrasts wrong for instance? What I am trying to learn is what are the genes that are differentially expressed at time 1 in the treated sample versus the control.

1

Just use the ~group design.

The shrinkage differences on designs with interaction terms is actually detailed in both our help files, vignette and in the Methods section of our paper, but I'd prefer if you follow the advice I directed you to earlier. This makes constructing contrasts easier.

Many thanks for your answer. I think you are touching the core of my question here. I did read your paper and the vignettes. I see which paragraph you are talking about. But for me both formulas, ~group and ~time*treatment (and the associated contrasts I am doing) seem to take the interactions into account... But here, it seems to me that you are saying that it is not the case and that the way they are treated are different ...? I really apologize if I am missing the obvious.

1

So then you know from those sections that main effects are not moderated in designs with interactions. In your contrast, you are adding the interaction effects (moderated) to main effects (not moderated) and that is why the resulting LFC are wider than with ~ group where all effects are moderated. So the inference is not exactly the same with the two different designs. And I recommend you use ~group. With betaPrior=FALSE, the inference for the two designs is identical.

Answer: Design formula and Design matrix in DESeq
1
4.1 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

"But does it really gives the same results as ~strain * Treatment as they seems to explain? Could I do the same with my time*treatment experiment?"

Yes, it really does give the same results as estimating an interaction. Yes, you could do the same with your experiment.

Do you know that you can use limma to analyse RNA-seq data directly? Here' s a short start-to-finish example:

Thanks a lot for this! I am trying both DESeq2 and Limma actually to see if I get similar results.

May I ask you how/where you learnt about formulas and designs? Is there something I can do to get a better understanding and be able to understand this by myself rather than having to ask questions on forum for each new case :)