Design formula and Design matrix in DESeq
3
2
Entering edit mode
AurelieMLB ▴ 50
@aureliemlb-6978
Last seen 9.6 years ago
United Kingdom

Hello,

I am new to RNAseq, new to DESeq2 and new to writing design formulas/matrix (not a statistician)...

I am trying to understand the meaning of the different examples I find on the internet and so far I am far from a clear understanding. Could someone help please? Is there a documentation for dummies that I could look. I have looked at DESeq2 and Limma documentation so far. I do get that the formula to use really depends on the question i want to answer but still not clear... Here are some of my questions:

1 - I have 27 samples. Two different treatments + control and 3 different times. I have 3 replicates for each. I would like to know what are the genes differentially expressed at each time for each treatment versus control (e.g. treated versus control at time 1, treated versus control at time 4 ...etc). I also would like to know what are the genes differentially expressed at each time for one treatment versus the other.

So far I have used: ~time*treatment

From DESeq2, I get the following column names for the design:

resultsNames(dds)
 [1] "Intercept"                  "time1"
 [3] "time2"                      "time3"
 [5] "treatmentControl"              "treatment1"
 [7] "treatment2"        "time1.treatmentControl"
 [9] "time2.treatmentControl"        "time3.treatmentControl"
[11] "time1.treatment1"  "time2.treatment1"
[13] "time3.treatment1" "time1.treatment2"
[15] "time2.treatment2"  "time3.treatment2"

In order to get more insight on what was happening I tried the same formula but using design<- model.matrix(~time*treatment,meta ) and the colnames I have are completely different, I just get:

colnames(design)
[1] "(Intercept)"                "time2"
[3] "time3"                     "treatment1"
[5] "treatment2"        "time3:treatment1"
[7] "time3:treatment1" "time2:treatment2"
[9] "time3:treatment2"

I am missing every column related to control and time 1.

I am lost. What DESeq2 is doing in the background please? Is it changing the formula? Is it possible to access the design matrix used by DESeq2?

 

2 - In the Limma documentation (http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf, p413), there is an example for a 2 x 2 factorial design.

FileName   Strain   Treatment

File1          WT         U

File2    WT    S

File3   Mu     U

File4   Mu     S

File5   Mu     S

TS <- paste(targets$Strain, targets$Treatment, sep = ".") 

TS <- factor(TS, levels = c("WT.U", "WT.S", "Mu.U", "Mu.S")) 

design <- model.matrix(~0 + TS)

colnames(design) <- levels(TS)

fit <- lmFit(eset, design)

cont.matrix <- makeContrasts(

+ WT.SvsU = WT.S - WT.U,

+ Mu.SvsU = Mu.S - Mu.U,

+ Diff = (Mu.S - Mu.U) - (WT.S - WT.U),

+ levels = design)

 fit2 <- contrasts.

 Is it possible to do the same thing in DESeq2 (i.e. to merge the two factors in one)? It does seem more intuitive this way for the subsequent comparisons. But does it really gives the same results as ~strain * Treatment as they seems to explain? Could I do the same with my time*treatment experiment?

3 - Looking at this example for instance: Yet another DESeq2 nested design contrast matrix question

sample

stage

region

animal

E1_1

E

Region1

1

E1_2

E

Region1

2

E1_3

E

Region1

3

E2_1

E

Region2

1

E2_2

E

Region2

2

E2_3

E

Region2

7

P1_1

P

Region1

4

P1_2

P

Region1

5

P1_3

P

Region1

6

P2_1

P

Region2

4

P2_2

P

Region2

5

P2_3

P

Region2

8

Darya tried design = ~ Animals + Stage + Region and design = ~ Animals + Stage + Region + Stage:Region

But Mickeal Love suggested

dds$animal.nested <- factor(c(1,2,3,1,2,4,1,2,3,1,2,4))

design =~ stage + stage:animal + stage:region

What was wrong with Darya ones please?

Why not use design=~stage *animal *region so that all possible comparisons are available and then only use the constrasts needed?

How does the formula impact on what is calculated in DESeq2 before doing any contrast please?

 

Thanks in advance for any help!

 

 

rnaseq deseq2 formula Limma • 26k views
ADD COMMENT
2
Entering edit mode
AurelieMLB ▴ 50
@aureliemlb-6978
Last seen 9.6 years ago
United Kingdom

Hello,

For those like me who needs to familiarise themselves with what formulas in R means, I found those links:

- http://stackoverflow.com/questions/13366755/what-does-the-r-formula-y1-mean

- http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models

I found this helpful.

cheers

ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 5 days ago
United States

"I have 3 replicates for each. I would like to know what are the genes differentially expressed at each time for each treatment versus control (e.g. treated versus control at time 1, treated versus control at time 4 ...etc). I also would like to know what are the genes differentially expressed at each time for one treatment versus the other."

see this answer I wrote for a similar question:

A: DESEq2 comparison with mulitple cell types under 2 conditions

ADD COMMENT
0
Entering edit mode

Thank you very much!

ADD REPLY
0
Entering edit mode

Hello,

Following what I understood from the example your were pointing at, I tried to group my time and treatment in a single factor "group". e.g.:

sample  treatment time         group
s1 treatment1 1 treatment1.1
s2 treatment2 1 treatment2.1
s3 control 1 control.1

...

(in triplicate and for different time points)
 

My understanding from the Limma doc and from the discussion here, is that it would give me the same thing as ~ time*treatment. But it really does not:

- with ~group the DESeq function converges with the default parameters. with ~time*treatment, it does not, I have this error:"4 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest." So I have use maxit=200.

-The LogFold changes are really different when I do the contrasts. It is nearly as if there is no shrinkage at all when I use ~time*treatment : the Log Fold change range is much wider (compared to ~group)

-I systematically find more differentially expressed genes with  ~ time*treatment.

-the adjusted p-values are well correlated but not exactly the same.

I am probably doing something horribly wrong here. Would you have any advice/ideas please?

Here is my code for both:

# Per group
#---------------

meta$group<-as.factor(meta$group)

 # dds object
dds<-DESeqDataSetFromMatrix (
  countData = counts,
  colData = meta,
  design = ~group) 

dds <- DESeq(dds) # converge !!
resultsNames(dds)

res=results(dds, contrast=c("group",treatment1.1,control.1),addMLE=TRUE)

# ~time*treatment
#------------------
meta$treatment <- factor(meta$treatment, levels= c("control","treatment1","treatment2"))
meta$time <- factor(meta$time, levels= c(1,2,3))

 # dds object
dds<-DESeqDataSetFromMatrix (
  countData = counts,
  colData = meta,
  design = formula(~time*treatment)) 

#dds <- DESeq(dds) #-->Warning: "4 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest"
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds, maxit=200)
resultsNames(dds)

comp<- list(c("treatmenttreatment1" ,"time1.treatmenttreatment1" ), c("treatmentcontrol", "time1.treatmentcontrol"))
 
res = results(dds, contrast=comp)

--> are my contrasts wrong for instance? What I am trying to learn is what are the genes that are differentially expressed at time 1 in the treated sample versus the control.

Many thanks in advance for any advice or comment !

 

 

ADD REPLY
1
Entering edit mode

Just use the ~group design.

The shrinkage differences on designs with interaction terms is actually detailed in both our help files, vignette and in the Methods section of our paper, but I'd prefer if you follow the advice I directed you to earlier. This makes constructing contrasts easier.

ADD REPLY
0
Entering edit mode

Many thanks for your answer. I think you are touching the core of my question here. I did read your paper and the vignettes. I see which paragraph you are talking about. But for me both formulas, ~group and ~time*treatment (and the associated contrasts I am doing) seem to take the interactions into account... But here, it seems to me that you are saying that it is not the case and that the way they are treated are different ...? I really apologize if I am missing the obvious.

ADD REPLY
1
Entering edit mode

So then you know from those sections that main effects are not moderated in designs with interactions. In your contrast, you are adding the interaction effects (moderated) to main effects (not moderated) and that is why the resulting LFC are wider than with ~ group where all effects are moderated. So the inference is not exactly the same with the two different designs. And I recommend you use ~group. With betaPrior=FALSE, the inference for the two designs is identical.

ADD REPLY
0
Entering edit mode

This is very helpful. Thank you very much for your time. 

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia

"But does it really gives the same results as ~strain * Treatment as they seems to explain? Could I do the same with my time*treatment experiment?"

Yes, it really does give the same results as estimating an interaction. Yes, you could do the same with your experiment.

Do you know that you can use limma to analyse RNA-seq data directly? Here' s a short start-to-finish example:

  http://bioinf.wehi.edu.au/RNAseqCaseStudy/

ADD COMMENT
0
Entering edit mode

Thanks a lot for this! I am trying both DESeq2 and Limma actually to see if I get similar results.

May I ask you how/where you learnt about formulas and designs? Is there something I can do to get a better understanding and be able to understand this by myself rather than having to ask questions on forum for each new case :)

ADD REPLY

Login before adding your answer.

Traffic: 571 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6