Question: DESeq2 multifactor experiment design
0
gravatar for rclaire
3.7 years ago by
rclaire0
United States
rclaire0 wrote:

Hello,

 

I am using DESeq2  on R (version 3.2.2) for analysis of small RNA expression data. I have some questions about how to set-up the experiment to answer the questions I am interested in. 

I have a total of 80 samples. This consists of 4 different stages (ages) of fish embryo. At each stage, there are 5 treatments - this includes a control, 2 anoxic exposure treatments, and 2 aerobic recovery exposure treatments. n=4 for each stage/treatment combination. Here is table of the data:

 

Stage

Treatment

(treatment description)

D2

A

Control

D2

A

Control

D2

A

Control

D2

A

Control

D2

B

Early anoxia

D2

B

Early anoxia

D2

B

Early anoxia

D2

B

Early anoxia

D2

C

Late anoxia

D2

C

Late anoxia

D2

C

Late anoxia

D2

C

Late anoxia

D2

D

Early recovery

D2

D

Early recovery

D2

D

Early recovery

D2

D

Early recovery

D2

E

Late recovery

D2

E

Late recovery

D2

E

Late recovery

D2

E

Late recovery

4DPD

A

Control

4DPD

A

Control

4DPD

A

Control

4DPD

A

Control

4DPD

B

Early anoxia

4DPD

B

Early anoxia

4DPD

B

Early anoxia

4DPD

B

Early anoxia

4DPD

C

Late anoxia

4DPD

C

Late anoxia

4DPD

C

Late anoxia

4DPD

C

Late anoxia

4DPD

D

Early recovery

4DPD

D

Early recovery

4DPD

D

Early recovery

4DPD

D

Early recovery

4DPD

E

Late recovery

4DPD

E

Late recovery

4DPD

E

Late recovery

4DPD

E

Late recovery

12DPD

A

Control

12DPD

A

Control

12DPD

A

Control

12DPD

A

Control

12DPD

B

Early anoxia

12DPD

B

Early anoxia

12DPD

B

Early anoxia

12DPD

B

Early anoxia

12DPD

C

Late anoxia

12DPD

C

Late anoxia

12DPD

C

Late anoxia

12DPD

C

Late anoxia

12DPD

D

Early recovery

12DPD

D

Early recovery

12DPD

D

Early recovery

12DPD

D

Early recovery

12DPD

E

Late recovery

12DPD

E

Late recovery

12DPD

E

Late recovery

12DPD

E

Late recovery

20DPD

A

Control

20DPD

A

Control

20DPD

A

Control

20DPD

A

Control

20DPD

B

Early anoxia

20DPD

B

Early anoxia

20DPD

B

Early anoxia

20DPD

B

Early anoxia

20DPD

C

Late anoxia

20DPD

C

Late anoxia

20DPD

C

Late anoxia

20DPD

C

Late anoxia

20DPD

D

Early recovery

20DPD

D

Early recovery

20DPD

D

Early recovery

20DPD

D

Early recovery

20DPD

E

Late recovery

20DPD

E

Late recovery

20DPD

E

Late recovery

20DPD

E

Late recovery

 

My goals are to (1) identify smRNAs (genes) differentially expressed in response to anoxia/recovery at each stage (i.e. within D2, within 4dpd...); (2) be able to compare expression of a small RNA in one stage to another stage (i.e. between D2 and 4dpd). 

To do this, I want to enter all samples together so I can normalize and filter the samples together. This will keep them on the same scale for comparison. However, I only want to run comparisons and generate p-values within each stage, with LRT. Is this possible? 

From what I've read (Desq2 vignette, other posts), it seems I need to perform some combination of LRT and interaction terms. I have come up with the following commands:

dds <- DESeqDataSetFromMatrix(countData=Anox_countData_sample,colData=colData,design = ~ stage+treatment+stage:treatment)

dds$group <- factor(paste0(dds$stage,dds$treatment)) 

design(dds)<-~group

dds <-DESeq(dds, test="LRT", reduced = ~1, full= ~ group) 

resultsNames(dds) # note: I have not been able to get to this part, so I don't know what options will come up

results(dds, contrast=c("group","D2A","D2B","D2C","D2D","D2E")) #not sure about this part, see above

Here are my questions:

1. Is this the correct approach for my question?

2. My understanding is that contrasts only pulls out the log2FC. Is there some other command I am missing to call for the p-value within the specific stage? 

3. On controls: each stage has its own control (untreated). When I used the word "control" in coldata, I receive the following error:

> dds <-DESeqDataSetFromMatrix(countData=Anox_countData_sample,colData=colData,design = ~ stage+treatment+stage:treatment)
it appears that the last variable in the design formula, 'treatment',
  has a factor level, 'control', which is not the reference level. we recommend
  to use factor(...,levels=...) or relevel() to set this as the reference level
  before proceeding. for more information, please see the 'Note on factor levels'
  in vignette('DESeq2').

I don't actually want control set as the reference level. I just want a p-value for LRT (like an ANOVA) at each stage. Is the best way to do this to just change the treatments to letters? A-E?

4. I tried this on a subset of my data and received the following error:

Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  : 
  every gene contains at least one zero, cannot compute log geometric means

I am not surprised that there is at least one zero in each gene, since this includes all stages and they are biologically distinct. How can I work with this?

5. I am starting with 10 GB of data, about 13 million sequences. If I run all 80 samples together, as I am suggesting, it seems appropriate to pre-filter. When should this be done?

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by rclaire0
Answer: DESeq2 multifactor experiment design
0
gravatar for Michael Love
3.7 years ago by
Michael Love25k
United States
Michael Love25k wrote:

hi,

With respect to "differentially expressed in response to anoxia/recovery at each stage", can you be more specific? Of the 5 treatments within each stage, are you interested in any difference across these treatments, or in specific pair-wise comparisons, or something else? I can give a more concrete recommendation on what design to use, etc, if you can explain what comparison(s) you are interested in within each stage.

2) When you use the 'contrast' argument, this should generate a new p-value associated with that contrast by performing a Wald test, and this will be indicated in the first two lines printed at the top of the results table (whether the p-value is a LRT or Wald test p-value). 

3) It doesn't matter so much which level is set as the reference level, you can make any inference afterward. The only reason for this warning is so that users who don't exactly specify the contrast they want to extract get the result they desire (often fold change of treatment/control), and not control/treatment.

4) You can use an alternative size factor estimator, see Wolfgang and my answers to a similar question here: 

A: Deseq2 error estimating size factors -- zeros in count matrix

5) If you like, you can prefilter the dds object just to reduce running time. For example, you can require that the row sum be larger than 5 counts:

dds <- dds[ rowSums(counts(dds)) > 5, ]

These genes have such low counts that there is no power to detect differential expression.

ADD COMMENTlink written 3.7 years ago by Michael Love25k
Answer: DESeq2 multifactor experiment design
0
gravatar for rclaire
3.7 years ago by
rclaire0
United States
rclaire0 wrote:

Hi Michael, 

Thanks for the quick and helpful reply. 

As for the comparisons within each stage I am interested in any differences across the treatments, not pairwise comparisons. I basically want to run an ANOVA for each stage, but having filtered and normalized the whole data set (all stages) together first.

Just to clarify, for contrast, is it possible to set it to generate the LRT p-value? Since I am interested in any differences across the treatments. Or?

The rest of the answers make sense to me. Thank you!

 

ADD COMMENTlink written 3.7 years ago by rclaire0

I think the easiest thing would be to test the LRT on subsets of the data. You can still normalize and filter the data all together though.

dds <- estimateSizeFactors(dds)
mnc <- rowMeans(counts(dds, normalized=TRUE))
dds <- dds[ mnc > threshold, ]

Within each stage, you have enough residual degrees of freedom to estimate the dispersions for each stage separately in my opinion (20 samples - 5 group means = 15 d.f.). This would look like:

dds.stage1 <- dds[ , dds$stage == "1" ]
dds.stage1 <- DESeq(dds.stage1, test="LRT", full=~treatment, reduced=~1)
res <- results(dds.stage1, independentFiltering=FALSE)

It is possible to perform LRT on the whole object, but this would require you to form the model matrix yourself outside of DESeq() and then manually remove columns associated with the treatment effects for a given stage that you want to test. The above approach is much simpler.

If you want to compare within a treatment across stage, build a new factor for the full dds object:

dds$group <- factor(paste0(dds$stage, dds$treatment))
dds.full <- DESeq(dds, design=~group)
resultsNames(dds)

Then use results() and the contrast argument to compare different levels of 'group':

results(dds, contrast=c("group","stage1treatmentX", "stage2treatmentX")
ADD REPLYlink written 3.7 years ago by Michael Love25k
Answer: DESeq2 multifactor experiment design
0
gravatar for rclaire
3.7 years ago by
rclaire0
United States
rclaire0 wrote:

Hi Michael, 

Thank you, that's been very helpful. I want to check on something about the filtering. 

Since what I am doing is essentially pre-filtering, shouldn't I still keep the independentFiltering, instead of having it set to =FALSE in results? Or am I missing something?

Here's how I've filtered:

For my 80 samples combined, I have about 11 million rows (sequences) originally. I ran the following:

dds <- DESeqDataSetFromMatrix(countData = Anox_countData,colData=colData,design =   ~treatment)
dds <- estimateSizeFactors(dds)
rowSum <- rowSums(counts(dds, normalized=TRUE))
dds <- dds[ rowSum > 4 ]

I chose to filter on rowSum > 4 because I have so many unique stages/treatments each with 4 biological replicates. With this filter, if there is a scenario in which a sequence occurs 1 count (normalized) in each of 4 libraries of the same stage/treatment, but 0 times in any other libraries, it would still be kept.

This reduced the data set from ~11 million reads to 725,773 reads.  

Given that information, do you think it is still appropriate to leave the "independentFiltering=FALSE" setting? 

Thanks for your help!

Claire

 

 

ADD COMMENTlink written 3.7 years ago by rclaire0

You can turn on or off independent filtering for calls to results(), this is up to you. It is fine to use independent filtering after pre-filtering; if there is no extra benefit to filtering, then the automatic independent filtering will not raise the threshold at all. Independent filtering will pick a different threshold for each call to results(). The reason I suggested in code above to turn it off was from your initial question when you said "...so I can normalize and filter the samples together...". I thought that implied you wanted to set a single filter yourself, and have calls to results() not raise the threshold.

ADD REPLYlink written 3.7 years ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 200 users visited in the last hour