Question

DESeq2 multifactor experiment design

0

Entering edit mode

rclaire ▴ 20

@rclaire-7689

Last seen 7.3 years ago

United States

Hello,

I am using DESeq2 on R (version 3.2.2) for analysis of small RNA expression data. I have some questions about how to set-up the experiment to answer the questions I am interested in.

I have a total of 80 samples. This consists of 4 different stages (ages) of fish embryo. At each stage, there are 5 treatments - this includes a control, 2 anoxic exposure treatments, and 2 aerobic recovery exposure treatments. n=4 for each stage/treatment combination. Here is table of the data:

Stage	Treatment	(treatment description)
D2	A	Control
D2	A	Control
D2	A	Control
D2	A	Control
D2	B	Early anoxia
D2	B	Early anoxia
D2	B	Early anoxia
D2	B	Early anoxia
D2	C	Late anoxia
D2	C	Late anoxia
D2	C	Late anoxia
D2	C	Late anoxia
D2	D	Early recovery
D2	D	Early recovery
D2	D	Early recovery
D2	D	Early recovery
D2	E	Late recovery
D2	E	Late recovery
D2	E	Late recovery
D2	E	Late recovery
4DPD	A	Control
4DPD	A	Control
4DPD	A	Control
4DPD	A	Control
4DPD	B	Early anoxia
4DPD	B	Early anoxia
4DPD	B	Early anoxia
4DPD	B	Early anoxia
4DPD	C	Late anoxia
4DPD	C	Late anoxia
4DPD	C	Late anoxia
4DPD	C	Late anoxia
4DPD	D	Early recovery
4DPD	D	Early recovery
4DPD	D	Early recovery
4DPD	D	Early recovery
4DPD	E	Late recovery
4DPD	E	Late recovery
4DPD	E	Late recovery
4DPD	E	Late recovery
12DPD	A	Control
12DPD	A	Control
12DPD	A	Control
12DPD	A	Control
12DPD	B	Early anoxia
12DPD	B	Early anoxia
12DPD	B	Early anoxia
12DPD	B	Early anoxia
12DPD	C	Late anoxia
12DPD	C	Late anoxia
12DPD	C	Late anoxia
12DPD	C	Late anoxia
12DPD	D	Early recovery
12DPD	D	Early recovery
12DPD	D	Early recovery
12DPD	D	Early recovery
12DPD	E	Late recovery
12DPD	E	Late recovery
12DPD	E	Late recovery
12DPD	E	Late recovery
20DPD	A	Control
20DPD	A	Control
20DPD	A	Control
20DPD	A	Control
20DPD	B	Early anoxia
20DPD	B	Early anoxia
20DPD	B	Early anoxia
20DPD	B	Early anoxia
20DPD	C	Late anoxia
20DPD	C	Late anoxia
20DPD	C	Late anoxia
20DPD	C	Late anoxia
20DPD	D	Early recovery
20DPD	D	Early recovery
20DPD	D	Early recovery
20DPD	D	Early recovery
20DPD	E	Late recovery
20DPD	E	Late recovery
20DPD	E	Late recovery
20DPD	E	Late recovery

My goals are to (1) identify smRNAs (genes) differentially expressed in response to anoxia/recovery at each stage (i.e. within D2, within 4dpd...); (2) be able to compare expression of a small RNA in one stage to another stage (i.e. between D2 and 4dpd).

To do this, I want to enter all samples together so I can normalize and filter the samples together. This will keep them on the same scale for comparison. However, I only want to run comparisons and generate p-values within each stage, with LRT. Is this possible?

From what I've read (Desq2 vignette, other posts), it seems I need to perform some combination of LRT and interaction terms. I have come up with the following commands:

dds <- DESeqDataSetFromMatrix(countData=Anox_countData_sample,colData=colData,design = ~ stage+treatment+stage:treatment)

dds$group <- factor(paste0(dds$stage,dds$treatment))

design(dds)<-~group

dds <-DESeq(dds, test="LRT", reduced = ~1, full= ~ group)

resultsNames(dds) # note: I have not been able to get to this part, so I don't know what options will come up

results(dds, contrast=c("group","D2A","D2B","D2C","D2D","D2E")) #not sure about this part, see above

Here are my questions:

1. Is this the correct approach for my question?

2. My understanding is that contrasts only pulls out the log2FC. Is there some other command I am missing to call for the p-value within the specific stage?

3. On controls: each stage has its own control (untreated). When I used the word "control" in coldata, I receive the following error:

> dds <-DESeqDataSetFromMatrix(countData=Anox_countData_sample,colData=colData,design = ~ stage+treatment+stage:treatment)
it appears that the last variable in the design formula, 'treatment',
has a factor level, 'control', which is not the reference level. we recommend
to use factor(...,levels=...) or relevel() to set this as the reference level
before proceeding. for more information, please see the 'Note on factor levels'
in vignette('DESeq2').

I don't actually want control set as the reference level. I just want a p-value for LRT (like an ANOVA) at each stage. Is the best way to do this to just change the treatments to letters? A-E?

4. I tried this on a subset of my data and received the following error:

Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
every gene contains at least one zero, cannot compute log geometric means

I am not surprised that there is at least one zero in each gene, since this includes all stages and they are biologically distinct. How can I work with this?

5. I am starting with 10 GB of data, about 13 million sequences. If I run all 80 samples together, as I am suggesting, it seems appropriate to pre-filter. When should this be done?

deseq2 LRT multifactor contrast interactions • 5.7k views

ADD COMMENT • link 8.5 years ago rclaire ▴ 20

score 1 · Answer 1 · 2016-01-24

hi,

With respect to "differentially expressed in response to anoxia/recovery at each stage", can you be more specific? Of the 5 treatments within each stage, are you interested in any difference across these treatments, or in specific pair-wise comparisons, or something else? I can give a more concrete recommendation on what design to use, etc, if you can explain what comparison(s) you are interested in within each stage.

2) When you use the 'contrast' argument, this should generate a new p-value associated with that contrast by performing a Wald test, and this will be indicated in the first two lines printed at the top of the results table (whether the p-value is a LRT or Wald test p-value).

3) It doesn't matter so much which level is set as the reference level, you can make any inference afterward. The only reason for this warning is so that users who don't exactly specify the contrast they want to extract get the result they desire (often fold change of treatment/control), and not control/treatment.

4) You can use an alternative size factor estimator, see Wolfgang and my answers to a similar question here:

A: Deseq2 error estimating size factors -- zeros in count matrix

5) If you like, you can prefilter the dds object just to reduce running time. For example, you can require that the row sum be larger than 5 counts:

dds <- dds[ rowSums(counts(dds)) > 5, ]

These genes have such low counts that there is no power to detect differential expression.

score 0 · Answer 2 · 2016-01-24

0

Entering edit mode

rclaire ▴ 20

@rclaire-7689

Last seen 7.3 years ago

United States

Hi Michael,

Thanks for the quick and helpful reply.

As for the comparisons within each stage I am interested in any differences across the treatments, not pairwise comparisons. I basically want to run an ANOVA for each stage, but having filtered and normalized the whole data set (all stages) together first.

Just to clarify, for contrast, is it possible to set it to generate the LRT p-value? Since I am interested in any differences across the treatments. Or?

The rest of the answers make sense to me. Thank you!

ADD COMMENT • link 8.5 years ago rclaire ▴ 20

1

Entering edit mode

I think the easiest thing would be to test the LRT on subsets of the data. You can still normalize and filter the data all together though.

dds <- estimateSizeFactors(dds)
mnc <- rowMeans(counts(dds, normalized=TRUE))
dds <- dds[ mnc > threshold, ]

Within each stage, you have enough residual degrees of freedom to estimate the dispersions for each stage separately in my opinion (20 samples - 5 group means = 15 d.f.). This would look like:

dds.stage1 <- dds[ , dds$stage == "1" ]
dds.stage1 <- DESeq(dds.stage1, test="LRT", full=~treatment, reduced=~1)
res <- results(dds.stage1, independentFiltering=FALSE)

It is possible to perform LRT on the whole object, but this would require you to form the model matrix yourself outside of DESeq() and then manually remove columns associated with the treatment effects for a given stage that you want to test. The above approach is much simpler.

If you want to compare within a treatment across stage, build a new factor for the full dds object:

dds$group <- factor(paste0(dds$stage, dds$treatment))
dds.full <- DESeq(dds, design=~group)
resultsNames(dds)

Then use results() and the contrast argument to compare different levels of 'group':

results(dds, contrast=c("group","stage1treatmentX", "stage2treatmentX")

ADD REPLY • link 8.5 years ago Michael Love 42k

score 0 · Answer 3 · 2016-02-01

Hi Michael,

Thank you, that's been very helpful. I want to check on something about the filtering.

Since what I am doing is essentially pre-filtering, shouldn't I still keep the independentFiltering, instead of having it set to =FALSE in results? Or am I missing something?

Here's how I've filtered:

For my 80 samples combined, I have about 11 million rows (sequences) originally. I ran the following:

dds <- DESeqDataSetFromMatrix(countData = Anox_countData,colData=colData,design =   ~treatment)
dds <- estimateSizeFactors(dds)
rowSum <- rowSums(counts(dds, normalized=TRUE))
dds <- dds[ rowSum > 4 ]

I chose to filter on rowSum > 4 because I have so many unique stages/treatments each with 4 biological replicates. With this filter, if there is a scenario in which a sequence occurs 1 count (normalized) in each of 4 libraries of the same stage/treatment, but 0 times in any other libraries, it would still be kept.

This reduced the data set from ~11 million reads to 725,773 reads.

Given that information, do you think it is still appropriate to leave the "independentFiltering=FALSE" setting?

Thanks for your help!

Claire